* [PATCH 0/7]: mm: memcontrol: switch to rstat
From: Johannes Weiner @ 2021-02-02 18:47 UTC
To: Andrew Morton, Tejun Heo
Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used
by the CPU controller and the IO controller. This change is motivated
by recent accuracy problems in memcg's custom stats code, as well as
the benefits of sharing common infra with other controllers.

The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which
are then propagated upward in batches when a threshold (32 pages) is
exceeded. This is cheap, but the error introduced by the lazy upward
propagation adds up: 32 pages times CPUs times cgroups in the subtree.
We've had complaints from service owners that the stats do not
reliably track and react to allocation behavior as expected, sometimes
swallowing the results of entire test applications.

The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked
in per-cpu counters, and a memory.stat read would iterate the entire
subtree and sum those counters up. This didn't keep up with the times:

- Cgroup trees are much bigger now. We switched to lazily-freed
  cgroups, where deleted groups would hang around until their
  remaining page cache has been reclaimed. This can result in large
  subtrees that are expensive to walk, while most of the groups are
  idle and their statistics don't change much anymore.

- Automated monitoring increased. With the proliferation of userspace
  oom killing, proactive reclaim, and higher-resolution logging of
  workload trends in general, top-level stat files are polled at least
  once a second in many deployments.

- The lifetime of cgroups got shorter. Where most cgroup setups in the
  past would have a few large policy-oriented cgroups for everything
  running on the system, newer cgroup deployments tend to create one
  group per application - which gets deleted again as the processes
  exit. An aggregation scheme that doesn't retain child data inside
  the parents loses event history of the subtree.

Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a
subtree that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.

---

A kernel build test confirms that the cost is comparable. Two kernels
are built simultaneously in a nested tree with several idle siblings:

root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                             `- build-b (defconfig, make -j16)
                                             `- idle-1
                                             `- ...
                                             `- idle-9

During the builds, kernelbuild/memory.stat is read once a second.

A perf diff shows that the change in cycle distribution is minimal.
Top 10 kernel symbols:

     0.09%     +0.08%  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     0.00%     +0.06%  [kernel.kallsyms]  [k] cgroup_rstat_updated
     0.08%     -0.05%  [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     0.16%     -0.04%  [kernel.kallsyms]  [k] release_pages
     0.00%     +0.03%  [kernel.kallsyms]  [k] __count_memcg_events
     0.01%     +0.03%  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
     0.10%     -0.02%  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
     0.05%     -0.02%  [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.57%     +0.01%  [kernel.kallsyms]  [k] asm_exc_page_fault

---

And of course, the on-demand aggregated stats are now fully accurate
again:

$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
  grep -e inactive_file /sys/fs/cgroup/memory.stat

vanilla:                              patched:
nr_inactive_file 1574105088           nr_inactive_file 1027801088
   inactive_file 1577410560              inactive_file 1027801088

---

 block/blk-cgroup.c         |  14 +-
 include/linux/memcontrol.h | 119 ++++++----------
 kernel/cgroup/cgroup.c     |  34 +++--
 kernel/cgroup/rstat.c      |  62 +++++----
 mm/memcontrol.c            | 320 +++++++++++++++++++++----------------------
 5 files changed, 266 insertions(+), 283 deletions(-)

Based on v5.11-rc5-mm1.
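
The read-side aggregation described in the cover letter can be sketched
as a toy userspace model in C. This is an illustration under stated
assumptions, not the kernel's rstat code: the names struct group,
stat_mod(), flush_cpu() and stat_read() are hypothetical, per-cpu data
is modeled as plain arrays, and a simple "updated" flag per group and
CPU stands in for whatever bookkeeping rstat actually uses. The point
is only that the write side marks the path to the root, and the read
side skips every subtree that has no mark.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the model */

/* Hypothetical stand-in for a cgroup with a single statistic. */
struct group {
	struct group *parent, *first_child, *next_sibling;
	long percpu_delta[NR_CPUS];	/* changes not yet aggregated */
	bool updated[NR_CPUS];		/* pending changes here or below? */
	long total;			/* subtree total as of last flush */
};

void group_add_child(struct group *parent, struct group *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/* Write side: a per-cpu add, plus marking the path towards the root. */
void stat_mod(struct group *g, int cpu, long val)
{
	g->percpu_delta[cpu] += val;
	/* Stop as soon as an ancestor is already marked for this CPU. */
	for (; g && !g->updated[cpu]; g = g->parent)
		g->updated[cpu] = true;
}

/* Fold one CPU's pending changes in g's subtree; returns the delta. */
long flush_cpu(struct group *g, int cpu, int *visited)
{
	long delta;

	if (!g->updated[cpu])
		return 0;	/* idle subtree: skipped without a walk */
	g->updated[cpu] = false;
	(*visited)++;

	delta = g->percpu_delta[cpu];
	g->percpu_delta[cpu] = 0;
	for (struct group *c = g->first_child; c; c = c->next_sibling)
		delta += flush_cpu(c, cpu, visited);
	g->total += delta;
	return delta;
}

/* Read side: aggregate on demand, then report the subtree total. */
long stat_read(struct group *g, int *visited)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		long delta = flush_cpu(g, cpu, visited);

		/* Hand the flushed delta to the ancestors as well. */
		for (struct group *p = g->parent; p; p = p->parent)
			p->total += delta;
	}
	return g->total;
}
```

With one busy child and several idle siblings under a root group, a
read of the root in this model visits only the root and the busy child
for each CPU that saw updates; the idle siblings are never touched, and
a second read with no changes in between visits nothing.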
* [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
From: Johannes Weiner @ 2021-02-02 18:47 UTC
To: Andrew Morton, Tejun Heo
Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

The memcg hotunplug callback erroneously flushes counts on the local
CPU, not the counts of the CPU going away; those counts will be lost.

Flush the CPU that is actually going away.

Also simplify the code a bit by using mod_memcg_state() and
count_memcg_events() instead of open-coding the upward flush - this is
comparable to how vmstat.c handles hotunplug flushing.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..8120d565dd79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg, *mi;
+	struct mem_cgroup *memcg;
 
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
+		struct memcg_vmstats_percpu *statc;
 		int i;
 
+		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
 		for (i = 0; i < MEMCG_NR_STAT; i++) {
 			int nid;
-			long x;
 
-			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmstats[i]);
+			if (statc->stat[i]) {
+				mod_memcg_state(memcg, i, statc->stat[i]);
+				statc->stat[i] = 0;
+			}
 
 			if (i >= NR_VM_NODE_STAT_ITEMS)
 				continue;
 
 			for_each_node(nid) {
+				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
+				long x;
 
 				pn = mem_cgroup_nodeinfo(memcg, nid);
-				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
-				if (x)
+				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
+
+				x = lstatc->count[i];
+				lstatc->count[i] = 0;
+
+				if (x) {
 					do {
 						atomic_long_add(x, &pn->lruvec_stat[i]);
 					} while ((pn = parent_nodeinfo(pn, nid)));
+				}
 			}
 		}
 
 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			long x;
-
-			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmevents[i]);
+			if (statc->events[i]) {
+				count_memcg_events(memcg, i, statc->events[i]);
+				statc->events[i] = 0;
+			}
 		}
 	}
 
-- 
2.30.0
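
The class of bug fixed by this patch can be illustrated with a small
userspace model in C. It is only a sketch: struct counter and both
flush functions are hypothetical names, and plain array indexing stands
in for the kernel's per-cpu accessors. The buggy variant drains the
slot of whichever CPU happens to run the callback, so the departing
CPU's pending count is never folded into the global total.

```c
#define NR_CPUS 4	/* assumption: small fixed CPU count for the model */

/* Hypothetical counter with per-cpu batching. */
struct counter {
	long percpu[NR_CPUS];	/* unflushed per-cpu deltas */
	long global;		/* flushed total */
};

/* Buggy variant: drains the slot of the CPU running the callback. */
void flush_dead_cpu_buggy(struct counter *c, int dead_cpu, int running_cpu)
{
	(void)dead_cpu;		/* ignored, which is the bug */
	c->global += c->percpu[running_cpu];
	c->percpu[running_cpu] = 0;
}

/* Fixed variant: drains the slot of the CPU that is going away. */
void flush_dead_cpu_fixed(struct counter *c, int dead_cpu)
{
	c->global += c->percpu[dead_cpu];
	c->percpu[dead_cpu] = 0;
}
```

In this model, if CPU 2 goes away holding 10 pending units while the
callback runs on CPU 0, the buggy variant leaves those 10 units
stranded in the dead CPU's slot, whereas the fixed variant moves them
into the global total.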
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
From: Shakeel Butt @ 2021-02-02 22:23 UTC
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM, Cgroups, LKML, Kernel Team

On Tue, Feb 2, 2021 at 12:18 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I think we need Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive
complexity in memory.stat reporting")

Reviewed-by: Shakeel Butt <shakeelb@google.com>
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
From: Roman Gushchin @ 2021-02-02 23:07 UTC
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.

To the whole series: it's really nice to have accurate stats at
non-leaf levels. Just as an illustration: if there are 32 CPUs and
1000 sub-cgroups (which is an absolutely realistic number, because
often there are many dying generations of each cgroup), the error
margin is 3.9GB. It makes all numbers pretty much random and all
possible tests extremely flaky.

>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

To this patch:

Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!
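
The 3.9GB figure follows from the bound given in the cover letter
(32 pages times CPUs times cgroups in the subtree). A minimal C sketch
of that arithmetic is below; max_drift_bytes() is a hypothetical helper
written for this illustration, and the 4096-byte page size is an
assumption taken from the multiplier used in the cover letter's awk
command.

```c
#include <stdint.h>

/*
 * Worst-case drift of a batched hierarchical counter, in bytes:
 * every (CPU, cgroup) pair may hold back up to one full batch.
 */
uint64_t max_drift_bytes(uint64_t batch_pages, uint64_t nr_cpus,
			 uint64_t nr_cgroups, uint64_t page_size)
{
	return batch_pages * nr_cpus * nr_cgroups * page_size;
}
```

For a batch of 32 pages, 32 CPUs, 1000 cgroups and 4096-byte pages,
this gives 4194304000 bytes, which is 3.90625 when divided by 2^30 and
matches the 3.9GB quoted above.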
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
From: Roman Gushchin @ 2021-02-03 2:28 UTC
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > The memcg hotunplug callback erroneously flushes counts on the local
> > CPU, not the counts of the CPU going away; those counts will be lost.
> >
> > Flush the CPU that is actually going away.
> >
> > Also simplify the code a bit by using mod_memcg_state() and
> > count_memcg_events() instead of open-coding the upward flush - this is
> > comparable to how vmstat.c handles hotunplug flushing.
>
> To the whole series: it's really nice to have accurate stats at
> non-leaf levels. Just as an illustration: if there are 32 CPUs and
> 1000 sub-cgroups (which is an absolutely realistic number, because
> often there are many dying generations of each cgroup), the error
> margin is 3.9GB. It makes all numbers pretty much random and all
> possible tests extremely flaky.

Btw, I was just looking into kmem kselftests failures/flakiness,
which is caused by exactly this problem: without waiting for the
finish of dying cgroups reclaim, we can't make any reliable assumptions
about what to expect from memcg stats.

So looking forward to having this patchset merged!
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
From: Johannes Weiner @ 2021-02-04 19:29 UTC
To: Roman Gushchin
Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 06:28:53PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> > On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > > The memcg hotunplug callback erroneously flushes counts on the local
> > > CPU, not the counts of the CPU going away; those counts will be lost.
> > >
> > > Flush the CPU that is actually going away.
> > >
> > > Also simplify the code a bit by using mod_memcg_state() and
> > > count_memcg_events() instead of open-coding the upward flush - this is
> > > comparable to how vmstat.c handles hotunplug flushing.
> >
> > To the whole series: it's really nice to have accurate stats at
> > non-leaf levels. Just as an illustration: if there are 32 CPUs and
> > 1000 sub-cgroups (which is an absolutely realistic number, because
> > often there are many dying generations of each cgroup), the error
> > margin is 3.9GB. It makes all numbers pretty much random and all
> > possible tests extremely flaky.
>
> Btw, I was just looking into kmem kselftests failures/flakiness,
> which is caused by exactly this problem: without waiting for the
> finish of dying cgroups reclaim, we can't make any reliable assumptions
> about what to expect from memcg stats.

Good point about the selftests. I gave them a shot, and indeed this
series makes test_kmem work again:

vanilla:
  ok 1 test_kmem_basic
  memory.current = 8810496
  slab + anon + file + kernel_stack = 17074568
    slab = 6101384
    anon = 946176
    file = 0
    kernel_stack = 10027008
  not ok 2 test_kmem_memcg_deletion
  ok 3 test_kmem_proc_kpagecgroup
  ok 4 test_kmem_kernel_stacks
  ok 5 test_kmem_dead_cgroups
  ok 6 test_percpu_basic

patched:
  ok 1 test_kmem_basic
  ok 2 test_kmem_memcg_deletion
  ok 3 test_kmem_proc_kpagecgroup
  ok 4 test_kmem_kernel_stacks
  ok 5 test_kmem_dead_cgroups
  ok 6 test_percpu_basic

It even passes with a reduced margin in the patched kernel, since the
percpu drift - which this test already tried to account for - is now
only on the page_counter side (whereas memory.stat is always precise).

I'm going to include that data in the v2 changelog, as well as a patch
to update test_kmem.c to the more stringent error tolerances.

> So looking forward to having this patchset merged!

Thanks
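
The failing vanilla output compares memory.current against the sum of
the individual memory.stat items. A check of that general shape can be
sketched in C as follows. The helper values_within() and the 5 percent
tolerance used in the note below are assumptions made for this
illustration; they are not taken from test_kmem.c.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: true when a and b differ by no more than
 * err_pct percent of their sum (integer arithmetic throughout).
 */
bool values_within(long a, long b, int err_pct)
{
	return labs(a - b) <= (a + b) / 100 * err_pct;
}

/* Sum of the four stat items printed in the output above. */
long stat_sum(long slab, long anon, long file, long kernel_stack)
{
	return slab + anon + file + kernel_stack;
}
```

With the vanilla numbers above, the four items add up to 17074568 while
memory.current reads 8810496. The two differ by 8264072, far outside a
5 percent band in this sketch, which is consistent with the reported
test_kmem_memcg_deletion failure.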
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
From: Roman Gushchin @ 2021-02-04 19:34 UTC
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Feb 04, 2021 at 02:29:57PM -0500, Johannes Weiner wrote:
> On Tue, Feb 02, 2021 at 06:28:53PM -0800, Roman Gushchin wrote:
> > On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> > > On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > > > The memcg hotunplug callback erroneously flushes counts on the local
> > > > CPU, not the counts of the CPU going away; those counts will be lost.
> > > >
> > > > Flush the CPU that is actually going away.
> > > >
> > > > Also simplify the code a bit by using mod_memcg_state() and
> > > > count_memcg_events() instead of open-coding the upward flush - this is
> > > > comparable to how vmstat.c handles hotunplug flushing.
> > >
> > > To the whole series: it's really nice to have accurate stats at
> > > non-leaf levels. Just as an illustration: if there are 32 CPUs and
> > > 1000 sub-cgroups (which is an absolutely realistic number, because
> > > often there are many dying generations of each cgroup), the error
> > > margin is 3.9GB. It makes all numbers pretty much random and all
> > > possible tests extremely flaky.
> >
> > Btw, I was just looking into kmem kselftests failures/flakiness,
> > which is caused by exactly this problem: without waiting for the
> > finish of dying cgroups reclaim, we can't make any reliable assumptions
> > about what to expect from memcg stats.
>
> Good point about the selftests. I gave them a shot, and indeed this
> series makes test_kmem work again:
>
> vanilla:
>   ok 1 test_kmem_basic
>   memory.current = 8810496
>   slab + anon + file + kernel_stack = 17074568
>     slab = 6101384
>     anon = 946176
>     file = 0
>     kernel_stack = 10027008
>   not ok 2 test_kmem_memcg_deletion
>   ok 3 test_kmem_proc_kpagecgroup
>   ok 4 test_kmem_kernel_stacks
>   ok 5 test_kmem_dead_cgroups
>   ok 6 test_percpu_basic
>
> patched:
>   ok 1 test_kmem_basic
>   ok 2 test_kmem_memcg_deletion
>   ok 3 test_kmem_proc_kpagecgroup
>   ok 4 test_kmem_kernel_stacks
>   ok 5 test_kmem_dead_cgroups
>   ok 6 test_percpu_basic

Nice! Thanks for checking.

>
> It even passes with a reduced margin in the patched kernel, since the
> percpu drift - which this test already tried to account for - is now
> only on the page_counter side (whereas memory.stat is always precise).
>
> I'm going to include that data in the v2 changelog, as well as a patch
> to update test_kmem.c to the more stringent error tolerances.

Hm, I'm not sure it's a good idea to unconditionally lower the error tolerance:
it's convenient to be able to run the same test on older kernels.
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
@ 2021-02-05 17:50 ` Johannes Weiner
  0 siblings, 0 replies; 82+ messages in thread
From: Johannes Weiner @ 2021-02-05 17:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups,
    linux-kernel, kernel-team

On Thu, Feb 04, 2021 at 11:34:46AM -0800, Roman Gushchin wrote:
> On Thu, Feb 04, 2021 at 02:29:57PM -0500, Johannes Weiner wrote:
> > It even passes with a reduced margin in the patched kernel, since the
> > percpu drift - which this test already tried to account for - is now
> > only on the page_counter side (whereas memory.stat is always precise).
> >
> > I'm going to include that data in the v2 changelog, as well as a patch
> > to update test_kmem.c to the more stringent error tolerances.
>
> Hm, I'm not sure it's a good idea to unconditionally lower the error tolerance:
> it's convenient to be able to run the same test on older kernels.

Well, an older version of the kernel will have an older version of the
test that is tailored towards that kernel's specific behavior. That's
sort of the point of tracking code and tests in the same git tree: to
have meaningful, effective and precise tests of an ever-changing
implementation. Trying to be backward compatible will lower the test
signal and miss regressions, when a backward compatible version is at
most one git checkout away.

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing
@ 2021-02-04 13:28 ` Michal Hocko
  0 siblings, 0 replies; 82+ messages in thread
From: Michal Hocko @ 2021-02-04 13:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups,
    linux-kernel, kernel-team

On Tue 02-02-21 13:47:40, Johannes Weiner wrote:
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Shakeel has already pointed out Fixes.

> ---
>  mm/memcontrol.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..8120d565dd79 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  {
>  	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *memcg, *mi;
> +	struct mem_cgroup *memcg;
>
>  	stock = &per_cpu(memcg_stock, cpu);
>  	drain_stock(stock);
>
>  	for_each_mem_cgroup(memcg) {
> +		struct memcg_vmstats_percpu *statc;
>  		int i;
>
> +		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> +
>  		for (i = 0; i < MEMCG_NR_STAT; i++) {
>  			int nid;
> -			long x;
>
> -			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
> -			if (x)
> -				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -					atomic_long_add(x, &memcg->vmstats[i]);
> +			if (statc->stat[i]) {
> +				mod_memcg_state(memcg, i, statc->stat[i]);
> +				statc->stat[i] = 0;
> +			}
>
>  			if (i >= NR_VM_NODE_STAT_ITEMS)
>  				continue;
>
>  			for_each_node(nid) {
> +				struct batched_lruvec_stat *lstatc;
>  				struct mem_cgroup_per_node *pn;
> +				long x;
>
>  				pn = mem_cgroup_nodeinfo(memcg, nid);
> -				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
> -				if (x)
> +				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
> +
> +				x = lstatc->count[i];
> +				lstatc->count[i] = 0;
> +
> +				if (x) {
>  					do {
>  						atomic_long_add(x, &pn->lruvec_stat[i]);
>  					} while ((pn = parent_nodeinfo(pn, nid)));
> +				}
>  			}
>  		}
>
>  		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> -			long x;
> -
> -			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
> -			if (x)
> -				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -					atomic_long_add(x, &memcg->vmevents[i]);
> +			if (statc->events[i]) {
> +				count_memcg_events(memcg, i, statc->events[i]);
> +				statc->events[i] = 0;
> +			}
>  		}
>  	}
>
> --
> 2.30.0
>

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 82+ messages in thread
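The difference between flushing the local CPU and flushing the CPU that is going away can be shown with a small userspace sketch. This is not the kernel code: `struct counter`, `NR_CPUS`, `flush_local_cpu()` and `flush_dead_cpu()` are hypothetical names, a plain array stands in for the per-cpu batches, and `total` stands in for the shared atomic counter. The hierarchy walk that mod_memcg_state() takes care of is left out.

```c
#include <assert.h>

#define NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Stand-in for one memcg stat: per-cpu pending deltas plus a shared total. */
struct counter {
	long percpu[NR_CPUS];	/* batched, not yet propagated */
	long total;		/* what readers of the stat see */
};

/*
 * Buggy variant, analogous to using this_cpu_xchg() in the hotunplug
 * callback: it drains the slot of the CPU the callback runs on and
 * never touches the slot of the CPU that went away.
 */
static void flush_local_cpu(struct counter *c, int running_cpu, int dead_cpu)
{
	(void)dead_cpu;	/* ignored, which is the bug */
	assert(running_cpu >= 0 && running_cpu < NR_CPUS);
	c->total += c->percpu[running_cpu];
	c->percpu[running_cpu] = 0;
}

/*
 * Fixed variant, analogous to per_cpu_ptr(..., cpu): it drains the
 * slot that belongs to the CPU going away.
 */
static void flush_dead_cpu(struct counter *c, int dead_cpu)
{
	assert(dead_cpu >= 0 && dead_cpu < NR_CPUS);
	c->total += c->percpu[dead_cpu];
	c->percpu[dead_cpu] = 0;
}
```

With pending deltas {5, 0, 7, 0}, a total of 100, and CPU 2 going offline while the callback runs on CPU 0, the first variant ends with a total of 105 and leaves 7 stranded in a slot that nothing will visit again. The second variant ends with a total of 107 and leaves CPU 0's batch of 5 to be flushed by CPU 0 itself later.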
* [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo()
@ 2021-02-02 18:47 ` Johannes Weiner
  0 siblings, 0 replies; 82+ messages in thread
From: Johannes Weiner @ 2021-02-02 18:47 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel,
    kernel-team

No need to encapsulate a simple struct member access.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  8 +-------
 mm/memcontrol.c            | 21 +++++++++++----------
 2 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a38a1517a05..c7f387a6233e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -602,12 +602,6 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);

 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);

-static struct mem_cgroup_per_node *
-mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
-{
-	return memcg->nodeinfo[nid];
-}
-
 /**
  * mem_cgroup_lruvec - get the lru list vector for a memcg & node
  * @memcg: memcg of the wanted lruvec
@@ -631,7 +625,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 	if (!memcg)
 		memcg = root_mem_cgroup;

-	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+	mz = memcg->nodeinfo[pgdat->node_id];
 	lruvec = &mz->lruvec;
 out:
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8120d565dd79..7e05a4ebf80f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -414,13 +414,14 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 					 int size, int old_size)
 {
 	struct memcg_shrinker_map *new, *old;
+	struct mem_cgroup_per_node *pn;
 	int nid;

 	lockdep_assert_held(&memcg_shrinker_map_mutex);

 	for_each_node(nid) {
-		old = rcu_dereference_protected(
-			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+		pn = memcg->nodeinfo[nid];
+		old = rcu_dereference_protected(pn->shrinker_map, true);
 		/* Not yet online memcg */
 		if (!old)
 			return 0;
@@ -433,7 +434,7 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 		memset(new->map, (int)0xff, old_size);
 		memset((void *)new->map + old_size, 0, size - old_size);

-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+		rcu_assign_pointer(pn->shrinker_map, new);
 		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
 	}

@@ -450,7 +451,7 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 		return;

 	for_each_node(nid) {
-		pn = mem_cgroup_nodeinfo(memcg, nid);
+		pn = memcg->nodeinfo[nid];
 		map = rcu_dereference_protected(pn->shrinker_map, true);
 		kvfree(map);
 		rcu_assign_pointer(pn->shrinker_map, NULL);
@@ -713,7 +714,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 	int nid;

 	for_each_node(nid) {
-		mz = mem_cgroup_nodeinfo(memcg, nid);
+		mz = memcg->nodeinfo[nid];
 		mctz = soft_limit_tree_node(nid);
 		if (mctz)
 			mem_cgroup_remove_exceeded(mz, mctz);
@@ -796,7 +797,7 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	parent = parent_mem_cgroup(pn->memcg);
 	if (!parent)
 		return NULL;
-	return mem_cgroup_nodeinfo(parent, nid);
+	return parent->nodeinfo[nid];
 }

 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
@@ -1163,7 +1164,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (reclaim) {
 		struct mem_cgroup_per_node *mz;

-		mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
+		mz = root->nodeinfo[reclaim->pgdat->node_id];
 		iter = &mz->iter;

 		if (prev && reclaim->generation != iter->generation)
@@ -1265,7 +1266,7 @@ static void __invalidate_reclaim_iterators(struct mem_cgroup *from,
 	int nid;

 	for_each_node(nid) {
-		mz = mem_cgroup_nodeinfo(from, nid);
+		mz = from->nodeinfo[nid];
 		iter = &mz->iter;
 		cmpxchg(&iter->position, dead_memcg, NULL);
 	}
@@ -2438,7 +2439,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 				struct mem_cgroup_per_node *pn;
 				long x;

-				pn = mem_cgroup_nodeinfo(memcg, nid);
+				pn = memcg->nodeinfo[nid];
 				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);

 				x = lstatc->count[i];
@@ -4145,7 +4146,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 		unsigned long file_cost = 0;

 		for_each_online_pgdat(pgdat) {
-			mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+			mz = memcg->nodeinfo[pgdat->node_id];

 			anon_cost += mz->lruvec.anon_cost;
 			file_cost += mz->lruvec.file_cost;
--
2.30.0

^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo()
  2021-02-02 18:47 ` Johannes Weiner
  (?)
@ 2021-02-02 22:24 ` Shakeel Butt
  -1 siblings, 0 replies; 82+ messages in thread
From: Shakeel Butt @ 2021-02-02 22:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
    Cgroups, LKML, Kernel Team

On Tue, Feb 2, 2021 at 12:51 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> No need to encapsulate a simple struct member access.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo()
@ 2021-02-02 23:13 ` Roman Gushchin
  0 siblings, 0 replies; 82+ messages in thread
From: Roman Gushchin @ 2021-02-02 23:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups,
    linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 01:47:41PM -0500, Johannes Weiner wrote:
> No need to encapsulate a simple struct member access.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo()
@ 2021-02-04 13:29 ` Michal Hocko
  0 siblings, 0 replies; 82+ messages in thread
From: Michal Hocko @ 2021-02-04 13:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups,
    linux-kernel, kernel-team

On Tue 02-02-21 13:47:41, Johannes Weiner wrote:
> No need to encapsulate a simple struct member access.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  8 +-------
>  mm/memcontrol.c            | 21 +++++++++++----------
>  2 files changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7a38a1517a05..c7f387a6233e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -602,12 +602,6 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
>
>  void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
>
> -static struct mem_cgroup_per_node *
> -mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
> -{
> -	return memcg->nodeinfo[nid];
> -}
> -
>  /**
>   * mem_cgroup_lruvec - get the lru list vector for a memcg & node
>   * @memcg: memcg of the wanted lruvec
> @@ -631,7 +625,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>  	if (!memcg)
>  		memcg = root_mem_cgroup;
>
> -	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> +	mz = memcg->nodeinfo[pgdat->node_id];
>  	lruvec = &mz->lruvec;
>  out:
>  	/*
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8120d565dd79..7e05a4ebf80f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -414,13 +414,14 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>  					 int size, int old_size)
>  {
>  	struct memcg_shrinker_map *new, *old;
> +	struct mem_cgroup_per_node *pn;
>  	int nid;
>
>  	lockdep_assert_held(&memcg_shrinker_map_mutex);
>
>  	for_each_node(nid) {
> -		old = rcu_dereference_protected(
> -			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> +		pn = memcg->nodeinfo[nid];
> +		old = rcu_dereference_protected(pn->shrinker_map, true);
>  		/* Not yet online memcg */
>  		if (!old)
>  			return 0;
> @@ -433,7 +434,7 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>  		memset(new->map, (int)0xff, old_size);
>  		memset((void *)new->map + old_size, 0, size - old_size);
>
> -		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> +		rcu_assign_pointer(pn->shrinker_map, new);
>  		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
>  	}
>
> @@ -450,7 +451,7 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
>  		return;
>
>  	for_each_node(nid) {
> -		pn = mem_cgroup_nodeinfo(memcg, nid);
> +		pn = memcg->nodeinfo[nid];
>  		map = rcu_dereference_protected(pn->shrinker_map, true);
>  		kvfree(map);
>  		rcu_assign_pointer(pn->shrinker_map, NULL);
> @@ -713,7 +714,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
>  	int nid;
>
>  	for_each_node(nid) {
> -		mz = mem_cgroup_nodeinfo(memcg, nid);
> +		mz = memcg->nodeinfo[nid];
>  		mctz = soft_limit_tree_node(nid);
>  		if (mctz)
>  			mem_cgroup_remove_exceeded(mz, mctz);
> @@ -796,7 +797,7 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
>  	parent = parent_mem_cgroup(pn->memcg);
>  	if (!parent)
>  		return NULL;
> -	return mem_cgroup_nodeinfo(parent, nid);
> +	return parent->nodeinfo[nid];
>  }
>
>  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> @@ -1163,7 +1164,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	if (reclaim) {
>  		struct mem_cgroup_per_node *mz;
>
> -		mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
> +		mz = root->nodeinfo[reclaim->pgdat->node_id];
>  		iter = &mz->iter;
>
>  		if (prev && reclaim->generation != iter->generation)
> @@ -1265,7 +1266,7 @@ static void __invalidate_reclaim_iterators(struct mem_cgroup *from,
>  	int nid;
>
>  	for_each_node(nid) {
> -		mz = mem_cgroup_nodeinfo(from, nid);
> +		mz = from->nodeinfo[nid];
>  		iter = &mz->iter;
>  		cmpxchg(&iter->position, dead_memcg, NULL);
>  	}
> @@ -2438,7 +2439,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  				struct mem_cgroup_per_node *pn;
>  				long x;
>
> -				pn = mem_cgroup_nodeinfo(memcg, nid);
> +				pn = memcg->nodeinfo[nid];
>  				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
>
>  				x = lstatc->count[i];
> @@ -4145,7 +4146,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>  		unsigned long file_cost = 0;
>
>  		for_each_online_pgdat(pgdat) {
> -			mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> +			mz = memcg->nodeinfo[pgdat->node_id];
>
>  			anon_cost += mz->lruvec.anon_cost;
>  			file_cost += mz->lruvec.file_cost;
> --
> 2.30.0
>

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions @ 2021-02-02 18:47 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-02 18:47 UTC (permalink / raw) To: Andrew Morton, Tejun Heo Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team There are no users outside of the memory controller itself. The rest of the kernel cares either about node or lruvec stats. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/memcontrol.h | 44 -------------------------------------- mm/memcontrol.c | 32 +++++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 44 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c7f387a6233e..20ecdfae3289 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -867,39 +867,6 @@ struct mem_cgroup *lock_page_memcg(struct page *page); void __unlock_page_memcg(struct mem_cgroup *memcg); void unlock_page_memcg(struct page *page); -/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page_state(). - */ -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) -{ - long x = atomic_long_read(&memcg->vmstats[idx]); -#ifdef CONFIG_SMP - if (x < 0) - x = 0; -#endif - return x; -} - -/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page_state(). 
- */ -static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, - int idx) -{ - long x = 0; - int cpu; - - for_each_possible_cpu(cpu) - x += per_cpu(memcg->vmstats_local->stat[idx], cpu); -#ifdef CONFIG_SMP - if (x < 0) - x = 0; -#endif - return x; -} - void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); /* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1337,17 +1304,6 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) { } -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) -{ - return 0; -} - -static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, - int idx) -{ - return 0; -} - static inline void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int nr) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7e05a4ebf80f..2f97cb4cef6d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -789,6 +789,38 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); } +/* + * idx can be of type enum memcg_stat_item or node_stat_item. + * Keep in sync with memcg_exact_page_state(). + */ +static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) +{ + long x = atomic_long_read(&memcg->vmstats[idx]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + +/* + * idx can be of type enum memcg_stat_item or node_stat_item. + * Keep in sync with memcg_exact_page_state(). + */ +static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) +{ + long x = 0; + int cpu; + + for_each_possible_cpu(cpu) + x += per_cpu(memcg->vmstats_local->stat[idx], cpu); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + static struct mem_cgroup_per_node * parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) { -- 2.30.0 ^ permalink raw reply related [flat|nested] 82+ messages in thread
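Editorial aside for readers of the archive: the two helpers moved by this patch share one pattern, which is to read a signed counter and clamp a negative result to zero before returning it as unsigned. Below is a minimal userspace sketch of that pattern, not the kernel code. The name page_state_local_model and the plain array standing in for per-cpu storage are assumptions made for illustration. The kernel applies the clamp only under CONFIG_SMP; the likely reason (an interpretation, not stated in the patch) is that a charge and its matching uncharge can be accounted on different CPUs, so an unsynchronized reader may briefly see a negative sum.

```c
#include <stddef.h>

/*
 * Userspace model of the read side of memcg_page_state_local().
 * "percpu" is a plain array standing in for per-cpu counters
 * (hypothetical; the kernel uses per_cpu() accessors instead).
 */
unsigned long page_state_local_model(const long *percpu, size_t nr_cpus)
{
	long x = 0;
	size_t cpu;

	/* Sum the signed per-CPU deltas. */
	for (cpu = 0; cpu < nr_cpus; cpu++)
		x += percpu[cpu];

	/* A transiently negative total is reported as zero. */
	if (x < 0)
		x = 0;

	return (unsigned long)x;
}
```

With deltas {5, -3, 2, 0} the model returns 4; with {0, -3, 0, 0}, where only the uncharge is visible so far, it returns 0 instead of a huge unsigned value.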
* Re: [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions 2021-02-02 18:47 ` Johannes Weiner @ 2021-02-02 22:26 ` Shakeel Butt -1 siblings, 0 replies; 82+ messages in thread From: Shakeel Butt @ 2021-02-02 22:26 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM, Cgroups, LKML, Kernel Team On Tue, Feb 2, 2021 at 12:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > There are no users outside of the memory controller itself. The rest > of the kernel cares either about node or lruvec stats. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions @ 2021-02-02 23:17 ` Roman Gushchin 0 siblings, 0 replies; 82+ messages in thread From: Roman Gushchin @ 2021-02-02 23:17 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team On Tue, Feb 02, 2021 at 01:47:42PM -0500, Johannes Weiner wrote: > There are no users outside of the memory controller itself. The rest > of the kernel cares either about node or lruvec stats. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions @ 2021-02-04 13:30 ` Michal Hocko 0 siblings, 0 replies; 82+ messages in thread From: Michal Hocko @ 2021-02-04 13:30 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Tue 02-02-21 13:47:42, Johannes Weiner wrote: > There are no users outside of the memory controller itself. The rest > of the kernel cares either about node or lruvec stats. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/memcontrol.h | 44 -------------------------------------- > mm/memcontrol.c | 32 +++++++++++++++++++++++++++ > 2 files changed, 32 insertions(+), 44 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c7f387a6233e..20ecdfae3289 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -867,39 +867,6 @@ struct mem_cgroup *lock_page_memcg(struct page *page); > void __unlock_page_memcg(struct mem_cgroup *memcg); > void unlock_page_memcg(struct page *page); > > -/* > - * idx can be of type enum memcg_stat_item or node_stat_item. > - * Keep in sync with memcg_exact_page_state(). > - */ > -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) > -{ > - long x = atomic_long_read(&memcg->vmstats[idx]); > -#ifdef CONFIG_SMP > - if (x < 0) > - x = 0; > -#endif > - return x; > -} > - > -/* > - * idx can be of type enum memcg_stat_item or node_stat_item. > - * Keep in sync with memcg_exact_page_state(). 
> - */ > -static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, > - int idx) > -{ > - long x = 0; > - int cpu; > - > - for_each_possible_cpu(cpu) > - x += per_cpu(memcg->vmstats_local->stat[idx], cpu); > -#ifdef CONFIG_SMP > - if (x < 0) > - x = 0; > -#endif > - return x; > -} > - > void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > @@ -1337,17 +1304,6 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) > { > } > > -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) > -{ > - return 0; > -} > - > -static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, > - int idx) > -{ > - return 0; > -} > - > static inline void __mod_memcg_state(struct mem_cgroup *memcg, > int idx, > int nr) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 7e05a4ebf80f..2f97cb4cef6d 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -789,6 +789,38 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) > __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); > } > > +/* > + * idx can be of type enum memcg_stat_item or node_stat_item. > + * Keep in sync with memcg_exact_page_state(). > + */ > +static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) > +{ > + long x = atomic_long_read(&memcg->vmstats[idx]); > +#ifdef CONFIG_SMP > + if (x < 0) > + x = 0; > +#endif > + return x; > +} > + > +/* > + * idx can be of type enum memcg_stat_item or node_stat_item. > + * Keep in sync with memcg_exact_page_state(). 
> + */ > +static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) > +{ > + long x = 0; > + int cpu; > + > + for_each_possible_cpu(cpu) > + x += per_cpu(memcg->vmstats_local->stat[idx], cpu); > +#ifdef CONFIG_SMP > + if (x < 0) > + x = 0; > +#endif > + return x; > +} > + > static struct mem_cgroup_per_node * > parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) > { > -- > 2.30.0 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH 4/7] cgroup: rstat: support cgroup1 @ 2021-02-02 18:47 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-02 18:47 UTC (permalink / raw) To: Andrew Morton, Tejun Heo Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team Rstat currently only supports the default hierarchy in cgroup2. In order to replace memcg's private stats infrastructure - used in both cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1. The initialization and destruction callbacks for regular cgroups are already in place. Remove the cgroup_on_dfl() guards to handle cgroup1. The initialization of the root cgroup is currently hardcoded to only handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() and cgroup_destroy_root() to handle the default root as well as the various cgroup1 roots we may set up during mounting. The linking of css to cgroups happens in code shared between cgroup1 and cgroup2 as well. Simply remove the cgroup_on_dfl() guard. Linkage of the root css to the root cgroup is a bit trickier: per default, the root css of a subsystem controller belongs to the default hierarchy (i.e. the cgroup2 root). When a controller is mounted in its cgroup1 version, the root css is stolen and moved to the cgroup1 root; on unmount, the css moves back to the default hierarchy. Annotate rebind_subsystems() to move the root css linkage along between roots. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- kernel/cgroup/cgroup.c | 34 +++++++++++++++++++++------------- kernel/cgroup/rstat.c | 2 -- 2 files changed, 21 insertions(+), 15 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 9153b20e5cc6..e049edd66776 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1339,6 +1339,7 @@ static void cgroup_destroy_root(struct cgroup_root *root) mutex_unlock(&cgroup_mutex); + cgroup_rstat_exit(cgrp); kernfs_destroy_root(root->kf_root); cgroup_free_root(root); } @@ -1751,6 +1752,12 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask) &dcgrp->e_csets[ss->id]); spin_unlock_irq(&css_set_lock); + if (ss->css_rstat_flush) { + list_del_rcu(&css->rstat_css_node); + list_add_rcu(&css->rstat_css_node, + &dcgrp->rstat_css_list); + } + /* default hierarchy doesn't enable controllers by default */ dst_root->subsys_mask |= 1 << ssid; if (dst_root == &cgrp_dfl_root) { @@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) if (ret) goto destroy_root; - ret = rebind_subsystems(root, ss_mask); + ret = cgroup_rstat_init(root_cgrp); if (ret) goto destroy_root; + ret = rebind_subsystems(root, ss_mask); + if (ret) + goto exit_stats; + ret = cgroup_bpf_inherit(root_cgrp); WARN_ON_ONCE(ret); @@ -2006,6 +2017,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) ret = 0; goto out; +exit_stats: + cgroup_rstat_exit(root_cgrp); destroy_root: kernfs_destroy_root(root->kf_root); root->kf_root = NULL; @@ -4934,8 +4947,7 @@ static void css_free_rwork_fn(struct work_struct *work) cgroup_put(cgroup_parent(cgrp)); kernfs_put(cgrp->kn); psi_cgroup_free(cgrp); - if (cgroup_on_dfl(cgrp)) - cgroup_rstat_exit(cgrp); + cgroup_rstat_exit(cgrp); kfree(cgrp); } else { /* @@ -4976,8 +4988,7 @@ static void css_release_work_fn(struct work_struct *work) /* cgroup release path */ TRACE_CGROUP_PATH(release, cgrp); - if (cgroup_on_dfl(cgrp)) - cgroup_rstat_flush(cgrp); + 
cgroup_rstat_flush(cgrp); spin_lock_irq(&css_set_lock); for (tcgrp = cgroup_parent(cgrp); tcgrp; @@ -5034,7 +5045,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css, css_get(css->parent); } - if (cgroup_on_dfl(cgrp) && ss->css_rstat_flush) + if (ss->css_rstat_flush) list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list); BUG_ON(cgroup_css(cgrp, ss)); @@ -5159,11 +5170,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, if (ret) goto out_free_cgrp; - if (cgroup_on_dfl(parent)) { - ret = cgroup_rstat_init(cgrp); - if (ret) - goto out_cancel_ref; - } + ret = cgroup_rstat_init(cgrp); + if (ret) + goto out_cancel_ref; /* create the directory */ kn = kernfs_create_dir(parent->kn, name, mode, cgrp); @@ -5250,8 +5259,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, out_kernfs_remove: kernfs_remove(cgrp->kn); out_stat_exit: - if (cgroup_on_dfl(parent)) - cgroup_rstat_exit(cgrp); + cgroup_rstat_exit(cgrp); out_cancel_ref: percpu_ref_exit(&cgrp->self.refcnt); out_free_cgrp: diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index d51175cedfca..faa767a870ba 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void) for_each_possible_cpu(cpu) raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)); - - BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp)); } /* -- 2.30.0 ^ permalink raw reply related [flat|nested] 82+ messages in thread
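Editorial aside: the cgroup_setup_root() hunk above initializes rstat before rebinding subsystems and adds an exit_stats label so that a rebind failure also tears the stats down. The following is a small userspace sketch of that unwinding order only. struct root_model, step() and setup_root_model() are hypothetical stand-ins; the comments name the kernel calls each step is meant to model.

```c
#include <stdbool.h>

/* Tracks which setup stages are currently live (illustrative only). */
struct root_model {
	bool kernfs_ready;
	bool rstat_ready;
	bool bound;
};

/* Hypothetical stand-in for a setup call: 0 on success, negative on failure. */
static int step(bool fail)
{
	return fail ? -1 : 0;
}

int setup_root_model(struct root_model *r, bool fail_rstat, bool fail_rebind)
{
	int ret;

	r->kernfs_ready = true;		/* earlier setup assumed to have succeeded */

	ret = step(fail_rstat);		/* models cgroup_rstat_init(root_cgrp) */
	if (ret)
		goto destroy_root;
	r->rstat_ready = true;

	ret = step(fail_rebind);	/* models rebind_subsystems(root, ss_mask) */
	if (ret)
		goto exit_stats;
	r->bound = true;

	return 0;

exit_stats:
	r->rstat_ready = false;		/* models cgroup_rstat_exit(root_cgrp) */
destroy_root:
	r->kernfs_ready = false;	/* models kernfs_destroy_root() */
	return ret;
}
```

The labels fall through in reverse order of setup, so a failure at any stage undoes exactly the stages that had already completed.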
* Re: [PATCH 4/7] cgroup: rstat: support cgroup1 @ 2021-02-03 1:16 ` Roman Gushchin 0 siblings, 0 replies; 82+ messages in thread From: Roman Gushchin @ 2021-02-03 1:16 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team On Tue, Feb 02, 2021 at 01:47:43PM -0500, Johannes Weiner wrote: > Rstat currently only supports the default hierarchy in cgroup2. In > order to replace memcg's private stats infrastructure - used in both > cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1. > > The initialization and destruction callbacks for regular cgroups are > already in place. Remove the cgroup_on_dfl() guards to handle cgroup1. > > The initialization of the root cgroup is currently hardcoded to only > handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() > and cgroup_destroy_root() to handle the default root as well as the > various cgroup1 roots we may set up during mounting. > > The linking of css to cgroups happens in code shared between cgroup1 > and cgroup2 as well. Simply remove the cgroup_on_dfl() guard. > > Linkage of the root css to the root cgroup is a bit trickier: per > default, the root css of a subsystem controller belongs to the default > hierarchy (i.e. the cgroup2 root). When a controller is mounted in its > cgroup1 version, the root css is stolen and moved to the cgroup1 root; > on unmount, the css moves back to the default hierarchy. Annotate > rebind_subsystems() to move the root css linkage along between roots. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> ^ permalink raw reply [flat|nested] 82+ messages in thread
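Editorial aside: the commit message quoted above describes the root css being moved from the default root to a cgroup1 root on mount and back on unmount, with its rstat list linkage following along. Here is a simplified userspace sketch of that move, assuming a plain circular doubly linked list. The kernel uses list_del_rcu() and list_add_rcu(); this model has no RCU and no locking, and the *_model types and helper names are hypothetical.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, loosely shaped like list_head. */
struct node {
	struct node *prev, *next;
};

void node_init(struct node *n)
{
	n->prev = n->next = n;
}

void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

int node_count(const struct node *head)
{
	int count = 0;
	const struct node *p;

	for (p = head->next; p != head; p = p->next)
		count++;
	return count;
}

struct css_model { struct node rstat_css_node; };
struct cgroup_model { struct node rstat_css_list; };

/* Unlink the css from whichever root holds it and link it to dst. */
void move_root_css(struct css_model *css, struct cgroup_model *dst)
{
	node_del(&css->rstat_css_node);
	node_add(&css->rstat_css_node, &dst->rstat_css_list);
}
```

Calling move_root_css() toward the cgroup1 root models the mount case, and calling it again toward the default root models the unmount case; the css is on exactly one list at any time.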
* Re: [PATCH 4/7] cgroup: rstat: support cgroup1 @ 2021-02-04 13:39 ` Michal Hocko 0 siblings, 0 replies; 82+ messages in thread From: Michal Hocko @ 2021-02-04 13:39 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Tue 02-02-21 13:47:43, Johannes Weiner wrote: > Rstat currently only supports the default hierarchy in cgroup2. In > order to replace memcg's private stats infrastructure - used in both > cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1. > > The initialization and destruction callbacks for regular cgroups are > already in place. Remove the cgroup_on_dfl() guards to handle cgroup1. > > The initialization of the root cgroup is currently hardcoded to only > handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() > and cgroup_destroy_root() to handle the default root as well as the > various cgroup1 roots we may set up during mounting. > > The linking of css to cgroups happens in code shared between cgroup1 > and cgroup2 as well. Simply remove the cgroup_on_dfl() guard. > > Linkage of the root css to the root cgroup is a bit trickier: per > default, the root css of a subsystem controller belongs to the default > hierarchy (i.e. the cgroup2 root). When a controller is mounted in its > cgroup1 version, the root css is stolen and moved to the cgroup1 root; > on unmount, the css moves back to the default hierarchy. Annotate > rebind_subsystems() to move the root css linkage along between roots. I am not familiar with rstat API and from this patch it is not really clear to me how does it deal with memcg v1 use_hierarchy oddness. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 4/7] cgroup: rstat: support cgroup1 @ 2021-02-04 16:01 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-04 16:01 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Thu, Feb 04, 2021 at 02:39:25PM +0100, Michal Hocko wrote: > On Tue 02-02-21 13:47:43, Johannes Weiner wrote: > > Rstat currently only supports the default hierarchy in cgroup2. In > > order to replace memcg's private stats infrastructure - used in both > > cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1. > > > > The initialization and destruction callbacks for regular cgroups are > > already in place. Remove the cgroup_on_dfl() guards to handle cgroup1. > > > > The initialization of the root cgroup is currently hardcoded to only > > handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() > > and cgroup_destroy_root() to handle the default root as well as the > > various cgroup1 roots we may set up during mounting. > > > > The linking of css to cgroups happens in code shared between cgroup1 > > and cgroup2 as well. Simply remove the cgroup_on_dfl() guard. > > > > Linkage of the root css to the root cgroup is a bit trickier: per > > default, the root css of a subsystem controller belongs to the default > > hierarchy (i.e. the cgroup2 root). When a controller is mounted in its > > cgroup1 version, the root css is stolen and moved to the cgroup1 root; > > on unmount, the css moves back to the default hierarchy. Annotate > > rebind_subsystems() to move the root css linkage along between roots. > > I am not familiar with rstat API and from this patch it is not really > clear to me how does it deal with memcg v1 use_hierarchy oddness. That's gone, right? static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val) { if (val == 1) return 0; pr_warn_once("Non-hierarchical mode is deprecated. 
" "Please report your usecase to linux-mm@kvack.org if you " "depend on this functionality.\n"); return -EINVAL; } ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 4/7] cgroup: rstat: support cgroup1 @ 2021-02-04 16:42 ` Michal Hocko 0 siblings, 0 replies; 82+ messages in thread From: Michal Hocko @ 2021-02-04 16:42 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Thu 04-02-21 11:01:30, Johannes Weiner wrote: > On Thu, Feb 04, 2021 at 02:39:25PM +0100, Michal Hocko wrote: > > On Tue 02-02-21 13:47:43, Johannes Weiner wrote: > > > Rstat currently only supports the default hierarchy in cgroup2. In > > > order to replace memcg's private stats infrastructure - used in both > > > cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1. > > > > > > The initialization and destruction callbacks for regular cgroups are > > > already in place. Remove the cgroup_on_dfl() guards to handle cgroup1. > > > > > > The initialization of the root cgroup is currently hardcoded to only > > > handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() > > > and cgroup_destroy_root() to handle the default root as well as the > > > various cgroup1 roots we may set up during mounting. > > > > > > The linking of css to cgroups happens in code shared between cgroup1 > > > and cgroup2 as well. Simply remove the cgroup_on_dfl() guard. > > > > > > Linkage of the root css to the root cgroup is a bit trickier: per > > > default, the root css of a subsystem controller belongs to the default > > > hierarchy (i.e. the cgroup2 root). When a controller is mounted in its > > > cgroup1 version, the root css is stolen and moved to the cgroup1 root; > > > on unmount, the css moves back to the default hierarchy. Annotate > > > rebind_subsystems() to move the root css linkage along between roots. > > > > I am not familiar with rstat API and from this patch it is not really > > clear to me how does it deal with memcg v1 use_hierarchy oddness. > > That's gone, right? 
> > static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, > struct cftype *cft, u64 val) > { > if (val == 1) > return 0; > > pr_warn_once("Non-hierarchical mode is deprecated. " > "Please report your usecase to linux-mm@kvack.org if you " > "depend on this functionality.\n"); > > return -EINVAL; > } Ohh, right! I have completely forgotten it hit the Linus tree. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH 5/7] cgroup: rstat: punt root-level optimization to individual controllers @ 2021-02-02 18:47 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-02 18:47 UTC (permalink / raw) To: Andrew Morton, Tejun Heo Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team Current users of the rstat code can source root-level statistics from the native counters of their respective subsystem, allowing them to forego aggregation at the root level. This optimization is currently implemented inside the generic rstat code, which doesn't track the root cgroup and doesn't invoke the subsystem flush callbacks on it. However, the memory controller cannot do this optimization, because cgroup1 breaks out memory specifically for the local level, including at the root level. In preparation for the memory controller switching to rstat, move the optimization from rstat core to the controllers. Afterwards, rstat will always track the root cgroup for changes and invoke the subsystem callbacks on it; and it's up to the subsystem to special-case and skip aggregation of the root cgroup if it can source this information through other, cheaper means. The extra cost of tracking the root cgroup is negligible: on stat changes, we actually remove a branch that checks for the root. The queueing for a flush touches only per-cpu data, and only the first stat change since a flush requires a (per-cpu) lock. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- block/blk-cgroup.c | 14 +++++++--- kernel/cgroup/rstat.c | 60 +++++++++++++++++++++++++------------------ 2 files changed, 45 insertions(+), 29 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 02ce2058c14b..76725e1cad7f 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -766,6 +766,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) struct blkcg *blkcg = css_to_blkcg(css); struct blkcg_gq *blkg; + /* Root-level stats are sourced from system-wide IO stats */ + if (!cgroup_parent(css->cgroup)) + return; + rcu_read_lock(); hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) { @@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) u64_stats_update_end(&blkg->iostat.sync); /* propagate global delta to parent */ + /* XXX: could skip this if parent is root */ if (parent) { u64_stats_update_begin(&parent->iostat.sync); blkg_iostat_set(&delta, &blkg->iostat.cur); @@ -803,10 +808,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) } /* - * The rstat algorithms intentionally don't handle the root cgroup to avoid - * incurring overhead when no cgroups are defined. For that reason, - * cgroup_rstat_flush in blkcg_print_stat does not actually fill out the - * iostat in the root cgroup's blkcg_gq. + * We source root cgroup stats from the system-wide stats to avoid + * tracking the same information twice and incurring overhead when no + * cgroups are defined. For that reason, cgroup_rstat_flush in + * blkcg_print_stat does not actually fill out the iostat in the root + * cgroup's blkcg_gq. * * However, we would like to re-use the printing code between the root and * non-root cgroups to the extent possible. 
For that reason, we simulate diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index faa767a870ba..6f50c199bf2a 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -25,13 +25,8 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu) void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) { raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu); - struct cgroup *parent; unsigned long flags; - /* nothing to do for root */ - if (!cgroup_parent(cgrp)) - return; - /* * Speculative already-on-list test. This may race leading to * temporary inaccuracies, which is fine. @@ -46,10 +41,10 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) raw_spin_lock_irqsave(cpu_lock, flags); /* put @cgrp and all ancestors on the corresponding updated lists */ - for (parent = cgroup_parent(cgrp); parent; - cgrp = parent, parent = cgroup_parent(cgrp)) { + while (true) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); - struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); + struct cgroup *parent = cgroup_parent(cgrp); + struct cgroup_rstat_cpu *prstatc; /* * Both additions and removals are bottom-up. 
If a cgroup @@ -58,8 +53,16 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) if (rstatc->updated_next) break; + if (!parent) { + rstatc->updated_next = cgrp; + break; + } + + prstatc = cgroup_rstat_cpu(parent, cpu); rstatc->updated_next = prstatc->updated_children; prstatc->updated_children = cgrp; + + cgrp = parent; } raw_spin_unlock_irqrestore(cpu_lock, flags); @@ -113,23 +116,26 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, */ if (rstatc->updated_next) { struct cgroup *parent = cgroup_parent(pos); - struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); - struct cgroup_rstat_cpu *nrstatc; - struct cgroup **nextp; - - nextp = &prstatc->updated_children; - while (true) { - nrstatc = cgroup_rstat_cpu(*nextp, cpu); - if (*nextp == pos) - break; - - WARN_ON_ONCE(*nextp == parent); - nextp = &nrstatc->updated_next; + + if (parent) { + struct cgroup_rstat_cpu *prstatc; + struct cgroup **nextp; + + prstatc = cgroup_rstat_cpu(parent, cpu); + nextp = &prstatc->updated_children; + while (true) { + struct cgroup_rstat_cpu *nrstatc; + + nrstatc = cgroup_rstat_cpu(*nextp, cpu); + if (*nextp == pos) + break; + WARN_ON_ONCE(*nextp == parent); + nextp = &nrstatc->updated_next; + } + *nextp = rstatc->updated_next; } - *nextp = rstatc->updated_next; rstatc->updated_next = NULL; - return pos; } @@ -309,11 +315,15 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat, static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) { - struct cgroup *parent = cgroup_parent(cgrp); struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); + struct cgroup *parent = cgroup_parent(cgrp); struct cgroup_base_stat cur, delta; unsigned seq; + /* Root-level stats are sourced from system-wide CPU stats */ + if (!parent) + return; + /* fetch the current per-cpu values */ do { seq = __u64_stats_fetch_begin(&rstatc->bsync); @@ -326,8 +336,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) 
cgroup_base_stat_add(&cgrp->bstat, &delta); cgroup_base_stat_add(&rstatc->last_bstat, &delta); - /* propagate global delta to parent */ - if (parent) { + /* propagate global delta to parent (unless that's root) */ + if (cgroup_parent(parent)) { delta = cgrp->bstat; cgroup_base_stat_sub(&delta, &cgrp->last_bstat); cgroup_base_stat_add(&parent->bstat, &delta); -- 2.30.0 ^ permalink raw reply related [flat|nested] 82+ messages in thread
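The reworked loop in cgroup_rstat_updated() from the patch above can be modeled in userspace roughly as follows. This is a minimal single-CPU sketch under stated assumptions, not the kernel code: struct toy_cgroup, toy_init() and toy_updated() are hypothetical names, the per-cpu lock and the speculative already-on-list test are left out, and the updated_children/updated_next fields of struct cgroup_rstat_cpu are folded into the cgroup itself. It assumes, as the WARN_ON_ONCE(*nextp == parent) check in the diff suggests, that a parent's child list is terminated by a pointer back to the parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-CPU stand-in for struct cgroup + cgroup_rstat_cpu. */
struct toy_cgroup {
	const char *name;
	struct toy_cgroup *parent;           /* NULL for the root */
	struct toy_cgroup *updated_children; /* list head, terminated by self */
	struct toy_cgroup *updated_next;     /* NULL means "not queued" */
};

/* An idle cgroup has an empty child list (self pointer) and is not queued. */
static void toy_init(struct toy_cgroup *cgrp, const char *name,
		     struct toy_cgroup *parent)
{
	assert(cgrp != NULL);
	cgrp->name = name;
	cgrp->parent = parent;
	cgrp->updated_children = cgrp;
	cgrp->updated_next = NULL;
}

/* Queue @cgrp and all of its ancestors, now including the root. */
static void toy_updated(struct toy_cgroup *cgrp)
{
	assert(cgrp != NULL);

	while (1) {
		struct toy_cgroup *parent = cgrp->parent;

		/* Already queued: every ancestor is queued as well. */
		if (cgrp->updated_next)
			break;

		/* The root has no parent list to join; a self pointer marks it. */
		if (!parent) {
			cgrp->updated_next = cgrp;
			break;
		}

		/* Push onto the front of the parent's updated_children list. */
		cgrp->updated_next = parent->updated_children;
		parent->updated_children = cgrp;

		cgrp = parent;
	}
}
```

Calling toy_updated() on a leaf links each ancestor into its parent's list, and the root marks itself as queued by pointing updated_next at itself. A second call on the same leaf stops at the first check, which is why repeated stat changes between flushes stay cheap, and siblings that never change are never touched.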
* [PATCH 6/7] mm: memcontrol: switch to rstat @ 2021-02-02 18:47 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-02 18:47 UTC (permalink / raw) To: Andrew Morton, Tejun Heo Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team Replace the memory controller's custom hierarchical stats code with the generic rstat infrastructure provided by the cgroup core. The current implementation does batched upward propagation from the write side (i.e. as stats change). The per-cpu batches introduce an error, which is multiplied by the number of subgroups in a tree. In systems with many CPUs and sizable cgroup trees, the error can be large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups results in an error of up to 128M per stat item). This can entirely swallow allocation bursts inside a workload that the user is expecting to see reflected in the statistics. In the past, we've done read-side aggregation, where a memory.stat read would have to walk the entire subtree and add up per-cpu counts. This became problematic with lazily-freed cgroups: we could have large subtrees where most cgroups were entirely idle. Hence the switch to change-driven upward propagation. Unfortunately, it needed to trade accuracy for speed due to the write side being so hot. Rstat combines the best of both worlds: from the write side, it cheaply maintains a queue of cgroups that have pending changes, so that the read side can do selective tree aggregation. This way the reported stats will always be precise and recent as can be, while the aggregation can skip over potentially large numbers of idle cgroups. This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward aggregation. It removes 3 words from the per-cpu data. It eliminates memcg_exact_page_state(), since memcg_page_state() is now exact. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/memcontrol.h | 67 ++++++----- mm/memcontrol.c | 224 +++++++++++++++---------------------- 2 files changed, 133 insertions(+), 158 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 20ecdfae3289..a8c7a0ccc759 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -76,10 +76,27 @@ enum mem_cgroup_events_target { }; struct memcg_vmstats_percpu { - long stat[MEMCG_NR_STAT]; - unsigned long events[NR_VM_EVENT_ITEMS]; - unsigned long nr_page_events; - unsigned long targets[MEM_CGROUP_NTARGETS]; + /* Local (CPU and cgroup) page state & events */ + long state[MEMCG_NR_STAT]; + unsigned long events[NR_VM_EVENT_ITEMS]; + + /* Delta calculation for lockless upward propagation */ + long state_prev[MEMCG_NR_STAT]; + unsigned long events_prev[NR_VM_EVENT_ITEMS]; + + /* Cgroup1: threshold notifications & softlimit tree updates */ + unsigned long nr_page_events; + unsigned long targets[MEM_CGROUP_NTARGETS]; +}; + +struct memcg_vmstats { + /* Aggregated (CPU and subtree) page state & events */ + long state[MEMCG_NR_STAT]; + unsigned long events[NR_VM_EVENT_ITEMS]; + + /* Pending child counts during tree propagation */ + long state_pending[MEMCG_NR_STAT]; + unsigned long events_pending[NR_VM_EVENT_ITEMS]; }; struct mem_cgroup_reclaim_iter { @@ -287,8 +304,8 @@ struct mem_cgroup { MEMCG_PADDING(_pad1_); - atomic_long_t vmstats[MEMCG_NR_STAT]; - atomic_long_t vmevents[NR_VM_EVENT_ITEMS]; + /* memory.stat */ + struct memcg_vmstats vmstats; /* memory.events */ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; @@ -315,10 +332,6 @@ struct mem_cgroup { atomic_t moving_account; struct task_struct *move_lock_task; - /* Legacy local VM stats and events */ - struct memcg_vmstats_percpu __percpu *vmstats_local; - - /* Subtree VM stats and events (batched updates) */ struct memcg_vmstats_percpu __percpu *vmstats_percpu; #ifdef CONFIG_CGROUP_WRITEBACK @@ -942,10 
+955,6 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec, local_irq_restore(flags); } -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); - void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count); @@ -1028,6 +1037,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void mem_cgroup_split_huge_fixup(struct page *head); #endif +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned); + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -1136,6 +1149,10 @@ static inline bool lruvec_holds_page_lru_lock(struct page *page, return lruvec == &pgdat->__lruvec; } +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} + static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { return NULL; @@ -1349,18 +1366,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, mod_node_page_state(page_pgdat(page), idx, val); } -static inline -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - return 0; -} - -static inline void mem_cgroup_split_huge_fixup(struct page *head) -{ -} - static inline void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) @@ -1383,8 +1388,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +static inline void mem_cgroup_split_huge_fixup(struct page *head) +{ +} + +static inline +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned) { + return 0; } #endif /* CONFIG_MEMCG */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f97cb4cef6d..b205b2413186 100644 --- a/mm/memcontrol.c 
+++ b/mm/memcontrol.c
@@ -757,6 +757,11 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 	return mz;
 }
 
+static void memcg_flush_vmstats(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -765,37 +770,17 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x, threshold = MEMCG_CHARGE_BATCH;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	if (memcg_stat_item_in_bytes(idx))
-		threshold <<= PAGE_SHIFT;
-
-	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > threshold)) {
-		struct mem_cgroup *mi;
-
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(memcg->vmstats_local->stat[idx], x);
-		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-			atomic_long_add(x, &mi->vmstats[idx]);
-		x = 0;
-	}
-	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
+	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
+/* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 {
-	long x = atomic_long_read(&memcg->vmstats[idx]);
+	long x = READ_ONCE(memcg->vmstats.state[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -803,17 +788,14 @@ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 	return x;
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
+/* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
 	long x = 0;
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
+		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -936,30 +918,16 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count)
 {
-	unsigned long x;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
-	if (unlikely(x > MEMCG_CHARGE_BATCH)) {
-		struct mem_cgroup *mi;
-
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(memcg->vmstats_local->events[idx], x);
-		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-			atomic_long_add(x, &mi->vmevents[idx]);
-		x = 0;
-	}
-	__this_cpu_write(memcg->vmstats_percpu->events[idx], x);
+	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 {
-	return atomic_long_read(&memcg->vmevents[event]);
+	return READ_ONCE(memcg->vmstats.events[event]);
 }
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
@@ -968,7 +936,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->events[event], cpu);
+		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
 	return x;
 }
 
@@ -1631,6 +1599,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	 *
 	 * Current memory state:
 	 */
+	memcg_flush_vmstats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -2450,22 +2419,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
-		struct memcg_vmstats_percpu *statc;
 		int i;
 
-		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
-
-		for (i = 0; i < MEMCG_NR_STAT; i++) {
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			int nid;
 
-			if (statc->stat[i]) {
-				mod_memcg_state(memcg, i, statc->stat[i]);
-				statc->stat[i] = 0;
-			}
-
-			if (i >= NR_VM_NODE_STAT_ITEMS)
-				continue;
-
 			for_each_node(nid) {
 				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
@@ -2484,13 +2442,6 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 				}
 			}
 		}
-
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			if (statc->events[i]) {
-				count_memcg_events(memcg, i, statc->events[i]);
-				statc->events[i] = 0;
-			}
-		}
 	}
 
 	return 0;
@@ -3618,6 +3569,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 {
 	unsigned long val;
 
+	memcg_flush_vmstats(memcg);
+
 	if (mem_cgroup_is_root(memcg)) {
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
@@ -3683,26 +3636,15 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 	}
 }
 
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
+static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
 {
-	unsigned long stat[MEMCG_NR_STAT] = {0};
-	struct mem_cgroup *mi;
-	int node, cpu, i;
-
-	for_each_online_cpu(cpu)
-		for (i = 0; i < MEMCG_NR_STAT; i++)
-			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
-
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = 0; i < MEMCG_NR_STAT; i++)
-			atomic_long_add(stat[i], &mi->vmstats[i]);
+	int node;
 
 	for_each_node(node) {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
+		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, };
 		struct mem_cgroup_per_node *pi;
-
-		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-			stat[i] = 0;
+		int cpu, i;
 
 		for_each_online_cpu(cpu)
 			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
@@ -3715,25 +3657,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
 	}
 }
 
-static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
-{
-	unsigned long events[NR_VM_EVENT_ITEMS];
-	struct mem_cgroup *mi;
-	int cpu, i;
-
-	for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-		events[i] = 0;
-
-	for_each_online_cpu(cpu)
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			events[i] += per_cpu(memcg->vmstats_percpu->events[i],
-					     cpu);
-
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			atomic_long_add(events[i], &mi->vmevents[i]);
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -4050,6 +3973,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	memcg_flush_vmstats(memcg);
+
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
 			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
@@ -4120,6 +4045,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
+	memcg_flush_vmstats(memcg);
+
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
@@ -4596,22 +4523,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 	return &memcg->cgwb_domain;
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page().
- */
-static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
-{
-	long x = atomic_long_read(&memcg->vmstats[idx]);
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx];
-	if (x < 0)
-		x = 0;
-	return x;
-}
-
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
@@ -4637,13 +4548,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
+	memcg_flush_vmstats(memcg);
 
-	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
-	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
-			memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
-	*pheadroom = PAGE_COUNTER_MAX;
+	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+	*pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) +
+			memcg_page_state(memcg, NR_ACTIVE_FILE);
+	*pheadroom = PAGE_COUNTER_MAX;
 
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(READ_ONCE(memcg->memory.max),
 					    READ_ONCE(memcg->memory.high));
@@ -5275,7 +5187,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
-	free_percpu(memcg->vmstats_local);
 	kfree(memcg);
 }
 
@@ -5283,11 +5194,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	memcg_wb_domain_exit(memcg);
 	/*
-	 * Flush percpu vmstats and vmevents to guarantee the value correctness
-	 * on parent's and all ancestor levels.
+	 * Flush percpu lruvec stats to guarantee the value
+	 * correctness on parent's and all ancestor levels.
 	 */
-	memcg_flush_percpu_vmstats(memcg);
-	memcg_flush_percpu_vmevents(memcg);
+	memcg_flush_lruvec_page_state(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -5314,11 +5224,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
-						GFP_KERNEL_ACCOUNT);
-	if (!memcg->vmstats_local)
-		goto fail;
-
 	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
 						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
@@ -5518,6 +5423,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	memcg_wb_domain_size_changed(memcg);
 }
 
+static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct memcg_vmstats_percpu *statc;
+	long delta, v;
+	int i;
+
+	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
+	for (i = 0; i < MEMCG_NR_STAT; i++) {
+		/*
+		 * Collect the aggregated propagation counts of groups
+		 * below us. We're in a per-cpu loop here and this is
+		 * a global counter, so the first cycle will get them.
+		 */
+		delta = memcg->vmstats.state_pending[i];
+		if (delta)
+			memcg->vmstats.state_pending[i] = 0;
+
+		/* Add CPU changes on this level since the last flush */
+		v = READ_ONCE(statc->state[i]);
+		if (v != statc->state_prev[i]) {
+			delta += v - statc->state_prev[i];
+			statc->state_prev[i] = v;
+		}
+
+		if (!delta)
+			continue;
+
+		/* Aggregate counts on this level and propagate upwards */
+		memcg->vmstats.state[i] += delta;
+		if (parent)
+			parent->vmstats.state_pending[i] += delta;
+	}
+
+	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
+		delta = memcg->vmstats.events_pending[i];
+		if (delta)
+			memcg->vmstats.events_pending[i] = 0;
+
+		v = READ_ONCE(statc->events[i]);
+		if (v != statc->events_prev[i]) {
+			delta += v - statc->events_prev[i];
+			statc->events_prev[i] = v;
+		}
+
+		if (!delta)
+			continue;
+
+		memcg->vmstats.events[i] += delta;
+		if (parent)
+			parent->vmstats.events_pending[i] += delta;
+	}
+}
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 static int mem_cgroup_do_precharge(unsigned long count)
@@ -6571,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_released = mem_cgroup_css_released,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
+	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
-- 
2.30.0

^ permalink raw reply related [flat|nested] 82+ messages in thread
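To make the delta bookkeeping in mem_cgroup_css_rstat_flush() easier to follow, here is a minimal userspace sketch of the same idea for a single counter. It is not kernel code: struct group, mod_state(), flush_one() and flush_chain() are hypothetical names, NR_CPUS is fixed at 2, and the bottom-up walk in flush_chain() only stands in for the ordering that the cgroup core's rstat flush is assumed to provide (children before parents, once per CPU).

```c
#include <assert.h>

#define NR_CPUS 2	/* assumption: tiny fixed CPU count for illustration */

/* Hypothetical stand-in for one stat item of one memcg. */
struct group {
	struct group *parent;
	long cpu_state[NR_CPUS];	/* like vmstats_percpu->state[i] */
	long cpu_prev[NR_CPUS];		/* like vmstats_percpu->state_prev[i] */
	long state;			/* like vmstats.state[i] */
	long pending;			/* like vmstats.state_pending[i] */
};

/* Write side: only the local per-CPU counter is touched. */
static void mod_state(struct group *g, int cpu, long val)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	g->cpu_state[cpu] += val;
}

/* Mirrors the per-item body of the flush callback for one CPU. */
static void flush_one(struct group *g, int cpu)
{
	long delta = g->pending;	/* deltas handed up by children */
	long v = g->cpu_state[cpu];

	g->pending = 0;

	/* Add this CPU's changes since the last flush. */
	if (v != g->cpu_prev[cpu]) {
		delta += v - g->cpu_prev[cpu];
		g->cpu_prev[cpu] = v;
	}
	if (!delta)
		return;

	/* Aggregate on this level and queue the delta for the parent. */
	g->state += delta;
	if (g->parent)
		g->parent->pending += delta;
}

/* Simplified flush: for each CPU, walk from the leaf up to the root. */
static void flush_chain(struct group *leaf)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		for (struct group *g = leaf; g; g = g->parent)
			flush_one(g, cpu);
}
```

With a chain root <- mid <- leaf, adding 5 and 3 on the leaf's two CPUs and 2 on mid, a single flush_chain(&leaf) leaves leaf.state at 8 and both mid.state and root.state at 10. Nothing is visible in the parents before the flush, which is the point of the read-side design: the write side stays a plain per-CPU add.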
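The worst-case error quoted in the commit message for the old batching scheme can be checked with a few lines of arithmetic. The sketch below assumes 4 KiB pages, which the message does not state but which the 128M figure implies; max_batch_error_bytes() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/*
 * Upper bound on the bytes that can sit unpropagated in per-cpu batches
 * for one stat item: each CPU of each cgroup in the subtree may hold up
 * to one full batch.
 */
static unsigned long long max_batch_error_bytes(unsigned long batch_pages,
						 unsigned long nr_cpus,
						 unsigned long nr_cgroups,
						 unsigned long page_size)
{
	assert(page_size > 0);
	return (unsigned long long)batch_pages * nr_cpus * nr_cgroups * page_size;
}
```

For the example in the message, 32 * 32 * 32 = 32768 pages; at 4096 bytes per page that is 134217728 bytes, i.e. 128 MiB per stat item.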
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat
@ 2021-02-03  1:47 ` Roman Gushchin
  0 siblings, 0 replies; 82+ messages in thread
From: Roman Gushchin @ 2021-02-03  1:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 01:47:45PM -0500, Johannes Weiner wrote:
> Replace the memory controller's custom hierarchical stats code with
> the generic rstat infrastructure provided by the cgroup core.
> 
> The current implementation does batched upward propagation from the
> write side (i.e. as stats change). The per-cpu batches introduce an
> error, which is multiplied by the number of subgroups in a tree. In
> systems with many CPUs and sizable cgroup trees, the error can be
> large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> subgroups results in an error of up to 128M per stat item). This can
> entirely swallow allocation bursts inside a workload that the user is
> expecting to see reflected in the statistics.
> 
> In the past, we've done read-side aggregation, where a memory.stat
> read would have to walk the entire subtree and add up per-cpu
> counts. This became problematic with lazily-freed cgroups: we could
> have large subtrees where most cgroups were entirely idle. Hence the
> switch to change-driven upward propagation. Unfortunately, it needed
> to trade accuracy for speed due to the write side being so hot.
> 
> Rstat combines the best of both worlds: from the write side, it
> cheaply maintains a queue of cgroups that have pending changes, so
> that the read side can do selective tree aggregation. This way the
> reported stats will always be precise and recent as can be, while the
> aggregation can skip over potentially large numbers of idle cgroups.
> 
> This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> aggregation. It removes 3 words from the per-cpu data. It eliminates
> memcg_exact_page_state(), since memcg_page_state() is now exact.

Nice!

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  67 ++++++-----
>  mm/memcontrol.c            | 224 +++++++++++++++----------------------
>  2 files changed, 133 insertions(+), 158 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20ecdfae3289..a8c7a0ccc759 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,10 +76,27 @@ enum mem_cgroup_events_target {
>  };
>  
>  struct memcg_vmstats_percpu {
> -	long stat[MEMCG_NR_STAT];
> -	unsigned long events[NR_VM_EVENT_ITEMS];
> -	unsigned long nr_page_events;
> -	unsigned long targets[MEM_CGROUP_NTARGETS];
> +	/* Local (CPU and cgroup) page state & events */
> +	long state[MEMCG_NR_STAT];
> +	unsigned long events[NR_VM_EVENT_ITEMS];
> +
> +	/* Delta calculation for lockless upward propagation */
> +	long state_prev[MEMCG_NR_STAT];
> +	unsigned long events_prev[NR_VM_EVENT_ITEMS];
> +
> +	/* Cgroup1: threshold notifications & softlimit tree updates */
> +	unsigned long nr_page_events;
> +	unsigned long targets[MEM_CGROUP_NTARGETS];
> +};
> +
> +struct memcg_vmstats {
> +	/* Aggregated (CPU and subtree) page state & events */
> +	long state[MEMCG_NR_STAT];
> +	unsigned long events[NR_VM_EVENT_ITEMS];
> +
> +	/* Pending child counts during tree propagation */
> +	long state_pending[MEMCG_NR_STAT];
> +	unsigned long events_pending[NR_VM_EVENT_ITEMS];
>  };
>  
>  struct mem_cgroup_reclaim_iter {
> @@ -287,8 +304,8 @@ struct mem_cgroup {
>  
>  	MEMCG_PADDING(_pad1_);
>  
> -	atomic_long_t vmstats[MEMCG_NR_STAT];
> -	atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
> +	/* memory.stat */
> +	struct memcg_vmstats vmstats;
>  
>  	/* memory.events */
>  	atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
> @@ -315,10 +332,6 @@ struct mem_cgroup {
>  	atomic_t moving_account;
>  	struct task_struct *move_lock_task;
>  
> -	/* Legacy local VM stats and events */
> -	struct memcg_vmstats_percpu __percpu *vmstats_local;
> -
> -	/* Subtree VM stats and events (batched updates) */
>  	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> @@ -942,10 +955,6 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
>  	local_irq_restore(flags);
>  }
>  
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -						gfp_t gfp_mask,
> -						unsigned long *total_scanned);
> -
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  			  unsigned long count);
>  
> @@ -1028,6 +1037,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  void mem_cgroup_split_huge_fixup(struct page *head);
>  #endif
>  
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +						gfp_t gfp_mask,
> +						unsigned long *total_scanned);
> +
>  #else /* CONFIG_MEMCG */
>  
>  #define MEM_CGROUP_ID_SHIFT 0
> @@ -1136,6 +1149,10 @@ static inline bool lruvec_holds_page_lru_lock(struct page *page,
>  	return lruvec == &pgdat->__lruvec;
>  }
>  
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +
>  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
>  	return NULL;
> @@ -1349,18 +1366,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
>  	mod_node_page_state(page_pgdat(page), idx, val);
>  }
>  
> -static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned)
> -{
> -	return 0;
> -}
> -
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> -{
> -}
> -
>  static inline void count_memcg_events(struct mem_cgroup *memcg,
>  				      enum vm_event_item idx,
>  				      unsigned long count)
> @@ -1383,8 +1388,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
>  
> -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +{
> +}
> +
> +static inline
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					    gfp_t gfp_mask,
> +					    unsigned long *total_scanned)
>  {
> +	return 0;
>  }
>  #endif /* CONFIG_MEMCG */
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2f97cb4cef6d..b205b2413186 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -757,6 +757,11 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  	return mz;
>  }
>  
> +static void memcg_flush_vmstats(struct mem_cgroup *memcg)
> +{
> +	cgroup_rstat_flush(memcg->css.cgroup);
> +}
> +
>  /**
>   * __mod_memcg_state - update cgroup memory statistics
>   * @memcg: the memory cgroup
> @@ -765,37 +770,17 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  {
> -	long x, threshold = MEMCG_CHARGE_BATCH;
> -
>  	if (mem_cgroup_disabled())
>  		return;
>  
> -	if (memcg_stat_item_in_bytes(idx))
> -		threshold <<= PAGE_SHIFT;
> -
> -	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> -	if (unlikely(abs(x) > threshold)) {
> -		struct mem_cgroup *mi;
> -
> -		/*
> -		 * Batch local counters to keep them in sync with
> -		 * the hierarchical ones.
> -		 */
> -		__this_cpu_add(memcg->vmstats_local->stat[idx], x);
> -		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -			atomic_long_add(x, &mi->vmstats[idx]);
> -		x = 0;
> -	}
> -	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
> +	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  }
>  
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page_state().
> - */
> +/* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>  {
> -	long x = atomic_long_read(&memcg->vmstats[idx]);
> +	long x = READ_ONCE(memcg->vmstats.state[idx]);
>  #ifdef CONFIG_SMP
>  	if (x < 0)
>  		x = 0;
> @@ -803,17 +788,14 @@ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>  	return x;
>  }
>  
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page_state().
> - */
> +/* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
>  	long x = 0;
>  	int cpu;
>  
>  	for_each_possible_cpu(cpu)
> -		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
> +		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
>  #ifdef CONFIG_SMP
>  	if (x < 0)
>  		x = 0;
> @@ -936,30 +918,16 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  			  unsigned long count)
>  {
> -	unsigned long x;
> -
>  	if (mem_cgroup_disabled())
>  		return;
>  
> -	x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> -	if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> -		struct mem_cgroup *mi;
> -
> -		/*
> -		 * Batch local counters to keep them in sync with
> -		 * the hierarchical ones.
> -		 */
> -		__this_cpu_add(memcg->vmstats_local->events[idx], x);
> -		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -			atomic_long_add(x, &mi->vmevents[idx]);
> -		x = 0;
> -	}
> -	__this_cpu_write(memcg->vmstats_percpu->events[idx], x);
> +	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  }
>  
>  static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>  {
> -	return atomic_long_read(&memcg->vmevents[event]);
> +	return READ_ONCE(memcg->vmstats.events[event]);
>  }
>  
>  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> @@ -968,7 +936,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  	int cpu;
>  
>  	for_each_possible_cpu(cpu)
> -		x += per_cpu(memcg->vmstats_local->events[event], cpu);
> +		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
>  	return x;
>  }
>  
> @@ -1631,6 +1599,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	 *
>  	 * Current memory state:
>  	 */
> +	memcg_flush_vmstats(memcg);
>  
>  	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>  		u64 size;
> @@ -2450,22 +2419,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  	drain_stock(stock);
>  
>  	for_each_mem_cgroup(memcg) {
> -		struct memcg_vmstats_percpu *statc;
>  		int i;
>  
> -		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> -
> -		for (i = 0; i < MEMCG_NR_STAT; i++) {
> +		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>  			int nid;
>  
> -			if (statc->stat[i]) {
> -				mod_memcg_state(memcg, i, statc->stat[i]);
> -				statc->stat[i] = 0;
> -			}
> -
> -			if (i >= NR_VM_NODE_STAT_ITEMS)
> -				continue;
> -
>  			for_each_node(nid) {
>  				struct batched_lruvec_stat *lstatc;
>  				struct mem_cgroup_per_node *pn;
> @@ -2484,13 +2442,6 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  				}
>  			}
>  		}
> -
> -		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> -			if (statc->events[i]) {
> -				count_memcg_events(memcg, i, statc->events[i]);
> -				statc->events[i] = 0;
> -			}
> -		}
>  	}
> > return 0; > @@ -3618,6 +3569,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) > { > unsigned long val; > > + memcg_flush_vmstats(memcg); > + > if (mem_cgroup_is_root(memcg)) { > val = memcg_page_state(memcg, NR_FILE_PAGES) + > memcg_page_state(memcg, NR_ANON_MAPPED); > @@ -3683,26 +3636,15 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > } > } > > -static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) > +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg) > { > - unsigned long stat[MEMCG_NR_STAT] = {0}; > - struct mem_cgroup *mi; > - int node, cpu, i; > - > - for_each_online_cpu(cpu) > - for (i = 0; i < MEMCG_NR_STAT; i++) > - stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu); > - > - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) > - for (i = 0; i < MEMCG_NR_STAT; i++) > - atomic_long_add(stat[i], &mi->vmstats[i]); > + int node; > > for_each_node(node) { > struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, }; ^^ I'd drop the comma here. It seems that "{0}" version is way more popular over the mm code and in the kernel in general. 
> struct mem_cgroup_per_node *pi; > - > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > - stat[i] = 0; > + int cpu, i; > > for_each_online_cpu(cpu) > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > @@ -3715,25 +3657,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) > } > } > > -static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg) > -{ > - unsigned long events[NR_VM_EVENT_ITEMS]; > - struct mem_cgroup *mi; > - int cpu, i; > - > - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) > - events[i] = 0; > - > - for_each_online_cpu(cpu) > - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) > - events[i] += per_cpu(memcg->vmstats_percpu->events[i], > - cpu); > - > - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) > - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) > - atomic_long_add(events[i], &mi->vmevents[i]); > -} > - > #ifdef CONFIG_MEMCG_KMEM > static int memcg_online_kmem(struct mem_cgroup *memcg) > { > @@ -4050,6 +3973,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) > int nid; > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > + memcg_flush_vmstats(memcg); > + > for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { > seq_printf(m, "%s=%lu", stat->name, > mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, > @@ -4120,6 +4045,8 @@ static int memcg_stat_show(struct seq_file *m, void *v) > > BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); > > + memcg_flush_vmstats(memcg); > + > for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { > unsigned long nr; > > @@ -4596,22 +4523,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) > return &memcg->cgwb_domain; > } > > -/* > - * idx can be of type enum memcg_stat_item or node_stat_item. > - * Keep in sync with memcg_exact_page(). 
> - */ > -static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx) > -{ > - long x = atomic_long_read(&memcg->vmstats[idx]); > - int cpu; > - > - for_each_online_cpu(cpu) > - x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx]; > - if (x < 0) > - x = 0; > - return x; > -} > - > /** > * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg > * @wb: bdi_writeback in question > @@ -4637,13 +4548,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); > struct mem_cgroup *parent; > > - *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); > + memcg_flush_vmstats(memcg); > > - *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); > - *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + > - memcg_exact_page_state(memcg, NR_ACTIVE_FILE); > - *pheadroom = PAGE_COUNTER_MAX; > + *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); > + *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); > + *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) + > + memcg_page_state(memcg, NR_ACTIVE_FILE); > > + *pheadroom = PAGE_COUNTER_MAX; > while ((parent = parent_mem_cgroup(memcg))) { > unsigned long ceiling = min(READ_ONCE(memcg->memory.max), > READ_ONCE(memcg->memory.high)); > @@ -5275,7 +5187,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) > for_each_node(node) > free_mem_cgroup_per_node_info(memcg, node); > free_percpu(memcg->vmstats_percpu); > - free_percpu(memcg->vmstats_local); > kfree(memcg); > } > > @@ -5283,11 +5194,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) > { > memcg_wb_domain_exit(memcg); > /* > - * Flush percpu vmstats and vmevents to guarantee the value correctness > - * on parent's and all ancestor levels. > + * Flush percpu lruvec stats to guarantee the value > + * correctness on parent's and all ancestor levels. 
> */ > - memcg_flush_percpu_vmstats(memcg); > - memcg_flush_percpu_vmevents(memcg); > + memcg_flush_lruvec_page_state(memcg); > __mem_cgroup_free(memcg); > } > > @@ -5314,11 +5224,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void) > goto fail; > } > > - memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, > - GFP_KERNEL_ACCOUNT); > - if (!memcg->vmstats_local) > - goto fail; > - > memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, > GFP_KERNEL_ACCOUNT); > if (!memcg->vmstats_percpu) > @@ -5518,6 +5423,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) > memcg_wb_domain_size_changed(memcg); > } > > +static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(css); > + struct mem_cgroup *parent = parent_mem_cgroup(memcg); > + struct memcg_vmstats_percpu *statc; > + long delta, v; > + int i; > + > + statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); > + > + for (i = 0; i < MEMCG_NR_STAT; i++) { > + /* > + * Collect the aggregated propagation counts of groups > + * below us. We're in a per-cpu loop here and this is > + * a global counter, so the first cycle will get them. 
> + */ > + delta = memcg->vmstats.state_pending[i]; > + if (delta) > + memcg->vmstats.state_pending[i] = 0; > + > + /* Add CPU changes on this level since the last flush */ > + v = READ_ONCE(statc->state[i]); > + if (v != statc->state_prev[i]) { > + delta += v - statc->state_prev[i]; > + statc->state_prev[i] = v; > + } > + > + if (!delta) > + continue; > + > + /* Aggregate counts on this level and propagate upwards */ > + memcg->vmstats.state[i] += delta; > + if (parent) > + parent->vmstats.state_pending[i] += delta; > + } > + > + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { > + delta = memcg->vmstats.events_pending[i]; > + if (delta) > + memcg->vmstats.events_pending[i] = 0; > + > + v = READ_ONCE(statc->events[i]); > + if (v != statc->events_prev[i]) { > + delta += v - statc->events_prev[i]; > + statc->events_prev[i] = v; > + } > + > + if (!delta) > + continue; > + > + memcg->vmstats.events[i] += delta; > + if (parent) > + parent->vmstats.events_pending[i] += delta; > + } > +} > + > #ifdef CONFIG_MMU > /* Handlers for move charge at task migration. */ > static int mem_cgroup_do_precharge(unsigned long count) > @@ -6571,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = { > .css_released = mem_cgroup_css_released, > .css_free = mem_cgroup_css_free, > .css_reset = mem_cgroup_css_reset, > + .css_rstat_flush = mem_cgroup_css_rstat_flush, > .can_attach = mem_cgroup_can_attach, > .cancel_attach = mem_cgroup_cancel_attach, > .post_attach = mem_cgroup_move_task, > -- > 2.30.0 > With a tiny nit above Reviewed-by: Roman Gushchin <guro@fb.com> . Thanks! ^ permalink raw reply [flat|nested] 82+ messages in thread
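For readers following the new mem_cgroup_css_rstat_flush() in the patch quoted above, its delta propagation can be modelled in plain userspace C. This is only a sketch under stated assumptions, not the kernel code: struct group, mod_state(), flush_one() and flush_chain() are hypothetical names, a single counter stands in for the state[] array, NR_CPUS is fixed at 2 for the model, and a simple leaf-to-root walk replaces the child-before-parent ordering that the cgroup core's rstat flush provides.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2	/* hypothetical CPU count for this model */

/* Hypothetical userspace stand-in for one stat item of one memcg. */
struct group {
	struct group *parent;
	long percpu_state[NR_CPUS];	/* bumped on the hot write path */
	long percpu_prev[NR_CPUS];	/* value seen at the last flush */
	long state;			/* aggregated: all CPUs + subtree */
	long pending;			/* deltas handed up by children */
};

/* Write side: only touch the local per-cpu counter. */
static void mod_state(struct group *g, int cpu, long val)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	g->percpu_state[cpu] += val;
}

/* Read side: fold pending child deltas plus one CPU's local changes. */
static void flush_one(struct group *g, int cpu)
{
	long delta = g->pending;
	long v = g->percpu_state[cpu];

	g->pending = 0;

	/* Add this CPU's changes on this level since the last flush */
	if (v != g->percpu_prev[cpu]) {
		delta += v - g->percpu_prev[cpu];
		g->percpu_prev[cpu] = v;
	}

	if (!delta)
		return;

	/* Aggregate on this level and hand the delta to the parent */
	g->state += delta;
	if (g->parent)
		g->parent->pending += delta;
}

/* For each CPU, flush children before parents along one chain. */
static void flush_chain(struct group *leaf)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		for (struct group *g = leaf; g != NULL; g = g->parent)
			flush_one(g, cpu);
}
```

With a three-level chain (leaf, mid, root), adding 5 on CPU 0 and 3 on CPU 1 in the leaf and 2 on CPU 0 in mid, a single flush_chain(&leaf) leaves leaf.state at 8 and both mid.state and root.state at 10. The write side never walks the ancestors; all upward propagation happens at flush time.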
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat
From: Johannes Weiner @ 2021-02-04 16:26 UTC
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups,
      linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 05:47:26PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 01:47:45PM -0500, Johannes Weiner wrote:
> >  	for_each_node(node) {
> >  		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
> > +		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, };
>                                                               ^^
> I'd drop the comma here. It seems that "{0}" version is way more popular
> over the mm code and in the kernel in general.

Is there a downside to the comma?

I'm finding more { 0, } than { 0 } in mm code, and at least
kernel-wide it seems both are acceptable (although { 0 } is more
popular overall).

I don't care much either way. I can change it in v2 if there is one.
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat @ 2021-02-04 18:45 ` Roman Gushchin 0 siblings, 0 replies; 82+ messages in thread From: Roman Gushchin @ 2021-02-04 18:45 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team On Thu, Feb 04, 2021 at 11:26:32AM -0500, Johannes Weiner wrote: > On Tue, Feb 02, 2021 at 05:47:26PM -0800, Roman Gushchin wrote: > > On Tue, Feb 02, 2021 at 01:47:45PM -0500, Johannes Weiner wrote: > > > for_each_node(node) { > > > struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > > > + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, }; > > ^^ > > I'd drop the comma here. It seems that "{0}" version is way more popular > > over the mm code and in the kernel in general. > > Is there a downside to the comma? I'm finding more { 0, } than { 0 } > in mm code, and at least kernel-wide it seems both are acceptable > (although { 0 } is more popular overall). { 0 } is more obvious and saves a character. The "problem" with comma version is that { 1, } and { 0, } have a different meaning. 
It seems like 13 (no comma) vs 11 (comma) in the mm code: [guro@carbon mm]$ pwd /home/guro/linux/mm [guro@carbon mm]$ ag --nofilename "\{0\}" DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; return (swp_entry_t) {0}; unsigned long stat[MEMCG_NR_STAT] = {0}; swp_entry_t entry = (swp_entry_t){0}; [guro@carbon mm]$ ag --nofilename "\{ 0 \}" struct cleancache_filekey key = { .u.key = { 0 } }; struct cleancache_filekey key = { .u.key = { 0 } }; struct cleancache_filekey key = { .u.key = { 0 } }; struct cleancache_filekey key = { .u.key = { 0 } }; unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 }; DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 }; [guro@carbon mm]$ ag --nofilename "\{ 0, \}" int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; unsigned long count[MIGRATE_TYPES] = { 0, }; struct memory_failure_entry entry = { 0, }; unsigned long nr_skipped[MAX_NR_ZONES] = { 0, }; unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; unsigned long count[MIGRATE_TYPES] = { 0, }; > > I don't care much either way. I can change it in v2 if there is one. Sure, of course it's not worth a separate version. Thanks! ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat @ 2021-02-04 20:05 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-04 20:05 UTC (permalink / raw) To: Roman Gushchin Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team On Thu, Feb 04, 2021 at 10:45:20AM -0800, Roman Gushchin wrote: > On Thu, Feb 04, 2021 at 11:26:32AM -0500, Johannes Weiner wrote: > > On Tue, Feb 02, 2021 at 05:47:26PM -0800, Roman Gushchin wrote: > > > On Tue, Feb 02, 2021 at 01:47:45PM -0500, Johannes Weiner wrote: > > > > for_each_node(node) { > > > > struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > > > > + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, }; > > > ^^ > > > I'd drop the comma here. It seems that "{0}" version is way more popular > > > over the mm code and in the kernel in general. > > > > Is there a downside to the comma? I'm finding more { 0, } than { 0 } > > in mm code, and at least kernel-wide it seems both are acceptable > > (although { 0 } is more popular overall). > > { 0 } is more obvious and saves a character. The comma signals that the author is aware that the array or structure has more elements than specified, and that they expect the rest to be zeroed. We use it extensively to initialize structures (like struct cgroup_subsys inits, cftypes, struct address_space_operations, etc.) So I'd say "more obvious" is subjective. I find the comma version a bit more obvious. > The "problem" with comma version is that { 1, } and { 0, } have a > different meaning. ...which is? They both mean set the first element to x and zerofill the rest, no? Again, I don't really care too much either way, I'm just wondering if I'm missing something bigger here. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat @ 2021-02-04 14:19 ` Michal Hocko 0 siblings, 0 replies; 82+ messages in thread From: Michal Hocko @ 2021-02-04 14:19 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Tue 02-02-21 13:47:45, Johannes Weiner wrote: > Replace the memory controller's custom hierarchical stats code with > the generic rstat infrastructure provided by the cgroup core. > > The current implementation does batched upward propagation from the > write side (i.e. as stats change). The per-cpu batches introduce an > error, which is multiplied by the number of subgroups in a tree. In > systems with many CPUs and sizable cgroup trees, the error can be > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 > subgroups results in an error of up to 128M per stat item). This can > entirely swallow allocation bursts inside a workload that the user is > expecting to see reflected in the statistics. > > In the past, we've done read-side aggregation, where a memory.stat > read would have to walk the entire subtree and add up per-cpu > counts. This became problematic with lazily-freed cgroups: we could > have large subtrees where most cgroups were entirely idle. Hence the > switch to change-driven upward propagation. Unfortunately, it needed > to trade accuracy for speed due to the write side being so hot. > > Rstat combines the best of both worlds: from the write side, it > cheaply maintains a queue of cgroups that have pending changes, so > that the read side can do selective tree aggregation. This way the > reported stats will always be precise and recent as can be, while the > aggregation can skip over potentially large numbers of idle cgroups. > > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward > aggregation. It removes 3 words from the per-cpu data. 
It eliminates > memcg_exact_page_state(), since memcg_page_state() is now exact. I am still digesting details and need to look deeper into how rstat works but removing our own stats is definitely a good plan. Especially when there are existing limitations and problems that would need fixing. Just to check that my high level understanding is correct. The transition is effectively removing a need to manually sync counters up the hierarchy and partially outsources that decision to rstat core. The controller is responsible just to tell the core how that syncing is done (e.g. which specific counters etc). Explicit flushes are needed when you want an exact value (e.g. when values are presented to the userspace). I do not see any flushes to be done by the core pro-actively except for clean up on a release. Is the above correct understanding? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat 2021-02-04 14:19 ` Michal Hocko (?) @ 2021-02-04 16:15 ` Johannes Weiner 2021-02-04 16:44 ` Michal Hocko -1 siblings, 1 reply; 82+ messages in thread From: Johannes Weiner @ 2021-02-04 16:15 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team Hello Michal, On Thu, Feb 04, 2021 at 03:19:17PM +0100, Michal Hocko wrote: > On Tue 02-02-21 13:47:45, Johannes Weiner wrote: > > Replace the memory controller's custom hierarchical stats code with > > the generic rstat infrastructure provided by the cgroup core. > > > > The current implementation does batched upward propagation from the > > write side (i.e. as stats change). The per-cpu batches introduce an > > error, which is multiplied by the number of subgroups in a tree. In > > systems with many CPUs and sizable cgroup trees, the error can be > > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 > > subgroups results in an error of up to 128M per stat item). This can > > entirely swallow allocation bursts inside a workload that the user is > > expecting to see reflected in the statistics. > > > > In the past, we've done read-side aggregation, where a memory.stat > > read would have to walk the entire subtree and add up per-cpu > > counts. This became problematic with lazily-freed cgroups: we could > > have large subtrees where most cgroups were entirely idle. Hence the > > switch to change-driven upward propagation. Unfortunately, it needed > > to trade accuracy for speed due to the write side being so hot. > > > > Rstat combines the best of both worlds: from the write side, it > > cheaply maintains a queue of cgroups that have pending changes, so > > that the read side can do selective tree aggregation. This way the > > reported stats will always be precise and recent as can be, while the > > aggregation can skip over potentially large numbers of idle cgroups. 
> > > > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + > > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward > > aggregation. It removes 3 words from the per-cpu data. It eliminates > > memcg_exact_page_state(), since memcg_page_state() is now exact. > > I am still digesting details and need to look deeper into how rstat > works but removing our own stats is definitely a good plan. Especially > when there are existing limitations and problems that would need fixing. > > Just to check that my high level understanding is correct. The > transition is effectively removing a need to manually sync counters up > the hierarchy and partially outsources that decision to rstat core. The > controller is responsible just to tell the core how that syncing is done > (e.g. which specific counters etc). Yes, exactly. rstat implements a tree of cgroups that have local changes pending, and a flush walk on that tree. But it's all driven by the controller. memcg needs to tell rstat 1) when stats in a local cgroup change e.g. when we do mod_memcg_state() (cgroup_rstat_updated), 2) when to flush, e.g. before a memory.stat read (cgroup_rstat_flush), and 3) how to flush one cgroup's per-cpu state and propagate it upward to the parent during rstat's flush walk (.css_rstat_flush). > Explicit flushes are needed when you want an exact value (e.g. when > values are presented to the userspace). I do not see any flushes to > be done by the core pro-actively except for clean up on a release. > > Is the above correct understanding? Yes, that's correct. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat @ 2021-02-04 16:44 ` Michal Hocko 0 siblings, 0 replies; 82+ messages in thread From: Michal Hocko @ 2021-02-04 16:44 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Thu 04-02-21 11:15:06, Johannes Weiner wrote: > Hello Michal, > > On Thu, Feb 04, 2021 at 03:19:17PM +0100, Michal Hocko wrote: > > On Tue 02-02-21 13:47:45, Johannes Weiner wrote: > > > Replace the memory controller's custom hierarchical stats code with > > > the generic rstat infrastructure provided by the cgroup core. > > > > > > The current implementation does batched upward propagation from the > > > write side (i.e. as stats change). The per-cpu batches introduce an > > > error, which is multiplied by the number of subgroups in a tree. In > > > systems with many CPUs and sizable cgroup trees, the error can be > > > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 > > > subgroups results in an error of up to 128M per stat item). This can > > > entirely swallow allocation bursts inside a workload that the user is > > > expecting to see reflected in the statistics. > > > > > > In the past, we've done read-side aggregation, where a memory.stat > > > read would have to walk the entire subtree and add up per-cpu > > > counts. This became problematic with lazily-freed cgroups: we could > > > have large subtrees where most cgroups were entirely idle. Hence the > > > switch to change-driven upward propagation. Unfortunately, it needed > > > to trade accuracy for speed due to the write side being so hot. > > > > > > Rstat combines the best of both worlds: from the write side, it > > > cheaply maintains a queue of cgroups that have pending changes, so > > > that the read side can do selective tree aggregation. 
This way the > > > reported stats will always be precise and recent as can be, while the > > > aggregation can skip over potentially large numbers of idle cgroups. > > > > > > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + > > > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward > > > aggregation. It removes 3 words from the per-cpu data. It eliminates > > > memcg_exact_page_state(), since memcg_page_state() is now exact. > > > > I am still digesting details and need to look deeper into how rstat > > works but removing our own stats is definitely a good plan. Especially > > when there are existing limitations and problems that would need fixing. > > > > Just to check that my high level understanding is correct. The > > transition is effectively removing a need to manually sync counters up > > the hierarchy and partially outsources that decision to rstat core. The > > controller is responsible just to tell the core how that syncing is done > > (e.g. which specific counters etc). > > Yes, exactly. > > rstat implements a tree of cgroups that have local changes pending, > and a flush walk on that tree. But it's all driven by the controller. > > memcg needs to tell rstat 1) when stats in a local cgroup change > e.g. when we do mod_memcg_state() (cgroup_rstat_updated), 2) when to > flush, e.g. before a memory.stat read (cgroup_rstat_flush), and 3) how > to flush one cgroup's per-cpu state and propagate it upward to the > parent during rstat's flush walk (.css_rstat_flush). Can we have this short summary in a changelog please? > > Explicit flushes are needed when you want an exact value (e.g. when > > values are presented to the userspace). I do not see any flushes to > > be done by the core pro-actively except for clean up on a release. > > > > Is the above correct understanding? > > Yes, that's correct. OK, thanks for the confirmation. I will have a closer look tomorrow but I do not see any problems now. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat @ 2021-02-04 20:28 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-04 20:28 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Thu, Feb 04, 2021 at 05:44:06PM +0100, Michal Hocko wrote: > On Thu 04-02-21 11:15:06, Johannes Weiner wrote: > > Hello Michal, > > > > On Thu, Feb 04, 2021 at 03:19:17PM +0100, Michal Hocko wrote: > > > On Tue 02-02-21 13:47:45, Johannes Weiner wrote: > > > > Replace the memory controller's custom hierarchical stats code with > > > > the generic rstat infrastructure provided by the cgroup core. > > > > > > > > The current implementation does batched upward propagation from the > > > > write side (i.e. as stats change). The per-cpu batches introduce an > > > > error, which is multiplied by the number of subgroups in a tree. In > > > > systems with many CPUs and sizable cgroup trees, the error can be > > > > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 > > > > subgroups results in an error of up to 128M per stat item). This can > > > > entirely swallow allocation bursts inside a workload that the user is > > > > expecting to see reflected in the statistics. > > > > > > > > In the past, we've done read-side aggregation, where a memory.stat > > > > read would have to walk the entire subtree and add up per-cpu > > > > counts. This became problematic with lazily-freed cgroups: we could > > > > have large subtrees where most cgroups were entirely idle. Hence the > > > > switch to change-driven upward propagation. Unfortunately, it needed > > > > to trade accuracy for speed due to the write side being so hot. > > > > > > > > Rstat combines the best of both worlds: from the write side, it > > > > cheaply maintains a queue of cgroups that have pending changes, so > > > > that the read side can do selective tree aggregation. 
This way the > > > > reported stats will always be precise and recent as can be, while the > > > > aggregation can skip over potentially large numbers of idle cgroups. > > > > > > > > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + > > > > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward > > > > aggregation. It removes 3 words from the per-cpu data. It eliminates > > > > memcg_exact_page_state(), since memcg_page_state() is now exact. > > > > > > I am still digesting details and need to look deeper into how rstat > > > works but removing our own stats is definitely a good plan. Especially > > > when there are existing limitations and problems that would need fixing. > > > > > > Just to check that my high level understanding is correct. The > > > transition is effectively removing a need to manually sync counters up > > > the hierarchy and partially outsources that decision to rstat core. The > > > controller is responsible just to tell the core how that syncing is done > > > (e.g. which specific counters etc). > > > > Yes, exactly. > > > > rstat implements a tree of cgroups that have local changes pending, > > and a flush walk on that tree. But it's all driven by the controller. > > > > memcg needs to tell rstat 1) when stats in a local cgroup change > > e.g. when we do mod_memcg_state() (cgroup_rstat_updated), 2) when to > > flush, e.g. before a memory.stat read (cgroup_rstat_flush), and 3) how > > to flush one cgroup's per-cpu state and propagate it upward to the > > parent during rstat's flush walk (.css_rstat_flush). > > Can we have this short summary in a changelog please? Sure thing, I'll include that in v2. > > > Explicit flushes are needed when you want an exact value (e.g. when > > > values are presented to the userspace). I do not see any flushes to > > > be done by the core pro-actively except for clean up on a release. > > > > > > Is the above correct understanding? > > > > Yes, that's correct. 
> > OK, thanks for the confirmation. I will have a closer look tomorrow but > I do not see any problems now. Thanks ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat
@ 2021-02-05 15:05 ` Michal Hocko
From: Michal Hocko @ 2021-02-05 15:05 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue 02-02-21 13:47:45, Johannes Weiner wrote:
> Replace the memory controller's custom hierarchical stats code with
> the generic rstat infrastructure provided by the cgroup core.
>
> The current implementation does batched upward propagation from the
> write side (i.e. as stats change). The per-cpu batches introduce an
> error, which is multiplied by the number of subgroups in a tree. In
> systems with many CPUs and sizable cgroup trees, the error can be
> large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> subgroups results in an error of up to 128M per stat item). This can
> entirely swallow allocation bursts inside a workload that the user is
> expecting to see reflected in the statistics.
>
> In the past, we've done read-side aggregation, where a memory.stat
> read would have to walk the entire subtree and add up per-cpu
> counts. This became problematic with lazily-freed cgroups: we could
> have large subtrees where most cgroups were entirely idle. Hence the
> switch to change-driven upward propagation. Unfortunately, it needed
> to trade accuracy for speed due to the write side being so hot.
>
> Rstat combines the best of both worlds: from the write side, it
> cheaply maintains a queue of cgroups that have pending changes, so
> that the read side can do selective tree aggregation. This way the
> reported stats will always be precise and recent as can be, while the
> aggregation can skip over potentially large numbers of idle cgroups.
>
> This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> aggregation. It removes 3 words from the per-cpu data. It eliminates
> memcg_exact_page_state(), since memcg_page_state() is now exact.

The above confused me a bit. I can see the pcp data size increased by
adding _prev. The resulting memory footprint should be increased by
	sizeof(long) * (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) * (CPUS + 1)
which is roughly 1kB per CPU per memcg unless I have made a mistake.
This is quite a lot and it should be mentioned in the changelog.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The memory overhead is quite large, and since it scales with both memcg
count and CPUs it can grow quite a bit, but I do not think this is
prohibitive. It would be really nice if this could be optimized in the
future, though. All that being said, the code looks more manageable now.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  67 ++++++-----
>  mm/memcontrol.c            | 224 +++++++++++++++----------------------
>  2 files changed, 133 insertions(+), 158 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20ecdfae3289..a8c7a0ccc759 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,10 +76,27 @@ enum mem_cgroup_events_target {
>  };
>
>  struct memcg_vmstats_percpu {
> -	long stat[MEMCG_NR_STAT];
> -	unsigned long events[NR_VM_EVENT_ITEMS];
> -	unsigned long nr_page_events;
> -	unsigned long targets[MEM_CGROUP_NTARGETS];
> +	/* Local (CPU and cgroup) page state & events */
> +	long state[MEMCG_NR_STAT];
> +	unsigned long events[NR_VM_EVENT_ITEMS];
> +
> +	/* Delta calculation for lockless upward propagation */
> +	long state_prev[MEMCG_NR_STAT];
> +	unsigned long events_prev[NR_VM_EVENT_ITEMS];
> +
> +	/* Cgroup1: threshold notifications & softlimit tree updates */
> +	unsigned long nr_page_events;
> +	unsigned long targets[MEM_CGROUP_NTARGETS];
> +};
> +
> +struct memcg_vmstats {
> +	/* Aggregated (CPU and subtree) page state & events */
> +	long state[MEMCG_NR_STAT];
> +	unsigned long events[NR_VM_EVENT_ITEMS];
> +
> +	/* Pending child counts during tree propagation */
> +	long state_pending[MEMCG_NR_STAT];
> +	unsigned long events_pending[NR_VM_EVENT_ITEMS];
>  };
>
>  struct mem_cgroup_reclaim_iter {
> @@ -287,8 +304,8 @@ struct mem_cgroup {
>
>  	MEMCG_PADDING(_pad1_);
>
> -	atomic_long_t vmstats[MEMCG_NR_STAT];
> -	atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
> +	/* memory.stat */
> +	struct memcg_vmstats vmstats;
>
>  	/* memory.events */
>  	atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
> @@ -315,10 +332,6 @@ struct mem_cgroup {
>  	atomic_t moving_account;
>  	struct task_struct *move_lock_task;
>
> -	/* Legacy local VM stats and events */
> -	struct memcg_vmstats_percpu __percpu *vmstats_local;
> -
> -	/* Subtree VM stats and events (batched updates) */
>  	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
>  #ifdef CONFIG_CGROUP_WRITEBACK
> @@ -942,10 +955,6 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
>  	local_irq_restore(flags);
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned);
> -
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  			  unsigned long count);
>
> @@ -1028,6 +1037,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  void mem_cgroup_split_huge_fixup(struct page *head);
>  #endif
>
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					    gfp_t gfp_mask,
> +					    unsigned long *total_scanned);
> +
>  #else /* CONFIG_MEMCG */
>
>  #define MEM_CGROUP_ID_SHIFT 0
> @@ -1136,6 +1149,10 @@ static inline bool lruvec_holds_page_lru_lock(struct page *page,
>  	return lruvec == &pgdat->__lruvec;
>  }
>
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +
>  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
>  	return NULL;
> @@ -1349,18 +1366,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
>  	mod_node_page_state(page_pgdat(page), idx, val);
>  }
>
> -static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned)
> -{
> -	return 0;
> -}
> -
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> -{
> -}
> -
>  static inline void count_memcg_events(struct mem_cgroup *memcg,
>  				      enum vm_event_item idx,
>  				      unsigned long count)
> @@ -1383,8 +1388,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
>
> -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +{
> +}
> +
> +static inline
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					    gfp_t gfp_mask,
> +					    unsigned long *total_scanned)
>  {
> +	return 0;
>  }
>  #endif /* CONFIG_MEMCG */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2f97cb4cef6d..b205b2413186 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -757,6 +757,11 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  	return mz;
>  }
>
> +static void memcg_flush_vmstats(struct mem_cgroup *memcg)
> +{
> +	cgroup_rstat_flush(memcg->css.cgroup);
> +}
> +
>  /**
>   * __mod_memcg_state - update cgroup memory statistics
>   * @memcg: the memory cgroup
> @@ -765,37 +770,17 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  {
> -	long x, threshold = MEMCG_CHARGE_BATCH;
> -
>  	if (mem_cgroup_disabled())
>  		return;
>
> -	if (memcg_stat_item_in_bytes(idx))
> -		threshold <<= PAGE_SHIFT;
> -
> -	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> -	if (unlikely(abs(x) > threshold)) {
> -		struct mem_cgroup *mi;
> -
> -		/*
> -		 * Batch local counters to keep them in sync with
> -		 * the hierarchical ones.
> -		 */
> -		__this_cpu_add(memcg->vmstats_local->stat[idx], x);
> -		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -			atomic_long_add(x, &mi->vmstats[idx]);
> -		x = 0;
> -	}
> -	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
> +	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  }
>
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page_state().
> - */
> +/* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>  {
> -	long x = atomic_long_read(&memcg->vmstats[idx]);
> +	long x = READ_ONCE(memcg->vmstats.state[idx]);
>  #ifdef CONFIG_SMP
>  	if (x < 0)
>  		x = 0;
> @@ -803,17 +788,14 @@ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>  	return x;
>  }
>
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page_state().
> - */
> +/* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
>  	long x = 0;
>  	int cpu;
>
>  	for_each_possible_cpu(cpu)
> -		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
> +		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
>  #ifdef CONFIG_SMP
>  	if (x < 0)
>  		x = 0;
> @@ -936,30 +918,16 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  			  unsigned long count)
>  {
> -	unsigned long x;
> -
>  	if (mem_cgroup_disabled())
>  		return;
>
> -	x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> -	if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> -		struct mem_cgroup *mi;
> -
> -		/*
> -		 * Batch local counters to keep them in sync with
> -		 * the hierarchical ones.
> -		 */
> -		__this_cpu_add(memcg->vmstats_local->events[idx], x);
> -		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -			atomic_long_add(x, &mi->vmevents[idx]);
> -		x = 0;
> -	}
> -	__this_cpu_write(memcg->vmstats_percpu->events[idx], x);
> +	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  }
>
>  static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>  {
> -	return atomic_long_read(&memcg->vmevents[event]);
> +	return READ_ONCE(memcg->vmstats.events[event]);
>  }
>
>  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> @@ -968,7 +936,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  	int cpu;
>
>  	for_each_possible_cpu(cpu)
> -		x += per_cpu(memcg->vmstats_local->events[event], cpu);
> +		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
>  	return x;
>  }
>
> @@ -1631,6 +1599,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	 *
>  	 * Current memory state:
>  	 */
> +	memcg_flush_vmstats(memcg);
>
>  	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>  		u64 size;
> @@ -2450,22 +2419,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  	drain_stock(stock);
>
>  	for_each_mem_cgroup(memcg) {
> -		struct memcg_vmstats_percpu *statc;
>  		int i;
>
> -		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> -
> -		for (i = 0; i < MEMCG_NR_STAT; i++) {
> +		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>  			int nid;
>
> -			if (statc->stat[i]) {
> -				mod_memcg_state(memcg, i, statc->stat[i]);
> -				statc->stat[i] = 0;
> -			}
> -
> -			if (i >= NR_VM_NODE_STAT_ITEMS)
> -				continue;
> -
>  			for_each_node(nid) {
>  				struct batched_lruvec_stat *lstatc;
>  				struct mem_cgroup_per_node *pn;
> @@ -2484,13 +2442,6 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  				}
>  			}
>  		}
> -
> -		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> -			if (statc->events[i]) {
> -				count_memcg_events(memcg, i, statc->events[i]);
> -				statc->events[i] = 0;
> -			}
> -		}
>  	}
>
>  	return 0;
> @@ -3618,6 +3569,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  {
>  	unsigned long val;
>
> +	memcg_flush_vmstats(memcg);
> +
>  	if (mem_cgroup_is_root(memcg)) {
>  		val = memcg_page_state(memcg, NR_FILE_PAGES) +
>  			memcg_page_state(memcg, NR_ANON_MAPPED);
> @@ -3683,26 +3636,15 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  	}
>  }
>
> -static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
> +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
>  {
> -	unsigned long stat[MEMCG_NR_STAT] = {0};
> -	struct mem_cgroup *mi;
> -	int node, cpu, i;
> -
> -	for_each_online_cpu(cpu)
> -		for (i = 0; i < MEMCG_NR_STAT; i++)
> -			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
> -
> -	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -		for (i = 0; i < MEMCG_NR_STAT; i++)
> -			atomic_long_add(stat[i], &mi->vmstats[i]);
> +	int node;
>
>  	for_each_node(node) {
>  		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
> +		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, };
>  		struct mem_cgroup_per_node *pi;
> -
> -		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -			stat[i] = 0;
> +		int cpu, i;
>
>  		for_each_online_cpu(cpu)
>  			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> @@ -3715,25 +3657,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
>  	}
>  }
>
> -static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
> -{
> -	unsigned long events[NR_VM_EVENT_ITEMS];
> -	struct mem_cgroup *mi;
> -	int cpu, i;
> -
> -	for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
> -		events[i] = 0;
> -
> -	for_each_online_cpu(cpu)
> -		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
> -			events[i] += per_cpu(memcg->vmstats_percpu->events[i],
> -					     cpu);
> -
> -	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
> -			atomic_long_add(events[i], &mi->vmevents[i]);
> -}
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> @@ -4050,6 +3973,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>  	int nid;
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>
> +	memcg_flush_vmstats(memcg);
> +
>  	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
>  		seq_printf(m, "%s=%lu", stat->name,
>  			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
> @@ -4120,6 +4045,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>
>  	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
>
> +	memcg_flush_vmstats(memcg);
> +
>  	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
>  		unsigned long nr;
>
> @@ -4596,22 +4523,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
>  	return &memcg->cgwb_domain;
>  }
>
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page().
> - */
> -static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> -{
> -	long x = atomic_long_read(&memcg->vmstats[idx]);
> -	int cpu;
> -
> -	for_each_online_cpu(cpu)
> -		x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx];
> -	if (x < 0)
> -		x = 0;
> -	return x;
> -}
> -
>  /**
>   * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
>   * @wb: bdi_writeback in question
> @@ -4637,13 +4548,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>  	struct mem_cgroup *parent;
>
> -	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
> +	memcg_flush_vmstats(memcg);
>
> -	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
> -	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
> -		memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
> -	*pheadroom = PAGE_COUNTER_MAX;
> +	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
> +	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> +	*pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) +
> +		memcg_page_state(memcg, NR_ACTIVE_FILE);
>
> +	*pheadroom = PAGE_COUNTER_MAX;
>  	while ((parent = parent_mem_cgroup(memcg))) {
>  		unsigned long ceiling = min(READ_ONCE(memcg->memory.max),
>  					    READ_ONCE(memcg->memory.high));
> @@ -5275,7 +5187,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  	for_each_node(node)
>  		free_mem_cgroup_per_node_info(memcg, node);
>  	free_percpu(memcg->vmstats_percpu);
> -	free_percpu(memcg->vmstats_local);
>  	kfree(memcg);
>  }
>
> @@ -5283,11 +5194,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
>  	memcg_wb_domain_exit(memcg);
>  	/*
> -	 * Flush percpu vmstats and vmevents to guarantee the value correctness
> -	 * on parent's and all ancestor levels.
> +	 * Flush percpu lruvec stats to guarantee the value
> +	 * correctness on parent's and all ancestor levels.
>  	 */
> -	memcg_flush_percpu_vmstats(memcg);
> -	memcg_flush_percpu_vmevents(memcg);
> +	memcg_flush_lruvec_page_state(memcg);
>  	__mem_cgroup_free(memcg);
>  }
>
> @@ -5314,11 +5224,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
>
> -	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> -						GFP_KERNEL_ACCOUNT);
> -	if (!memcg->vmstats_local)
> -		goto fail;
> -
>  	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
>  						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
> @@ -5518,6 +5423,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	memcg_wb_domain_size_changed(memcg);
>  }
>
> +static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +	struct memcg_vmstats_percpu *statc;
> +	long delta, v;
> +	int i;
> +
> +	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> +
> +	for (i = 0; i < MEMCG_NR_STAT; i++) {
> +		/*
> +		 * Collect the aggregated propagation counts of groups
> +		 * below us. We're in a per-cpu loop here and this is
> +		 * a global counter, so the first cycle will get them.
> +		 */
> +		delta = memcg->vmstats.state_pending[i];
> +		if (delta)
> +			memcg->vmstats.state_pending[i] = 0;
> +
> +		/* Add CPU changes on this level since the last flush */
> +		v = READ_ONCE(statc->state[i]);
> +		if (v != statc->state_prev[i]) {
> +			delta += v - statc->state_prev[i];
> +			statc->state_prev[i] = v;
> +		}
> +
> +		if (!delta)
> +			continue;
> +
> +		/* Aggregate counts on this level and propagate upwards */
> +		memcg->vmstats.state[i] += delta;
> +		if (parent)
> +			parent->vmstats.state_pending[i] += delta;
> +	}
> +
> +	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> +		delta = memcg->vmstats.events_pending[i];
> +		if (delta)
> +			memcg->vmstats.events_pending[i] = 0;
> +
> +		v = READ_ONCE(statc->events[i]);
> +		if (v != statc->events_prev[i]) {
> +			delta += v - statc->events_prev[i];
> +			statc->events_prev[i] = v;
> +		}
> +
> +		if (!delta)
> +			continue;
> +
> +		memcg->vmstats.events[i] += delta;
> +		if (parent)
> +			parent->vmstats.events_pending[i] += delta;
> +	}
> +}
> +
>  #ifdef CONFIG_MMU
>  /* Handlers for move charge at task migration. */
>  static int mem_cgroup_do_precharge(unsigned long count)
> @@ -6571,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.css_released = mem_cgroup_css_released,
>  	.css_free = mem_cgroup_css_free,
>  	.css_reset = mem_cgroup_css_reset,
> +	.css_rstat_flush = mem_cgroup_css_rstat_flush,
>  	.can_attach = mem_cgroup_can_attach,
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.post_attach = mem_cgroup_move_task,
> --
> 2.30.0

--
Michal Hocko
SUSE Labs
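The two back-of-the-envelope figures in this message, the 128M worst-case batching error from the changelog and the roughly 1kB per CPU per memcg footprint estimate, can be checked with a few lines of C. This is a minimal sketch: the function names are hypothetical, and since the real values of MEMCG_NR_STAT and NR_VM_EVENT_ITEMS depend on kernel version and configuration, the item counts are left as parameters rather than filled in.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Worst-case drift of one stat item under the old write-side batching:
 * every CPU of every group in the subtree may hold back one full batch.
 */
static unsigned long batching_error_bytes(unsigned long batch_pages,
					  unsigned long nr_cpus,
					  unsigned long nr_groups,
					  unsigned long page_size)
{
	return batch_pages * nr_cpus * nr_groups * page_size;
}

/*
 * Extra memory per memcg as estimated in the review:
 * sizeof(long) * (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) * (CPUS + 1).
 * The counts are passed in because they are configuration dependent.
 */
static size_t rstat_extra_bytes(size_t nr_stat, size_t nr_events,
				size_t nr_cpus)
{
	assert(nr_cpus > 0);
	return sizeof(long) * (nr_stat + nr_events) * (nr_cpus + 1);
}
```

Assuming 4 KiB pages, `batching_error_bytes(32, 32, 32, 4096)` gives 134217728 bytes, i.e. the 128M quoted in the changelog. For the footprint, any pair of item counts summing to 128 (a placeholder, not the actual kernel values) yields 1024 bytes per CPU where `long` is 8 bytes, which is the order of magnitude of the estimate above.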
> + int cpu, i; > > for_each_online_cpu(cpu) > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > @@ -3715,25 +3657,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) > } > } > > -static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg) > -{ > - unsigned long events[NR_VM_EVENT_ITEMS]; > - struct mem_cgroup *mi; > - int cpu, i; > - > - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) > - events[i] = 0; > - > - for_each_online_cpu(cpu) > - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) > - events[i] += per_cpu(memcg->vmstats_percpu->events[i], > - cpu); > - > - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) > - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) > - atomic_long_add(events[i], &mi->vmevents[i]); > -} > - > #ifdef CONFIG_MEMCG_KMEM > static int memcg_online_kmem(struct mem_cgroup *memcg) > { > @@ -4050,6 +3973,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) > int nid; > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > + memcg_flush_vmstats(memcg); > + > for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { > seq_printf(m, "%s=%lu", stat->name, > mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, > @@ -4120,6 +4045,8 @@ static int memcg_stat_show(struct seq_file *m, void *v) > > BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); > > + memcg_flush_vmstats(memcg); > + > for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { > unsigned long nr; > > @@ -4596,22 +4523,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) > return &memcg->cgwb_domain; > } > > -/* > - * idx can be of type enum memcg_stat_item or node_stat_item. > - * Keep in sync with memcg_exact_page(). 
> - */ > -static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx) > -{ > - long x = atomic_long_read(&memcg->vmstats[idx]); > - int cpu; > - > - for_each_online_cpu(cpu) > - x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx]; > - if (x < 0) > - x = 0; > - return x; > -} > - > /** > * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg > * @wb: bdi_writeback in question > @@ -4637,13 +4548,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); > struct mem_cgroup *parent; > > - *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); > + memcg_flush_vmstats(memcg); > > - *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); > - *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + > - memcg_exact_page_state(memcg, NR_ACTIVE_FILE); > - *pheadroom = PAGE_COUNTER_MAX; > + *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); > + *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); > + *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) + > + memcg_page_state(memcg, NR_ACTIVE_FILE); > > + *pheadroom = PAGE_COUNTER_MAX; > while ((parent = parent_mem_cgroup(memcg))) { > unsigned long ceiling = min(READ_ONCE(memcg->memory.max), > READ_ONCE(memcg->memory.high)); > @@ -5275,7 +5187,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) > for_each_node(node) > free_mem_cgroup_per_node_info(memcg, node); > free_percpu(memcg->vmstats_percpu); > - free_percpu(memcg->vmstats_local); > kfree(memcg); > } > > @@ -5283,11 +5194,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) > { > memcg_wb_domain_exit(memcg); > /* > - * Flush percpu vmstats and vmevents to guarantee the value correctness > - * on parent's and all ancestor levels. > + * Flush percpu lruvec stats to guarantee the value > + * correctness on parent's and all ancestor levels. 
> */ > - memcg_flush_percpu_vmstats(memcg); > - memcg_flush_percpu_vmevents(memcg); > + memcg_flush_lruvec_page_state(memcg); > __mem_cgroup_free(memcg); > } > > @@ -5314,11 +5224,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void) > goto fail; > } > > - memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, > - GFP_KERNEL_ACCOUNT); > - if (!memcg->vmstats_local) > - goto fail; > - > memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, > GFP_KERNEL_ACCOUNT); > if (!memcg->vmstats_percpu) > @@ -5518,6 +5423,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) > memcg_wb_domain_size_changed(memcg); > } > > +static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(css); > + struct mem_cgroup *parent = parent_mem_cgroup(memcg); > + struct memcg_vmstats_percpu *statc; > + long delta, v; > + int i; > + > + statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); > + > + for (i = 0; i < MEMCG_NR_STAT; i++) { > + /* > + * Collect the aggregated propagation counts of groups > + * below us. We're in a per-cpu loop here and this is > + * a global counter, so the first cycle will get them. 
> + */ > + delta = memcg->vmstats.state_pending[i]; > + if (delta) > + memcg->vmstats.state_pending[i] = 0; > + > + /* Add CPU changes on this level since the last flush */ > + v = READ_ONCE(statc->state[i]); > + if (v != statc->state_prev[i]) { > + delta += v - statc->state_prev[i]; > + statc->state_prev[i] = v; > + } > + > + if (!delta) > + continue; > + > + /* Aggregate counts on this level and propagate upwards */ > + memcg->vmstats.state[i] += delta; > + if (parent) > + parent->vmstats.state_pending[i] += delta; > + } > + > + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { > + delta = memcg->vmstats.events_pending[i]; > + if (delta) > + memcg->vmstats.events_pending[i] = 0; > + > + v = READ_ONCE(statc->events[i]); > + if (v != statc->events_prev[i]) { > + delta += v - statc->events_prev[i]; > + statc->events_prev[i] = v; > + } > + > + if (!delta) > + continue; > + > + memcg->vmstats.events[i] += delta; > + if (parent) > + parent->vmstats.events_pending[i] += delta; > + } > +} > + > #ifdef CONFIG_MMU > /* Handlers for move charge at task migration. */ > static int mem_cgroup_do_precharge(unsigned long count) > @@ -6571,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = { > .css_released = mem_cgroup_css_released, > .css_free = mem_cgroup_css_free, > .css_reset = mem_cgroup_css_reset, > + .css_rstat_flush = mem_cgroup_css_rstat_flush, > .can_attach = mem_cgroup_can_attach, > .cancel_attach = mem_cgroup_cancel_attach, > .post_attach = mem_cgroup_move_task, > -- > 2.30.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
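For readers following the flush callback quoted above, here is a minimal userspace sketch of the same delta/pending bookkeeping, reduced to one stat item. It is a model under stated assumptions, not the kernel code: struct group, NR_CPUS_MODEL and the group_*() helpers are hypothetical names, and the real rstat only visits groups that were flagged through cgroup_rstat_updated(), which this model skips by walking the whole chain.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 2	/* hypothetical CPU count for the model */

/* Simplified stand-in for struct mem_cgroup with a single stat item. */
struct group {
	struct group *parent;
	long state;				/* aggregated subtree value */
	long state_pending;			/* deltas pushed up by children */
	long percpu[NR_CPUS_MODEL];		/* write-side per-cpu counter */
	long percpu_prev[NR_CPUS_MODEL];	/* per-cpu value at last flush */
};

/* Write side: only the local per-cpu counter is touched. */
static void group_mod_state(struct group *g, int cpu, long val)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	g->percpu[cpu] += val;
}

/* Read side: fold one CPU's changes into g and queue them for the parent. */
static void group_flush_cpu(struct group *g, int cpu)
{
	long delta, v;

	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);

	/* Collect what the children propagated since the last flush. */
	delta = g->state_pending;
	g->state_pending = 0;

	/* Add this CPU's local changes since the last flush. */
	v = g->percpu[cpu];
	delta += v - g->percpu_prev[cpu];
	g->percpu_prev[cpu] = v;

	if (!delta)
		return;

	/* Aggregate on this level and propagate upwards. */
	g->state += delta;
	if (g->parent)
		g->parent->state_pending += delta;
}

/* Flush a leaf-to-root chain for every CPU, children before parents. */
static void group_flush_chain(struct group *leaf)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		for (struct group *g = leaf; g; g = g->parent)
			group_flush_cpu(g, cpu);
}
```

The write side never walks the hierarchy. All upward propagation happens in the flush, and a level with nothing pending and no per-cpu change returns early without touching its parent.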
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat
@ 2021-02-05 16:34 ` Johannes Weiner
  0 siblings, 0 replies; 82+ messages in thread
From: Johannes Weiner @ 2021-02-05 16:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Feb 05, 2021 at 04:05:20PM +0100, Michal Hocko wrote:
> On Tue 02-02-21 13:47:45, Johannes Weiner wrote:
> > Replace the memory controller's custom hierarchical stats code with
> > the generic rstat infrastructure provided by the cgroup core.
> >
> > The current implementation does batched upward propagation from the
> > write side (i.e. as stats change). The per-cpu batches introduce an
> > error, which is multiplied by the number of subgroups in a tree. In
> > systems with many CPUs and sizable cgroup trees, the error can be
> > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> > subgroups results in an error of up to 128M per stat item). This can
> > entirely swallow allocation bursts inside a workload that the user is
> > expecting to see reflected in the statistics.
> >
> > In the past, we've done read-side aggregation, where a memory.stat
> > read would have to walk the entire subtree and add up per-cpu
> > counts. This became problematic with lazily-freed cgroups: we could
> > have large subtrees where most cgroups were entirely idle. Hence the
> > switch to change-driven upward propagation. Unfortunately, it needed
> > to trade accuracy for speed due to the write side being so hot.
> >
> > Rstat combines the best of both worlds: from the write side, it
> > cheaply maintains a queue of cgroups that have pending changes, so
> > that the read side can do selective tree aggregation. This way the
> > reported stats will always be precise and recent as can be, while the
> > aggregation can skip over potentially large numbers of idle cgroups.
> >
> > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> > aggregation. It removes 3 words from the per-cpu data. It eliminates
> > memcg_exact_page_state(), since memcg_page_state() is now exact.
>
> The above confused me a bit. I can see the pcp data size increased by
> adding _prev. The resulting memory footprint should be increased by
> sizeof(long) * (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) * (CPUS + 1)
> which is roughly 1kB per CPU per memcg unless I have made any
> mistake. This is a quite a lot and it should be mentioned in the
> changelog.

Not quite, you missed a hunk further below in the patch.

Yes, the _prev arrays are added to the percpu struct. HOWEVER, we used
to have TWO percpu structs in a memcg: one for local data, one for
hierarchical data. In the rstat format, one is enough to capture both:

-	/* Legacy local VM stats and events */
-	struct memcg_vmstats_percpu __percpu *vmstats_local;
-
-	/* Subtree VM stats and events (batched updates) */
 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;

This eliminates dead duplicates of the nr_page_events and
targets[MEM_CGROUP_NTARGETS(2)] we used to carry, which means we have
a net reduction of 3 longs in the percpu data with this series.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Although the memory overhead is quite large and it scales both with
> memcg count and CPUs so it can grow quite a bit I do not think this is
> prohibitive. Although it would be really nice if this could be optimized
> in the future.
>
> All that being said, the code looks more manageable now.
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks

^ permalink raw reply	[flat|nested] 82+ messages in thread
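The worst-case drift figure in the quoted changelog (32 batch pages * 32 CPUs * 32 subgroups, up to 128M per stat item) can be checked with a few lines of arithmetic. This is only an illustrative sketch: batch_error_bytes() is a hypothetical helper, and the 4096-byte page size used to reach 128M is an assumption, since the changelog does not state it.

```c
#include <assert.h>

/*
 * Upper bound, in bytes, on how far a hierarchical counter can lag
 * behind reality when every CPU of every group in the subtree may
 * hold back up to batch_pages pages before propagating upward.
 * page_size is an assumption supplied by the caller.
 */
static unsigned long long batch_error_bytes(unsigned long batch_pages,
					    unsigned long nr_cpus,
					    unsigned long nr_groups,
					    unsigned long page_size)
{
	assert(page_size > 0);
	return (unsigned long long)batch_pages * nr_cpus * nr_groups * page_size;
}
```

With 4 KiB pages, batch_error_bytes(32, 32, 32, 4096) comes out at 134217728 bytes, which is the 128M quoted above. The bound grows linearly in both the CPU count and the number of groups in the subtree.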
* Re: [PATCH 6/7] mm: memcontrol: switch to rstat
@ 2021-02-08 14:07 ` Michal Hocko
  0 siblings, 0 replies; 82+ messages in thread
From: Michal Hocko @ 2021-02-08 14:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

On Fri 05-02-21 11:34:19, Johannes Weiner wrote:
> On Fri, Feb 05, 2021 at 04:05:20PM +0100, Michal Hocko wrote:
> > On Tue 02-02-21 13:47:45, Johannes Weiner wrote:
> > > Replace the memory controller's custom hierarchical stats code with
> > > the generic rstat infrastructure provided by the cgroup core.
> > >
> > > The current implementation does batched upward propagation from the
> > > write side (i.e. as stats change). The per-cpu batches introduce an
> > > error, which is multiplied by the number of subgroups in a tree. In
> > > systems with many CPUs and sizable cgroup trees, the error can be
> > > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> > > subgroups results in an error of up to 128M per stat item). This can
> > > entirely swallow allocation bursts inside a workload that the user is
> > > expecting to see reflected in the statistics.
> > >
> > > In the past, we've done read-side aggregation, where a memory.stat
> > > read would have to walk the entire subtree and add up per-cpu
> > > counts. This became problematic with lazily-freed cgroups: we could
> > > have large subtrees where most cgroups were entirely idle. Hence the
> > > switch to change-driven upward propagation. Unfortunately, it needed
> > > to trade accuracy for speed due to the write side being so hot.
> > >
> > > Rstat combines the best of both worlds: from the write side, it
> > > cheaply maintains a queue of cgroups that have pending changes, so
> > > that the read side can do selective tree aggregation. This way the
> > > reported stats will always be precise and recent as can be, while the
> > > aggregation can skip over potentially large numbers of idle cgroups.
> > >
> > > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> > > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> > > aggregation. It removes 3 words from the per-cpu data. It eliminates
> > > memcg_exact_page_state(), since memcg_page_state() is now exact.
> >
> > The above confused me a bit. I can see the pcp data size increased by
> > adding _prev. The resulting memory footprint should be increased by
> > sizeof(long) * (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) * (CPUS + 1)
> > which is roughly 1kB per CPU per memcg unless I have made any
> > mistake. This is a quite a lot and it should be mentioned in the
> > changelog.
>
> Not quite, you missed a hunk further below in the patch.

You are right.

> Yes, the _prev arrays are added to the percpu struct. HOWEVER, we used
> to have TWO percpu structs in a memcg: one for local data, one for
> hierarchical data. In the rstat format, one is enough to capture both:
>
> -	/* Legacy local VM stats and events */
> -	struct memcg_vmstats_percpu __percpu *vmstats_local;
> -
> -	/* Subtree VM stats and events (batched updates) */
>  	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
> This eliminates dead duplicates of the nr_page_events and
> targets[MEM_CGROUP_NTARGETS(2)] we used to carry, which means we have
> a net reduction of 3 longs in the percpu data with this series.

In the old code we used to have
2*(MEMCG_NR_STAT + NR_VM_EVENT_ITEMS + MEM_CGROUP_NTARGETS)
(2 struct memcg_vmstats_percpu) pcp data plus
MEMCG_NR_STAT + NR_VM_EVENT_ITEMS atomics.

New code has
2*MEMCG_NR_STAT + 2*NR_VM_EVENT_ITEMS + MEM_CGROUP_NTARGETS
in pcp plus
2*MEMCG_NR_STAT + 2*NR_VM_EVENT_ITEMS aggregated counters.

So the resulting diff is
MEMCG_NR_STAT + NR_VM_EVENT_ITEMS - MEM_CGROUP_NTARGETS * nr_cpus
which would be 1024 - 2 * nr_cpus. Which looks better.

Thanks and sorry for misreading the patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread
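The before/after accounting in the message above can be written out as a small calculation. This is a sketch of the formula exactly as stated there, counted in longs. struct memcg_counts and the *_footprint_longs() helpers are hypothetical names, and any concrete values plugged in are placeholders rather than the kernel's real constants.

```c
#include <assert.h>

/* Array sizes entering the estimate; values are supplied by the caller. */
struct memcg_counts {
	long nr_stat;		/* MEMCG_NR_STAT */
	long nr_events;		/* NR_VM_EVENT_ITEMS */
	long nr_targets;	/* MEM_CGROUP_NTARGETS */
};

/* Old layout: two percpu structs per CPU, plus one set of atomics. */
static long old_footprint_longs(const struct memcg_counts *c, long nr_cpus)
{
	long percpu = 2 * (c->nr_stat + c->nr_events + c->nr_targets);
	long global = c->nr_stat + c->nr_events;

	assert(nr_cpus > 0);
	return percpu * nr_cpus + global;
}

/* New layout: one percpu struct with _prev arrays, plus state and pending. */
static long new_footprint_longs(const struct memcg_counts *c, long nr_cpus)
{
	long percpu = 2 * c->nr_stat + 2 * c->nr_events + c->nr_targets;
	long global = 2 * c->nr_stat + 2 * c->nr_events;

	assert(nr_cpus > 0);
	return percpu * nr_cpus + global;
}

/* Positive means the new layout is larger, negative means it is smaller. */
static long footprint_diff_longs(const struct memcg_counts *c, long nr_cpus)
{
	return new_footprint_longs(c, nr_cpus) - old_footprint_longs(c, nr_cpus);
}
```

The difference reduces to nr_stat + nr_events - nr_targets * nr_cpus, which matches the expression in the message. Two caveats: the formula leaves out nr_page_events, which Johannes counts to reach his net reduction of 3 longs per CPU, and the 1024 figure appears to be in bytes while the 2 * nr_cpus term counts longs, so the final number is best read as a rough estimate.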
* [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing
  2021-02-02 18:47 ` Johannes Weiner
                   ` (6 preceding siblings ...)
  (?)
@ 2021-02-02 18:47 ` Johannes Weiner
  2021-02-03 2:25   ` Roman Gushchin
  2021-02-05 15:17   ` Michal Hocko
  -1 siblings, 2 replies; 82+ messages in thread
From: Johannes Weiner @ 2021-02-02 18:47 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

There are two functions to flush the per-cpu data of an lruvec into
the rest of the cgroup tree: when the cgroup is being freed, and when
a CPU disappears during hotplug. The difference is whether all CPUs or
just one is being collected, but the rest of the flushing code is the
same. Merge them into one function and share the common code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 88 +++++++++++++++++++++++--------------------------
 1 file changed, 42 insertions(+), 46 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b205b2413186..88e8afc49a46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2410,39 +2410,56 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	mutex_unlock(&percpu_charge_mutex);
 }
 
-static int memcg_hotplug_cpu_dead(unsigned int cpu)
+static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu)
 {
-	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg;
-
-	stock = &per_cpu(memcg_stock, cpu);
-	drain_stock(stock);
+	int nid;
 
-	for_each_mem_cgroup(memcg) {
+	for_each_node(nid) {
+		struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
+		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0, };
+		struct batched_lruvec_stat *lstatc;
 		int i;
 
-		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
-			int nid;
-
-			for_each_node(nid) {
-				struct batched_lruvec_stat *lstatc;
-				struct mem_cgroup_per_node *pn;
-				long x;
-
-				pn = memcg->nodeinfo[nid];
+		if (cpu == -1) {
+			int cpui;
+			/*
+			 * The memcg is about to be freed, collect all
+			 * CPUs, no need to zero anything out.
+			 */
+			for_each_online_cpu(cpui) {
+				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpui);
+				for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+					stat[i] += lstatc->count[i];
+			}
+		} else {
+			/*
+			 * The CPU has gone away, collect and zero out
+			 * its stats, it may come back later.
+			 */
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
-
-				x = lstatc->count[i];
+				stat[i] = lstatc->count[i];
 				lstatc->count[i] = 0;
-
-				if (x) {
-					do {
-						atomic_long_add(x, &pn->lruvec_stat[i]);
-					} while ((pn = parent_nodeinfo(pn, nid)));
-				}
 			}
 		}
+
+		do {
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+				atomic_long_add(stat[i], &pn->lruvec_stat[i]);
+		} while ((pn = parent_nodeinfo(pn, nid)));
 	}
+}
+
+static int memcg_hotplug_cpu_dead(unsigned int cpu)
+{
+	struct memcg_stock_pcp *stock;
+	struct mem_cgroup *memcg;
+
+	stock = &per_cpu(memcg_stock, cpu);
+	drain_stock(stock);
+
+	for_each_mem_cgroup(memcg)
+		memcg_flush_lruvec_page_state(memcg, cpu);
 
 	return 0;
 }
@@ -3636,27 +3653,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 	}
 }
 
-static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
-{
-	int node;
-
-	for_each_node(node) {
-		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
-		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, };
-		struct mem_cgroup_per_node *pi;
-		int cpu, i;
-
-		for_each_online_cpu(cpu)
-			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-				stat[i] += per_cpu(
-					pn->lruvec_stat_cpu->count[i], cpu);
-
-		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
-			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-				atomic_long_add(stat[i], &pi->lruvec_stat[i]);
-	}
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -5197,7 +5193,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 	 * Flush percpu lruvec stats to guarantee the value
 	 * correctness on parent's and all ancestor levels.
 	 */
-	memcg_flush_lruvec_page_state(memcg);
+	memcg_flush_lruvec_page_state(memcg, -1);
 	__mem_cgroup_free(memcg);
 }
 
-- 
2.30.0

^ permalink raw reply related	[flat|nested] 82+ messages in thread
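The merged flush function in the patch above has two collection modes sharing one propagation loop. A minimal userspace model of that shape might look as follows. It is a sketch under stated assumptions: struct node_stats, the MODEL_* constants and flush_node_stats() are hypothetical stand-ins, plain longs replace the kernel's atomics and per-cpu accessors, and the named MODEL_ALL_CPUS constant is an illustration only, since the patch passes a bare -1.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS	4
#define MODEL_NR_ITEMS	3
#define MODEL_ALL_CPUS	(-1)	/* illustrative name for the -1 sentinel */

/* Stand-in for struct mem_cgroup_per_node. */
struct node_stats {
	struct node_stats *parent;
	long total[MODEL_NR_ITEMS];			/* models lruvec_stat[] */
	long percpu[MODEL_NR_CPUS][MODEL_NR_ITEMS];	/* models lruvec_stat_cpu */
};

static void flush_node_stats(struct node_stats *pn, int cpu)
{
	long stat[MODEL_NR_ITEMS] = { 0, };
	int i;

	assert(pn != NULL);
	assert(cpu == MODEL_ALL_CPUS || (cpu >= 0 && cpu < MODEL_NR_CPUS));

	if (cpu == MODEL_ALL_CPUS) {
		/* Group is going away: sum every CPU, leave counters alone. */
		for (int c = 0; c < MODEL_NR_CPUS; c++)
			for (i = 0; i < MODEL_NR_ITEMS; i++)
				stat[i] += pn->percpu[c][i];
	} else {
		/* One CPU went offline: take its counts and zero them. */
		for (i = 0; i < MODEL_NR_ITEMS; i++) {
			stat[i] = pn->percpu[cpu][i];
			pn->percpu[cpu][i] = 0;
		}
	}

	/* Shared tail: add the collected counts at this level and above. */
	do {
		for (i = 0; i < MODEL_NR_ITEMS; i++)
			pn->total[i] += stat[i];
	} while ((pn = pn->parent));
}
```

The only difference between the two callers is how stat[] is filled. The hotplug path has to zero the per-cpu counters because the CPU may come back online, while the free path can skip that since the memory is about to be released.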
* Re: [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing
@ 2021-02-03 2:25 ` Roman Gushchin
  0 siblings, 0 replies; 82+ messages in thread
From: Roman Gushchin @ 2021-02-03 2:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Feb 02, 2021 at 01:47:46PM -0500, Johannes Weiner wrote:
> There are two functions to flush the per-cpu data of an lruvec into
> the rest of the cgroup tree: when the cgroup is being freed, and when
> a CPU disappears during hotplug. The difference is whether all CPUs or
> just one is being collected, but the rest of the flushing code is the
> same. Merge them into one function and share the common code.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 88 +++++++++++++++++++++++--------------------------
>  1 file changed, 42 insertions(+), 46 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b205b2413186..88e8afc49a46 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2410,39 +2410,56 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  	mutex_unlock(&percpu_charge_mutex);
>  }
>
> -static int memcg_hotplug_cpu_dead(unsigned int cpu)
> +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu)
>  {
> -	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *memcg;
> -
> -	stock = &per_cpu(memcg_stock, cpu);
> -	drain_stock(stock);
> +	int nid;
>
> -	for_each_mem_cgroup(memcg) {
> +	for_each_node(nid) {
> +		struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
> +		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0, };
                                                            ^^^^
                                                            Same here.

> +		struct batched_lruvec_stat *lstatc;
>  		int i;
>
> -		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> -			int nid;
> -
> -			for_each_node(nid) {
> -				struct batched_lruvec_stat *lstatc;
> -				struct mem_cgroup_per_node *pn;
> -				long x;
> -
> -				pn = memcg->nodeinfo[nid];
> +		if (cpu == -1) {
> +			int cpui;
> +			/*
> +			 * The memcg is about to be freed, collect all
> +			 * CPUs, no need to zero anything out.
> +			 */
> +			for_each_online_cpu(cpui) {
> +				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpui);
> +				for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> +					stat[i] += lstatc->count[i];
> +			}
> +		} else {
> +			/*
> +			 * The CPU has gone away, collect and zero out
> +			 * its stats, it may come back later.
> +			 */
> +			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>  				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
> -
> -				x = lstatc->count[i];
> +				stat[i] = lstatc->count[i];
>  				lstatc->count[i] = 0;
> -
> -				if (x) {
> -					do {
> -						atomic_long_add(x, &pn->lruvec_stat[i]);
> -					} while ((pn = parent_nodeinfo(pn, nid)));
> -				}
>  			}
>  		}
> +
> +		do {
> +			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> +				atomic_long_add(stat[i], &pn->lruvec_stat[i]);
> +		} while ((pn = parent_nodeinfo(pn, nid)));
>  	}
> +}
> +
> +static int memcg_hotplug_cpu_dead(unsigned int cpu)
> +{
> +	struct memcg_stock_pcp *stock;
> +	struct mem_cgroup *memcg;
> +
> +	stock = &per_cpu(memcg_stock, cpu);
> +	drain_stock(stock);
> +
> +	for_each_mem_cgroup(memcg)
> +		memcg_flush_lruvec_page_state(memcg, cpu);
>
>  	return 0;
>  }
> @@ -3636,27 +3653,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  	}
>  }
>
> -static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
> -{
> -	int node;
> -
> -	for_each_node(node) {
> -		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
> -		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, };
> -		struct mem_cgroup_per_node *pi;
> -		int cpu, i;
> -
> -		for_each_online_cpu(cpu)
> -			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -				stat[i] += per_cpu(
> -					pn->lruvec_stat_cpu->count[i], cpu);
> -
> -		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
> -			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -				atomic_long_add(stat[i], &pi->lruvec_stat[i]);
> -	}
> -}
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> @@ -5197,7 +5193,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
>  	 * Flush percpu lruvec stats to guarantee the value
>  	 * correctness on parent's and all ancestor levels.
>  	 */
> -	memcg_flush_lruvec_page_state(memcg);
> +	memcg_flush_lruvec_page_state(memcg, -1);

I wonder if adding "cpu" or "percpu" into the function name will make clearer what -1 means?
E.g. memcg_flush_(per)cpu_lruvec_stats(memcg, -1).

Reviewed-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread
* Re: [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing @ 2021-02-04 21:44 ` Johannes Weiner 0 siblings, 0 replies; 82+ messages in thread From: Johannes Weiner @ 2021-02-04 21:44 UTC (permalink / raw) To: Roman Gushchin Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team On Tue, Feb 02, 2021 at 06:25:30PM -0800, Roman Gushchin wrote: > On Tue, Feb 02, 2021 at 01:47:46PM -0500, Johannes Weiner wrote: > > There are two functions to flush the per-cpu data of an lruvec into > > the rest of the cgroup tree: when the cgroup is being freed, and when > > a CPU disappears during hotplug. The difference is whether all CPUs or > > just one is being collected, but the rest of the flushing code is the > > same. Merge them into one function and share the common code. > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > --- > > mm/memcontrol.c | 88 +++++++++++++++++++++++-------------------------- > > 1 file changed, 42 insertions(+), 46 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index b205b2413186..88e8afc49a46 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2410,39 +2410,56 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > > mutex_unlock(&percpu_charge_mutex); > > } > > > > -static int memcg_hotplug_cpu_dead(unsigned int cpu) > > +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) > > { > > - struct memcg_stock_pcp *stock; > > - struct mem_cgroup *memcg; > > - > > - stock = &per_cpu(memcg_stock, cpu); > > - drain_stock(stock); > > + int nid; > > > > - for_each_mem_cgroup(memcg) { > > + for_each_node(nid) { > > + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; > > + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0, }; > ^^^^ > Same here. 
> > > + struct batched_lruvec_stat *lstatc; > > int i; > > > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > > - int nid; > > - > > - for_each_node(nid) { > > - struct batched_lruvec_stat *lstatc; > > - struct mem_cgroup_per_node *pn; > > - long x; > > - > > - pn = memcg->nodeinfo[nid]; > > + if (cpu == -1) { > > + int cpui; > > + /* > > + * The memcg is about to be freed, collect all > > + * CPUs, no need to zero anything out. > > + */ > > + for_each_online_cpu(cpui) { > > + lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpui); > > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > + stat[i] += lstatc->count[i]; > > + } > > + } else { > > + /* > > + * The CPU has gone away, collect and zero out > > + * its stats, it may come back later. > > + */ > > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > > lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); > > - > > - x = lstatc->count[i]; > > + stat[i] = lstatc->count[i]; > > lstatc->count[i] = 0; > > - > > - if (x) { > > - do { > > - atomic_long_add(x, &pn->lruvec_stat[i]); > > - } while ((pn = parent_nodeinfo(pn, nid))); > > - } > > } > > } > > + > > + do { > > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > + atomic_long_add(stat[i], &pn->lruvec_stat[i]); > > + } while ((pn = parent_nodeinfo(pn, nid))); > > } > > +} > > + > > +static int memcg_hotplug_cpu_dead(unsigned int cpu) > > +{ > > + struct memcg_stock_pcp *stock; > > + struct mem_cgroup *memcg; > > + > > + stock = &per_cpu(memcg_stock, cpu); > > + drain_stock(stock); > > + > > + for_each_mem_cgroup(memcg) > > + memcg_flush_lruvec_page_state(memcg, cpu); > > > > return 0; > > } > > @@ -3636,27 +3653,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > > } > > } > > > > -static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg) > > -{ > > - int node; > > - > > - for_each_node(node) { > > - struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > > - unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, }; > > - struct mem_cgroup_per_node 
*pi; > > - int cpu, i; > > - > > - for_each_online_cpu(cpu) > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > - stat[i] += per_cpu( > > - pn->lruvec_stat_cpu->count[i], cpu); > > - > > - for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > - atomic_long_add(stat[i], &pi->lruvec_stat[i]); > > - } > > -} > > - > > #ifdef CONFIG_MEMCG_KMEM > > static int memcg_online_kmem(struct mem_cgroup *memcg) > > { > > @@ -5197,7 +5193,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) > > * Flush percpu lruvec stats to guarantee the value > > * correctness on parent's and all ancestor levels. > > */ > > - memcg_flush_lruvec_page_state(memcg); > > + memcg_flush_lruvec_page_state(memcg, -1); > > I wonder if adding "cpu" or "percpu" into the function name will make clearer what -1 means? > E.g. memcg_flush_(per)cpu_lruvec_stats(memcg, -1). Yes, it's a bit ominous. I changed it to memcg_flush_lruvec_page_state_cpu(memcg, -1); percpu would have pushed the function signature over 80 characters. > Reviewed-by: Roman Gushchin <guro@fb.com> Thanks ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing @ 2021-02-04 21:47 ` Roman Gushchin 0 siblings, 0 replies; 82+ messages in thread From: Roman Gushchin @ 2021-02-04 21:47 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team On Thu, Feb 04, 2021 at 04:44:27PM -0500, Johannes Weiner wrote: > On Tue, Feb 02, 2021 at 06:25:30PM -0800, Roman Gushchin wrote: > > On Tue, Feb 02, 2021 at 01:47:46PM -0500, Johannes Weiner wrote: > > > There are two functions to flush the per-cpu data of an lruvec into > > > the rest of the cgroup tree: when the cgroup is being freed, and when > > > a CPU disappears during hotplug. The difference is whether all CPUs or > > > just one is being collected, but the rest of the flushing code is the > > > same. Merge them into one function and share the common code. > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > > > --- > > > mm/memcontrol.c | 88 +++++++++++++++++++++++-------------------------- > > > 1 file changed, 42 insertions(+), 46 deletions(-) > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index b205b2413186..88e8afc49a46 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -2410,39 +2410,56 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > > > mutex_unlock(&percpu_charge_mutex); > > > } > > > > > > -static int memcg_hotplug_cpu_dead(unsigned int cpu) > > > +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) > > > { > > > - struct memcg_stock_pcp *stock; > > > - struct mem_cgroup *memcg; > > > - > > > - stock = &per_cpu(memcg_stock, cpu); > > > - drain_stock(stock); > > > + int nid; > > > > > > - for_each_mem_cgroup(memcg) { > > > + for_each_node(nid) { > > > + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; > > > + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0, }; > > ^^^^ > > Same here. 
> > > > > + struct batched_lruvec_stat *lstatc; > > > int i; > > > > > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > > > - int nid; > > > - > > > - for_each_node(nid) { > > > - struct batched_lruvec_stat *lstatc; > > > - struct mem_cgroup_per_node *pn; > > > - long x; > > > - > > > - pn = memcg->nodeinfo[nid]; > > > + if (cpu == -1) { > > > + int cpui; > > > + /* > > > + * The memcg is about to be freed, collect all > > > + * CPUs, no need to zero anything out. > > > + */ > > > + for_each_online_cpu(cpui) { > > > + lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpui); > > > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > > + stat[i] += lstatc->count[i]; > > > + } > > > + } else { > > > + /* > > > + * The CPU has gone away, collect and zero out > > > + * its stats, it may come back later. > > > + */ > > > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > > > lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); > > > - > > > - x = lstatc->count[i]; > > > + stat[i] = lstatc->count[i]; > > > lstatc->count[i] = 0; > > > - > > > - if (x) { > > > - do { > > > - atomic_long_add(x, &pn->lruvec_stat[i]); > > > - } while ((pn = parent_nodeinfo(pn, nid))); > > > - } > > > } > > > } > > > + > > > + do { > > > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > > + atomic_long_add(stat[i], &pn->lruvec_stat[i]); > > > + } while ((pn = parent_nodeinfo(pn, nid))); > > > } > > > +} > > > + > > > +static int memcg_hotplug_cpu_dead(unsigned int cpu) > > > +{ > > > + struct memcg_stock_pcp *stock; > > > + struct mem_cgroup *memcg; > > > + > > > + stock = &per_cpu(memcg_stock, cpu); > > > + drain_stock(stock); > > > + > > > + for_each_mem_cgroup(memcg) > > > + memcg_flush_lruvec_page_state(memcg, cpu); > > > > > > return 0; > > > } > > > @@ -3636,27 +3653,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > > > } > > > } > > > > > > -static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg) > > > -{ > > > - int node; > > > - > > > - for_each_node(node) { > > > - 
struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > > > - unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, }; > > > - struct mem_cgroup_per_node *pi; > > > - int cpu, i; > > > - > > > - for_each_online_cpu(cpu) > > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > > - stat[i] += per_cpu( > > > - pn->lruvec_stat_cpu->count[i], cpu); > > > - > > > - for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) > > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > > > - atomic_long_add(stat[i], &pi->lruvec_stat[i]); > > > - } > > > -} > > > - > > > #ifdef CONFIG_MEMCG_KMEM > > > static int memcg_online_kmem(struct mem_cgroup *memcg) > > > { > > > @@ -5197,7 +5193,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) > > > * Flush percpu lruvec stats to guarantee the value > > > * correctness on parent's and all ancestor levels. > > > */ > > > - memcg_flush_lruvec_page_state(memcg); > > > + memcg_flush_lruvec_page_state(memcg, -1); > > > > I wonder if adding "cpu" or "percpu" into the function name will make clearer what -1 means? > > E.g. memcg_flush_(per)cpu_lruvec_stats(memcg, -1). > > Yes, it's a bit ominous. I changed it to > > memcg_flush_lruvec_page_state_cpu(memcg, -1); Works for me! But honestly I don't understand what does "page_state" mean in this context. Thanks! ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing @ 2021-02-05 15:17 ` Michal Hocko 0 siblings, 0 replies; 82+ messages in thread From: Michal Hocko @ 2021-02-05 15:17 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team On Tue 02-02-21 13:47:46, Johannes Weiner wrote: > There are two functions to flush the per-cpu data of an lruvec into > the rest of the cgroup tree: when the cgroup is being freed, and when > a CPU disappears during hotplug. The difference is whether all CPUs or > just one is being collected, but the rest of the flushing code is the > same. Merge them into one function and share the common code. IIUC the only reason for the cpu == -1 special case is to avoid zeroing, right? Is this optimization worth the special case? The code would be slightly easier to follow without this. > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Anyway the above is not really a fundamental objection. It is more important to unify the flushing. 
Acked-by: Michal Hocko <mhocko@suse.com> > --- > mm/memcontrol.c | 88 +++++++++++++++++++++++-------------------------- > 1 file changed, 42 insertions(+), 46 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b205b2413186..88e8afc49a46 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2410,39 +2410,56 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > mutex_unlock(&percpu_charge_mutex); > } > > -static int memcg_hotplug_cpu_dead(unsigned int cpu) > +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) > { > - struct memcg_stock_pcp *stock; > - struct mem_cgroup *memcg; > - > - stock = &per_cpu(memcg_stock, cpu); > - drain_stock(stock); > + int nid; > > - for_each_mem_cgroup(memcg) { > + for_each_node(nid) { > + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; > + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0, }; > + struct batched_lruvec_stat *lstatc; > int i; > > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > - int nid; > - > - for_each_node(nid) { > - struct batched_lruvec_stat *lstatc; > - struct mem_cgroup_per_node *pn; > - long x; > - > - pn = memcg->nodeinfo[nid]; > + if (cpu == -1) { > + int cpui; > + /* > + * The memcg is about to be freed, collect all > + * CPUs, no need to zero anything out. > + */ > + for_each_online_cpu(cpui) { > + lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpui); > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > + stat[i] += lstatc->count[i]; > + } > + } else { > + /* > + * The CPU has gone away, collect and zero out > + * its stats, it may come back later. 
> + */ > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); > - > - x = lstatc->count[i]; > + stat[i] = lstatc->count[i]; > lstatc->count[i] = 0; > - > - if (x) { > - do { > - atomic_long_add(x, &pn->lruvec_stat[i]); > - } while ((pn = parent_nodeinfo(pn, nid))); > - } > } > } > + > + do { > + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > + atomic_long_add(stat[i], &pn->lruvec_stat[i]); > + } while ((pn = parent_nodeinfo(pn, nid))); > } > +} > + > +static int memcg_hotplug_cpu_dead(unsigned int cpu) > +{ > + struct memcg_stock_pcp *stock; > + struct mem_cgroup *memcg; > + > + stock = &per_cpu(memcg_stock, cpu); > + drain_stock(stock); > + > + for_each_mem_cgroup(memcg) > + memcg_flush_lruvec_page_state(memcg, cpu); > > return 0; > } > @@ -3636,27 +3653,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > } > } > > -static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg) > -{ > - int node; > - > - for_each_node(node) { > - struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; > - unsigned long stat[NR_VM_NODE_STAT_ITEMS] = {0, }; > - struct mem_cgroup_per_node *pi; > - int cpu, i; > - > - for_each_online_cpu(cpu) > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > - stat[i] += per_cpu( > - pn->lruvec_stat_cpu->count[i], cpu); > - > - for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > - atomic_long_add(stat[i], &pi->lruvec_stat[i]); > - } > -} > - > #ifdef CONFIG_MEMCG_KMEM > static int memcg_online_kmem(struct mem_cgroup *memcg) > { > @@ -5197,7 +5193,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) > * Flush percpu lruvec stats to guarantee the value > * correctness on parent's and all ancestor levels. > */ > - memcg_flush_lruvec_page_state(memcg); > + memcg_flush_lruvec_page_state(memcg, -1); > __mem_cgroup_free(memcg); > } > > -- > 2.30.0 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing
From: Johannes Weiner @ 2021-02-05 17:10 UTC
To: Michal Hocko
Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Fri, Feb 05, 2021 at 04:17:27PM +0100, Michal Hocko wrote:
> On Tue 02-02-21 13:47:46, Johannes Weiner wrote:
> > There are two functions to flush the per-cpu data of an lruvec into
> > the rest of the cgroup tree: when the cgroup is being freed, and when
> > a CPU disappears during hotplug. The difference is whether all CPUs or
> > just one is being collected, but the rest of the flushing code is the
> > same. Merge them into one function and share the common code.
>
> IIUC the only reason for the cpu == -1 special case is to avoid
> zeroing, right? Is this optimization worth the special case? The code
> would be slightly easier to follow without this.

Hm, it was less about the optimization and more about which CPU(s)
need(s) to be handled. But it's pretty silly the way it's written,
indeed.

I'll move the for_each_online_cpu() to the caller and drop the cpu==-1
special casing, it makes things much simpler and more obvious.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Anyway the above is not really a fundamental objection. It is more important
> to unify the flushing.
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks. v2 is different, so I'll wait with taking the ack.
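
[Editorial illustration, not part of the thread.] The restructuring described in the reply above (one flush routine that always handles exactly one CPU, with the all-CPUs loop moved into the caller) can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions, not the v2 patch: `struct node_stats`, `flush_cpu()`, `flush_all_cpus()`, `NR_CPUS_MODEL` and `NR_ITEMS` are hypothetical names, plain `long` additions stand in for `atomic_long_add()`, and a fixed loop over all model CPUs stands in for `for_each_online_cpu()`.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for this model */
#define NR_ITEMS      3	/* stands in for NR_VM_NODE_STAT_ITEMS */

/* Hypothetical stand-in for one cgroup's per-node statistics. */
struct node_stats {
	struct node_stats *parent;		/* NULL at the root */
	long percpu[NR_CPUS_MODEL][NR_ITEMS];	/* models lruvec_stat_cpu */
	long total[NR_ITEMS];			/* models lruvec_stat */
};

/*
 * Fold one CPU's pending counts into this node and every ancestor.
 * The counts are always zeroed, so there is no cpu == -1 special
 * case and flushing the same CPU twice adds nothing the second time.
 */
static void flush_cpu(struct node_stats *pn, int cpu)
{
	long stat[NR_ITEMS];
	struct node_stats *pi;
	int i;

	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);

	for (i = 0; i < NR_ITEMS; i++) {
		stat[i] = pn->percpu[cpu][i];
		pn->percpu[cpu][i] = 0;
	}

	for (pi = pn; pi; pi = pi->parent)
		for (i = 0; i < NR_ITEMS; i++)
			pi->total[i] += stat[i];
}

/* Caller for the "group is being freed" case: walk every CPU. */
static void flush_all_cpus(struct node_stats *pn)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		flush_cpu(pn, cpu);
}
```

In this model the hotplug path would call `flush_cpu()` once for the departed CPU, while the free path would call `flush_all_cpus()`. The cost of dropping the special case is the extra zeroing on the free path, which is the trade-off raised in the review question.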
end of thread, newest: ~2021-02-08 14:32 UTC

Thread overview: 82+ messages

2021-02-02 18:47 [PATCH 0/7]: mm: memcontrol: switch to rstat Johannes Weiner
2021-02-02 18:47 ` [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
2021-02-02 22:23   ` Shakeel Butt
2021-02-02 23:07   ` Roman Gushchin
2021-02-03  2:28   ` Roman Gushchin
2021-02-04 19:29   ` Johannes Weiner
2021-02-04 19:34   ` Roman Gushchin
2021-02-05 17:50   ` Johannes Weiner
2021-02-04 13:28   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
2021-02-02 22:24   ` Shakeel Butt
2021-02-02 23:13   ` Roman Gushchin
2021-02-04 13:29   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
2021-02-02 22:26   ` Shakeel Butt
2021-02-02 23:17   ` Roman Gushchin
2021-02-04 13:30   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 4/7] cgroup: rstat: support cgroup1 Johannes Weiner
2021-02-03  1:16   ` Roman Gushchin
2021-02-04 13:39   ` Michal Hocko
2021-02-04 16:01   ` Johannes Weiner
2021-02-04 16:42   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 5/7] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
2021-02-02 18:47 ` [PATCH 6/7] mm: memcontrol: switch to rstat Johannes Weiner
2021-02-03  1:47   ` Roman Gushchin
2021-02-04 16:26   ` Johannes Weiner
2021-02-04 18:45   ` Roman Gushchin
2021-02-04 20:05   ` Johannes Weiner
2021-02-04 14:19   ` Michal Hocko
2021-02-04 16:15   ` Johannes Weiner
2021-02-04 16:44   ` Michal Hocko
2021-02-04 20:28   ` Johannes Weiner
2021-02-05 15:05   ` Michal Hocko
2021-02-05 16:34   ` Johannes Weiner
2021-02-08 14:07   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
2021-02-03  2:25   ` Roman Gushchin
2021-02-04 21:44   ` Johannes Weiner
2021-02-04 21:47   ` Roman Gushchin
2021-02-05 15:17   ` Michal Hocko
2021-02-05 17:10   ` Johannes Weiner