From: Michal Hocko <mhocko@suse.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, Tejun Heo <tj@kernel.org>,
    Roman Gushchin <guro@fb.com>, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 6/7] mm: memcontrol: switch to rstat
Date: Mon, 8 Feb 2021 15:07:57 +0100
Message-ID: <YCFFvdd7CgE+GWEQ@dhcp22.suse.cz>
In-Reply-To: <YB1zi/bZdeL2j59I@cmpxchg.org>

On Fri 05-02-21 11:34:19, Johannes Weiner wrote:
> On Fri, Feb 05, 2021 at 04:05:20PM +0100, Michal Hocko wrote:
> > On Tue 02-02-21 13:47:45, Johannes Weiner wrote:
> > > Replace the memory controller's custom hierarchical stats code with
> > > the generic rstat infrastructure provided by the cgroup core.
> > >
> > > The current implementation does batched upward propagation from the
> > > write side (i.e. as stats change). The per-cpu batches introduce an
> > > error, which is multiplied by the number of subgroups in a tree. In
> > > systems with many CPUs and sizable cgroup trees, the error can be
> > > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> > > subgroups results in an error of up to 128M per stat item). This can
> > > entirely swallow allocation bursts inside a workload that the user is
> > > expecting to see reflected in the statistics.
> > >
> > > In the past, we've done read-side aggregation, where a memory.stat
> > > read would have to walk the entire subtree and add up per-cpu
> > > counts. This became problematic with lazily-freed cgroups: we could
> > > have large subtrees where most cgroups were entirely idle. Hence the
> > > switch to change-driven upward propagation. Unfortunately, it needed
> > > to trade accuracy for speed due to the write side being so hot.
> > >
> > > Rstat combines the best of both worlds: from the write side, it
> > > cheaply maintains a queue of cgroups that have pending changes, so
> > > that the read side can do selective tree aggregation. This way the
> > > reported stats will always be precise and recent as can be, while the
> > > aggregation can skip over potentially large numbers of idle cgroups.
> > >
> > > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> > > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> > > aggregation. It removes 3 words from the per-cpu data. It eliminates
> > > memcg_exact_page_state(), since memcg_page_state() is now exact.
> >
> > The above confused me a bit. I can see the pcp data size increased by
> > adding _prev. The resulting memory footprint should be increased by
> >   sizeof(long) * (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) * (CPUS + 1)
> > which is roughly 1kB per CPU per memcg unless I have made a
> > mistake. This is quite a lot and it should be mentioned in the
> > changelog.
>
> Not quite, you missed a hunk further below in the patch.

You are right.

> Yes, the _prev arrays are added to the percpu struct. HOWEVER, we used
> to have TWO percpu structs in a memcg: one for local data, one for
> hierarchical data. In the rstat format, one is enough to capture both:
>
> -	/* Legacy local VM stats and events */
> -	struct memcg_vmstats_percpu __percpu *vmstats_local;
> -
> -	/* Subtree VM stats and events (batched updates) */
> 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
> This eliminates dead duplicates of the nr_page_events and
> targets[MEM_CGROUP_NTARGETS(2)] we used to carry, which means we have
> a net reduction of 3 longs in the percpu data with this series.

In the old code we used to have 2*(MEMCG_NR_STAT + NR_VM_EVENT_ITEMS +
MEM_CGROUP_NTARGETS) longs of pcp data per CPU (two struct
memcg_vmstats_percpu) plus MEMCG_NR_STAT + NR_VM_EVENT_ITEMS atomics.
The new code has 2*MEMCG_NR_STAT + 2*NR_VM_EVENT_ITEMS +
MEM_CGROUP_NTARGETS longs of pcp data per CPU, plus 2*MEMCG_NR_STAT +
2*NR_VM_EVENT_ITEMS aggregated counters. So the resulting diff is
	MEMCG_NR_STAT + NR_VM_EVENT_ITEMS - MEM_CGROUP_NTARGETS * nr_cpus
which would be 1024 - 2 * nr_cpus. Which looks better. Thanks, and
sorry for misreading the patch.

-- 
Michal Hocko
SUSE Labs