From: Michal Hocko <mhocko@suse.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 6/7] mm: memcontrol: switch to rstat
Date: Thu, 4 Feb 2021 17:44:06 +0100	[thread overview]
Message-ID: <YBwkVoNa77Nn5TE9@dhcp22.suse.cz> (raw)
In-Reply-To: <YBwdiu2Fj4JHgqhQ@cmpxchg.org>

On Thu 04-02-21 11:15:06, Johannes Weiner wrote:
> Hello Michal,
> 
> On Thu, Feb 04, 2021 at 03:19:17PM +0100, Michal Hocko wrote:
> > On Tue 02-02-21 13:47:45, Johannes Weiner wrote:
> > > Replace the memory controller's custom hierarchical stats code with
> > > the generic rstat infrastructure provided by the cgroup core.
> > > 
> > > The current implementation does batched upward propagation from the
> > > write side (i.e. as stats change). The per-cpu batches introduce an
> > > error, which is multiplied by the number of subgroups in a tree. In
> > > systems with many CPUs and sizable cgroup trees, the error can be
> > > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> > > subgroups results in an error of up to 128M per stat item). This can
> > > entirely swallow allocation bursts inside a workload that the user is
> > > expecting to see reflected in the statistics.
> > > 
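Just to spell out the arithmetic behind that worst case (assuming 4K
pages):

    32 pages * 4096 bytes = 128K  of drift per CPU, per cgroup
    128K * 32 CPUs        =   4M  per cgroup
    4M * 32 subgroups     = 128M  per stat item across the tree
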
> > > In the past, we've done read-side aggregation, where a memory.stat
> > > read would have to walk the entire subtree and add up per-cpu
> > > counts. This became problematic with lazily-freed cgroups: we could
> > > have large subtrees where most cgroups were entirely idle. Hence the
> > > switch to change-driven upward propagation. Unfortunately, it needed
> > > to trade accuracy for speed due to the write side being so hot.
> > > 
> > > Rstat combines the best of both worlds: from the write side, it
> > > cheaply maintains a queue of cgroups that have pending changes, so
> > > that the read side can do selective tree aggregation. This way the
> > > reported stats will always be precise and recent as can be, while the
> > > aggregation can skip over potentially large numbers of idle cgroups.
> > > 
> > > This adds a second vmstats array to struct mem_cgroup (MEMCG_NR_STAT +
> > > NR_VM_EVENT_ITEMS entries) to track pending subtree deltas during
> > > upward aggregation. It removes 3 words from the per-cpu data. It
> > > eliminates memcg_exact_page_state(), since memcg_page_state() is now
> > > exact.
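
If I read the patch correctly, the new layout amounts to something like
the sketch below; the field names here are my shorthand rather than the
literal diff:

    struct memcg_vmstats {
            /* aggregated hierarchical state */
            long          state[MEMCG_NR_STAT];
            unsigned long events[NR_VM_EVENT_ITEMS];

            /* subtree deltas not yet propagated to the parent */
            long          state_pending[MEMCG_NR_STAT];
            unsigned long events_pending[NR_VM_EVENT_ITEMS];
    };

i.e. during a flush each cgroup folds its per-cpu deltas into its own
aggregate and into the parent's pending arrays, which is what keeps the
upward aggregation cheap.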
> > 
> > I am still digesting the details and need to look deeper into how
> > rstat works, but removing our own stats code is definitely a good
> > plan, especially given the existing limitations and problems that
> > would need fixing.
> > 
> > Just to check that my high-level understanding is correct: the
> > transition effectively removes the need to manually sync counters up
> > the hierarchy and partially outsources that decision to the rstat
> > core. The controller is only responsible for telling the core how
> > that syncing is done (e.g. which specific counters etc).
> 
> Yes, exactly.
> 
> rstat implements a tree of cgroups that have local changes pending,
> and a flush walk on that tree. But it's all driven by the controller.
> 
> memcg needs to tell rstat 1) when stats in a local cgroup change
> e.g. when we do mod_memcg_state() (cgroup_rstat_updated), 2) when to
> flush, e.g. before a memory.stat read (cgroup_rstat_flush), and 3) how
> to flush one cgroup's per-cpu state and propagate it upward to the
> parent during rstat's flush walk (.css_rstat_flush).
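
IIUC, in heavily simplified pseudo-C the contract between memcg and the
cgroup core then looks like this. The two cgroup_rstat_* calls and the
.css_rstat_flush hook are the ones you name above; the memcg-side
field and function names are made up for illustration:

    /* 1) write side: flag this cgroup as having pending per-cpu deltas */
    __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
    cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

    /* 2) read side: settle the tree before reporting, e.g. memory.stat */
    cgroup_rstat_flush(memcg->css.cgroup);

    /* 3) per-cpu flush callback, invoked during the rstat flush walk */
    struct cgroup_subsys memory_cgrp_subsys = {
            ...
            .css_rstat_flush = memcg_css_rstat_flush,
    };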

Can we have this short summary in a changelog please?

> > Explicit flushes are needed when you want an exact value (e.g. when
> > values are presented to userspace). I do not see any flushes done
> > proactively by the core, except for cleanup on release.
> > 
> > Is the above correct understanding?
> 
> Yes, that's correct.

OK, thanks for the confirmation. I will have a closer look tomorrow,
but I do not see any problems so far.

-- 
Michal Hocko
SUSE Labs

Thread overview: 82+ messages

2021-02-02 18:47 [PATCH 0/7]: mm: memcontrol: switch to rstat Johannes Weiner
2021-02-02 18:47 ` [PATCH 1/7] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
2021-02-02 22:23   ` Shakeel Butt
2021-02-02 23:07   ` Roman Gushchin
2021-02-03  2:28     ` Roman Gushchin
2021-02-04 19:29       ` Johannes Weiner
2021-02-04 19:34         ` Roman Gushchin
2021-02-05 17:50           ` Johannes Weiner
2021-02-04 13:28   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 2/7] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
2021-02-02 22:24   ` Shakeel Butt
2021-02-02 23:13   ` Roman Gushchin
2021-02-04 13:29   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 3/7] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
2021-02-02 22:26   ` Shakeel Butt
2021-02-02 23:17   ` Roman Gushchin
2021-02-04 13:30   ` Michal Hocko
2021-02-02 18:47 ` [PATCH 4/7] cgroup: rstat: support cgroup1 Johannes Weiner
2021-02-03  1:16   ` Roman Gushchin
2021-02-04 13:39   ` Michal Hocko
2021-02-04 16:01     ` Johannes Weiner
2021-02-04 16:42       ` Michal Hocko
2021-02-02 18:47 ` [PATCH 5/7] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
2021-02-02 18:47 ` [PATCH 6/7] mm: memcontrol: switch to rstat Johannes Weiner
2021-02-03  1:47   ` Roman Gushchin
2021-02-04 16:26     ` Johannes Weiner
2021-02-04 18:45       ` Roman Gushchin
2021-02-04 20:05         ` Johannes Weiner
2021-02-04 14:19   ` Michal Hocko
2021-02-04 16:15     ` Johannes Weiner
2021-02-04 16:44       ` Michal Hocko [this message]
2021-02-04 20:28         ` Johannes Weiner
2021-02-05 15:05   ` Michal Hocko
2021-02-05 16:34     ` Johannes Weiner
2021-02-08 14:07       ` Michal Hocko
2021-02-02 18:47 ` [PATCH 7/7] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
2021-02-03  2:25   ` Roman Gushchin
2021-02-04 21:44     ` Johannes Weiner
2021-02-04 21:47       ` Roman Gushchin
2021-02-05 15:17   ` Michal Hocko
2021-02-05 17:10     ` Johannes Weiner
