Linux-mm Archive on lore.kernel.org
From: Shakeel Butt <shakeelb@google.com>
To: Roman Gushchin <guro@fb.com>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Richard Palethorpe" <rpalethorpe@suse.com>,
	"LTP List" <ltp@lists.linux.it>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Christoph Lameter" <cl@linux.com>,
	"Michal Hocko" <mhocko@kernel.org>, "Tejun Heo" <tj@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Linux MM" <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Michal Hocko" <mhocko@suse.com>
Subject: Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
Date: Tue, 10 Nov 2020 07:11:28 -0800
Message-ID: <CALvZod7GrYayHjYsqtF2AfcvkbTHCyWQJW4oXoO3fSGJeotDpQ@mail.gmail.com>
In-Reply-To: <20201110012758.GA2612097@carbon.dhcp.thefacebook.com>

On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > > If we want these counters to function properly, then we should go in the opposite
> > > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > >
> > > > I suspect this is also by far the most robust solution from a code and
> > > > maintenance POV.
> > > >
> > > > I don't recall the page counter at the root level having been a
> > > > concern in recent years, even though it's widely used in production
> > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > batched, which means it isn't actually part of the memcg hotpath.
> > >
> > >
> > > I agree.
> > >
> > > Here is my first attempt. Comments are welcome!
> > >
> > > It doesn't solve the original problem though (use_hierarchy == false and
> > > objcg reparenting), I'll send a separate patch for that.
> > >
> > > Thanks!
> > >
> > > --
> > >
> > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin <guro@fb.com>
> > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > >  specially
> > >
> > > > Currently the root memory cgroup is treated in a special way:
> > > > it's not charged and uncharged directly (only indirectly via its
> > > > descendants), and processes belonging to the root memory cgroup are
> > > > exempt from kernel and socket memory accounting.
> > >
> > > At the same time some of root level statistics and data are available
> > > to a user:
> > >   - cgroup v2: memory.stat
> > >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > >
> > > > Historically, the reason for the special treatment was to avoid
> > > > extra performance cost, but that is unlikely to be a good reason now:
> > > > over the years the performance of the memory cgroup code has improved
> > > > significantly. Also, on a modern system actively using cgroups
> > > > (e.g. managed by systemd) there are usually no (significant)
> > > > processes in the root memory cgroup.
> > >
> > > > The special treatment of the root memory cgroup creates a number of
> > > > issues visible to a user:
> > > > 1) slab stats on the root level do not include the slab memory
> > > >    consumed by processes in the root memory cgroup
> > > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > > >    is not included in memory.kmem.usage_in_bytes
> > > > 3) socket memory consumed by processes in the root memory cgroup
> > > >    is not included in memory.kmem.tcp.usage_in_bytes
> > > >
> > > > It also complicates the code and increases the risk of new bugs.
> > >
> > > > This patch removes a number of exceptions related to the handling of
> > > > the root memory cgroup. With this patch applied the root memory cgroup
> > > > is treated the same as other cgroups in the following cases:
> > > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > > >    and cancel_charge() do not return immediately if the root memory
> > > >    cgroup is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > > >    do not handle the root memory cgroup specially.
> > > > 2) per-memcg slab statistics are gathered for the root memory cgroup
> > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > >    cgroup
> > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > >    by processes belonging to the root memory cgroup
> > > 5) if a socket is opened by a process in the root memory cgroup,
> > >    the socket memory is accounted
> > > 6) root cgroup is charged for the used swap memory.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > This looks great.
> >
> > The try_charge(), cancel_charge() etc. paths are relatively
> > straight-forward and look correct to me.
> >
> > The swap counters too.
> >
> > Slab is a bit trickier, but it also looks correct to me.
> >
> > I'm having some trouble with the shrinkers. Currently, tracked objects
> > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > the root cgroup, as well as untracked objects, live in a global pool.
> > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > case the root_mem_cgroup and redirect to the global pool.
> >
> > After your patch, tracked objects allocated in the root cgroup
> > actually live in the root cgroup. Removing the shrinker special case
> > is correct in order to shrink those - but it removes the call to
> > shrink the global pool of untracked allocation classes.
> >
> > I think we need to restore the double call to shrink_slab() we had
> > prior to this:
> >
> > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Date:   Fri Aug 17 15:48:17 2018 -0700
> >
> >     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> >
> >     The patch makes shrink_slab() be called for root_mem_cgroup in the same
> >     way as it's called for the rest of cgroups.  This simplifies the logic
> >     and improves the readability.
> >
> > where we iterate through all cgroups, including the root, to reclaim
> > objects accounted to those respective groups; and then a call to scan
> > the global pool of untracked objects in that numa node.
>
> I agree, thank you for pointing at this commit.
>
> >
> > For ease of review/verification, it could be helpful to split the
> > patch and remove the root exception case-by-case (not callsite by
> > callsite, but e.g. the swap counter, the memory counter etc.).
>
> Sorry for the long pause, here's an update. I've split the patch,
> fixed a couple of issues and was almost ready to send it upstream,
> but then I noticed that on cgroup v1 the kmem and memsw counters
> sometimes go negative and generate a warning in dmesg. It happens
> for a short time early in the system's uptime. I haven't seen it
> happen with the memory counter.
>
> My investigation showed that the reason is that the result of a
> cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> early stages. Depending on the return value we charge or skip the kmem
> counter and also handle the swap/memsw counter differently.
>
> The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> can change at any moment. So I don't see how to make all of root's
> counters consistent without tracking them all no matter which cgroup version
> is used. That is obviously overkill and would add overhead that can hardly
> be justified.
>
> I'd appreciate any ideas, but I don't see a good path forward here
> (except fixing the particular issue with root's slab stats with
> Muchun's patch).
>

Since commit 0158115f702b0 ("memcg, kmem: deprecate
kmem.limit_in_bytes"), we have been in the process of deprecating the
limit on kmem. If we decide that now is the time to deprecate it, we can
convert the kmem page counter to a memcg stat, update it for both v1
and v2 and serve v1's kmem.usage_in_bytes from that memcg stat. The
memcg stat is more efficient than the page counter, so I don't think
overhead should be an issue. This new memcg stat would represent all
types of kmem memory for a memcg, such as slab, stack and no-type.
What do you think?



