From: Roman Gushchin <guro@fb.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Shakeel Butt <shakeelb@google.com>, <linux-mm@kvack.org>,
	<kernel-team@fb.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
Date: Tue, 11 Aug 2020 09:55:27 -0700
Message-ID: <20200811165527.GA1507044@carbon.DHCP.thefacebook.com>
In-Reply-To: <20200811144754.GA45201@blackbook>

On Tue, Aug 11, 2020 at 04:47:54PM +0200, Michal Koutný wrote:
> On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > In general, yes. But in this case I think it wouldn't be a good idea:
> > most often cgroups are created by a centralized daemon (systemd),
> > which is usually located in the root cgroup. Even if it's not located
> > in the root cgroup, limiting its memory will likely affect the whole
> > system, even if only one specific limit was reached.
> The generic scheme would be (assuming the no-internal-process
> constraint holds in the root too):
> 
> ` root or delegated root
>   ` manager-cgroup (systemd, docker, ...)
>   ` [aggregation group(s)]
>     ` job-group-1
>     ` ...
>     ` job-group-n
> 
> > If there is a containerized workload that creates sub-cgroups,
> > charging its parent cgroup is perfectly effective.
> No dispute about this in either approach.
> 
> > And the opposite: if we charge the cgroup of the process that created
> > a cgroup, we won't cover the most common case: systemd creating
> > cgroups for all services in the system.
> What I mean is that systemd should be charged for the cgroup creation.
> Or more generally, any container/cgroup manager should be charged.
> Consider a leak where it fails to remove spent cgroups; IMO the effect
> (throttling, reclaim) should be exercised on such a culprit.

As I said, there are 2 problems with charging systemd (or a similar daemon):
1) It often belongs to the root cgroup.
2) OOMing or failing some random memory allocations is a bad way
   to "communicate" a memory shortage to the daemon.
   What we really want is to prevent creating a huge number of cgroups
   (including dying cgroups) in some specific sub-tree(s).
   OOMing the daemon or returning -ENOMEM from some random syscalls
   will not help us reach that goal and will likely create a bad
   experience for the user.

In the generic case I don't see how we can charge the cgroup that
creates cgroups without solving these problems first.

And if there is a very special case where we have to limit it,
we can just add an additional layer:

` root or delegated root
   ` manager-parent-cgroup-with-a-limit
     ` manager-cgroup (systemd, docker, ...)
   ` [aggregation group(s)]
     ` job-group-1
     ` ...
     ` job-group-n
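
To illustrate, a minimal userspace sketch of setting up such a layer.
It assumes cgroup v2 mounted at /sys/fs/cgroup with write access
delegated, and both the cgroup names and the 512M limit are made-up
values, not anything from the patch:

#include <stdio.h>
#include <sys/stat.h>

#define PARENT "/sys/fs/cgroup/manager-parent-cgroup-with-a-limit"

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* The extra layer that will absorb the charges. */
	mkdir(PARENT, 0755);

	/* Enable the memory controller for the new layer and its children. */
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+memory");
	write_file(PARENT "/cgroup.subtree_control", "+memory");

	/*
	 * Cap the subtree: cgroup structures charged upwards from the
	 * manager then count against this limit, not the whole system.
	 */
	write_file(PARENT "/memory.max", "512M");

	/* The manager itself runs one level below the limit. */
	mkdir(PARENT "/manager-cgroup", 0755);
	return 0;
}

The manager's own allocations (including cgroup structures charged up
to this layer) are then bounded without touching the job groups.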

> 
> I don't think the existing workload (job-group-i above) should
> unnecessarily suffer when only the manager is acting up. Is that different
> from your idea?
> 
> > Right, but it's quite unusual for tasks from one cgroup to create
> > sub-cgroups in a completely different cgroup. In this particular case
> > there are tons of other ways a task from C00 can hurt C1.
> I agree with that.
> 
> 
> If I haven't overlooked anything, this should be the first case where
> cgroup-related structures are accounted (please correct me).
> So this sets a precedent, should other structures prove useful to
> account in the future too.

Right.

> I'm thinking about cpu_cgroup_css_alloc(), which can
> also allocate a lot (with a big CPU count). The current approach would
> lead to situations where matching cpu and memory csses needn't exist,
> and that would need special handling.

I'd definitely charge the parent cgroup in all similar cases.
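
For the record, a rough sketch of that pattern, not the actual patch:
memalloc_use_memcg()/memalloc_unuse_memcg() are the helpers available
around 5.9 (later replaced by set_active_memcg()), and the
vmstats_percpu allocation stands in for whatever a controller's
css_alloc() path allocates:

#include <linux/memcontrol.h>
#include <linux/percpu.h>
#include <linux/sched/mm.h>

static int memcg_alloc_vmstats_charged(struct mem_cgroup *memcg,
				       struct mem_cgroup *parent)
{
	/*
	 * Make the parent the active memcg, so the GFP_KERNEL_ACCOUNT
	 * allocation below is charged to it rather than to the task
	 * that happened to call mkdir() (e.g. systemd in the root).
	 */
	memalloc_use_memcg(parent);
	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
						 GFP_KERNEL_ACCOUNT);
	memalloc_unuse_memcg();

	return memcg->vmstats_percpu ? 0 : -ENOMEM;
}

The failure mode then becomes mkdir() returning an error inside the
limited subtree rather than an OOM of an unrelated daemon.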

> 
> 
> > On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > > These week-old issues appear to be significant.  Roman?  Or someone
> > > else?
> Despite my concerns, I don't think this is fundamental or impossible
> to change later, so it doesn't prevent the inclusion in 5.9-rc1.

Thank you!
