From: Roman Gushchin <firstname.lastname@example.org>
To: "Michal Koutný" <email@example.com>
Cc: Andrew Morton <firstname.lastname@example.org>,
	Dennis Zhou <email@example.com>,
	Tejun Heo <firstname.lastname@example.org>,
	Christoph Lameter <email@example.com>,
	Johannes Weiner <firstname.lastname@example.org>,
	Michal Hocko <email@example.com>,
	Shakeel Butt <firstname.lastname@example.org>,
	<email@example.com>, <firstname.lastname@example.org>, <email@example.com>
Subject: Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
Date: Tue, 11 Aug 2020 12:32:28 -0700
Message-ID: <20200811193228.GC1507044@carbon.DHCP.thefacebook.com>
In-Reply-To: <20200811183225.GA62582@blackbook>

On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <firstname.lastname@example.org> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).

Ok, it's better.

>
> > 2) OOMing or failing some random memory allocations is a bad way
> >    to "communicate" a memory shortage to the daemon.
> >    What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...

cgroup.max.descendants limits the number of live cgroups; it can't limit
the number of dying cgroups.

>
> >    (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?

My scenario is simple: there is a large machine, which has no memory
pressure for some time (e.g. it is idle or running a workload with a small
working set). Periodically running services create a lot of cgroups,
usually in system.slice. After some time a significant part of the whole
memory is consumed by dying cgroups and their percpu data.
Getting rid of it and reclaiming all memory is not always possible (percpu
memory gets fragmented relatively easily) and is time consuming.

If we set memory.high on system.slice, it will create an artificial
memory pressure once we're getting close to the limit. It will trigger
the reclaim of user pages and slab objects, so eventually we'll be able
to release dying cgroups as well.

You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If we don't take the indirectly pinned memory
into account, we likely won't apply enough memory pressure.

If we limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create thrashing in init.slice, followed by OOM.
I struggle to see how that makes the life of a user better.

>
> > OOMing the daemon or returning -ENOMEM to some random syscalls
> > will not help us to reach the goal and likely will bring a bad
> > experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.

The problem is that OOM-killing the accompanying manager won't release
resources or help to get rid of accumulated cgroups. So in the very best
case it will prevent new cgroups from being created (as well as some other
random operations from being performed). Most likely the only way to "fix"
this for a user will be to reboot the machine.

>
> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
>
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> >
> > ` root or delegated root
> >   ` manager-parent-cgroup-with-a-limit
> >     ` manager-cgroup (systemd, docker, ...)
> >       ` [aggregation group(s)]
> >         ` job-group-1
> >         ` ...
> >         ` job-group-n
> If the charge goes to the parent of created cgroup (job-cgroup-i here),
> then the layer adds nothing. Am I missing something?

Sorry, I was wrong here, please ignore this part.

>
> > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate the controllers on the unified hierarchy, which is
> fine IMO.) Then the order of enabling controllers on a subtree (e.g.
> cpu,memory vs memory,cpu) by the manager would yield different charging.
> This seems wrong^W confusing to me.

I agree it's confusing.

Thanks!
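
For anyone who wants to check whether a machine is in the scenario
described above (many dying cgroups pinning percpu memory under one
subtree), here is a minimal Python sketch. It only parses the text of two
cgroup v2 files: cgroup.stat (keys nr_descendants and
nr_dying_descendants) and memory.stat (the percpu key, which, as far as I
understand, is what patch 3/5 of this series adds, so it is reported as 0
on kernels without it). The function names are hypothetical and the
/sys/fs/cgroup mount point used in the note below is an assumption about
a typical systemd setup, not something taken from this thread.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse a cgroup v2 'flat keyed' file: one 'key value' pair per line.

    Lines that do not look like 'key <integer>' are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def dying_cgroup_report(cgroup_stat_text, memory_stat_text):
    """Summarize live/dying descendants and charged percpu bytes.

    Missing keys are reported as 0 (e.g. 'percpu' on a kernel that
    does not export it).
    """
    cg = parse_flat_keyed(cgroup_stat_text)
    mem = parse_flat_keyed(memory_stat_text)
    return {
        "live": cg.get("nr_descendants", 0),
        "dying": cg.get("nr_dying_descendants", 0),
        "percpu_bytes": mem.get("percpu", 0),
    }


def report_for(cgroup_dir):
    """Read both files from a cgroup directory and build the report.

    cgroup_dir is an assumed path such as /sys/fs/cgroup/system.slice.
    """
    d = Path(cgroup_dir)
    return dying_cgroup_report(
        (d / "cgroup.stat").read_text(),
        (d / "memory.stat").read_text(),
    )
```

On a cgroup v2 system, calling report_for("/sys/fs/cgroup/system.slice")
should give a rough picture: a "dying" count far above "live" together
with a large "percpu_bytes" suggests the accumulation described in this
message. It does not show which objects pin the dying cgroups.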