From: Tejun Heo <tj@kernel.org> To: Yafang Shao <laoar.shao@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>, john fastabend <john.fastabend@gmail.com>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>, jolsa@kernel.org, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Muchun Song <songmuchun@bytedance.com>, Andrew Morton <akpm@linux-foundation.org>, lizefan.x@bytedance.com, Cgroups <cgroups@vger.kernel.org>, netdev <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>, Linux MM <linux-mm@kvack.org> Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Date: Fri, 19 Aug 2022 07:06:51 -1000 [thread overview] Message-ID: <Yv/DK+AGlMeBGkF1@slm.duckdns.org> (raw) In-Reply-To: <CALOAHbDcrj1ifFsNMHBEih5-SXY2rWViig4rQHi9N07JY6CjXA@mail.gmail.com> Hello, On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote: > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote: > > > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote: > > > We have the exact same problem for any resources which span multiple > > > instances of a service including page cache, tmpfs instances and any other > > > thing which can persist longer than procss life time. My current opinion is > > > > To expand a bit more on this point, once we start including page cache and > > tmpfs, we now get entangled with memory reclaim which then brings in IO and > > not-yet-but-eventually CPU usage. > > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead? Introducing a new layer in cgroup2 doesn't mean that any specific resource controller is enabled, so there is no runtime overhead difference. In terms of logical complexity, introducing a localized layer seems a lot more straightforward than building a whole separate tree. Note that the same applies to cgroup1 where collapsed controller tree is represented by simply not creating those layers in that particular controller tree. No matter how we cut the problem here, if we want to track these persistent resources, we have to create a cgroup to host them somewhere. The discussion we're having is mostly around where to put them. With your proposal, it can be anywhere and you draw out an example where the persistent cgroups form their own separate tree. What I'm saying is that the logical place to put it is where the current resource consumption is and we just need to put the persistent entity as the parent of the instances. Flexibility, just like anything else, isn't free. Here, if we extrapolate this approach, the cost is evidently hefty in that it doesn't generically work with the basic resource control structure. > > Once you start splitting the tree like > > you're suggesting here, all those will break down and now we have to worry > > about how to split resource accounting and control for the same entities > > across two split branches of the tree, which doesn't really make any sense. > > The k8s has already been broken thanks to the memcg accounting on bpf memory. > If you ignored it, I paste it below. > [0]"1. The memory usage is not consistent between the first generation and > new generations." > > This issue will persist even if you introduce a new layer. Please watch your tone. Again, this isn't a problem specific to k8s. We have the same problem with e.g. persistent tmpfs. One idea which I'm not against is allowing specific resources to be charged to an ancestor. We gotta think carefully about how such charges should be granted / denied but an approach like that jives well with the existing hierarchical control structure and because introducing a persistent layer does too, the combination of the two works well. > > So, we *really* don't wanna paint ourselves into that kind of a corner. This > > is a dead-end. Please ditch it. > > It makes non-sensen to ditch it. > Because, the hierarchy I described in the commit log is *one* use case > of the selectable memcg, but not *the only one* use case of it. If you > dislike that hierarchy, I will remove it to avoid misleading you. But if you drop that, what'd be the rationale for adding what you're proposing? Why would we want bpf memory charges to be attached any part of the hierarchy? > Even if you introduce a new layer, you still need the selectable memcg. > For example, to avoid the issue I described in [0], you still need to > charge to the parent cgroup instead of the current cgroup. As I wrote above, we've been discussing the above. Again, I'd be a lot more amenable to such approach because it fits with how everything is structured. > That's why I described in the commit log that the selectable memcg is flexible. Hopefully, my point on this is clear by now. Thanks. -- tejun
WARNING: multiple messages have this Message-ID (diff)
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> To: Yafang Shao <laoar.shao-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, Daniel Borkmann <daniel-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>, Andrii Nakryiko <andrii-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, Martin Lau <kafai-b10kYP2dOMg@public.gmane.org>, Song Liu <songliubraving-b10kYP2dOMg@public.gmane.org>, Yonghong Song <yhs-b10kYP2dOMg@public.gmane.org>, john fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, KP Singh <kpsingh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, Stanislav Fomichev <sdf-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, Hao Luo <haoluo-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, jolsa-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, Roman Gushchin <roman.gushchin-fxUVXftIFDnyG1zEObXtfA@public.gmane.org>, Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>, Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>, lizefan.x-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org, Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, netdev <netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, bpf <bpf-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, Linux MM <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org> Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Date: Fri, 19 Aug 2022 07:06:51 -1000 [thread overview] Message-ID: <Yv/DK+AGlMeBGkF1@slm.duckdns.org> (raw) In-Reply-To: <CALOAHbDcrj1ifFsNMHBEih5-SXY2rWViig4rQHi9N07JY6CjXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> Hello, On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote: > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > > > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote: > > > We have the exact same problem for any resources which span multiple > > > instances of a service including page cache, tmpfs instances and any other > > > thing which can persist longer than procss life time. My current opinion is > > > > To expand a bit more on this point, once we start including page cache and > > tmpfs, we now get entangled with memory reclaim which then brings in IO and > > not-yet-but-eventually CPU usage. > > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead? Introducing a new layer in cgroup2 doesn't mean that any specific resource controller is enabled, so there is no runtime overhead difference. In terms of logical complexity, introducing a localized layer seems a lot more straightforward than building a whole separate tree. Note that the same applies to cgroup1 where collapsed controller tree is represented by simply not creating those layers in that particular controller tree. No matter how we cut the problem here, if we want to track these persistent resources, we have to create a cgroup to host them somewhere. The discussion we're having is mostly around where to put them. With your proposal, it can be anywhere and you draw out an example where the persistent cgroups form their own separate tree. What I'm saying is that the logical place to put it is where the current resource consumption is and we just need to put the persistent entity as the parent of the instances. Flexibility, just like anything else, isn't free. Here, if we extrapolate this approach, the cost is evidently hefty in that it doesn't generically work with the basic resource control structure. > > Once you start splitting the tree like > > you're suggesting here, all those will break down and now we have to worry > > about how to split resource accounting and control for the same entities > > across two split branches of the tree, which doesn't really make any sense. > > The k8s has already been broken thanks to the memcg accounting on bpf memory. > If you ignored it, I paste it below. > [0]"1. The memory usage is not consistent between the first generation and > new generations." > > This issue will persist even if you introduce a new layer. Please watch your tone. Again, this isn't a problem specific to k8s. We have the same problem with e.g. persistent tmpfs. One idea which I'm not against is allowing specific resources to be charged to an ancestor. We gotta think carefully about how such charges should be granted / denied but an approach like that jives well with the existing hierarchical control structure and because introducing a persistent layer does too, the combination of the two works well. > > So, we *really* don't wanna paint ourselves into that kind of a corner. This > > is a dead-end. Please ditch it. > > It makes non-sensen to ditch it. > Because, the hierarchy I described in the commit log is *one* use case > of the selectable memcg, but not *the only one* use case of it. If you > dislike that hierarchy, I will remove it to avoid misleading you. But if you drop that, what'd be the rationale for adding what you're proposing? Why would we want bpf memory charges to be attached any part of the hierarchy? > Even if you introduce a new layer, you still need the selectable memcg. > For example, to avoid the issue I described in [0], you still need to > charge to the parent cgroup instead of the current cgroup. As I wrote above, we've been discussing the above. Again, I'd be a lot more amenable to such approach because it fits with how everything is structured. > That's why I described in the commit log that the selectable memcg is flexible. Hopefully, my point on this is clear by now. Thanks. -- tejun
next prev parent reply other threads:[~2022-08-19 17:45 UTC|newest] Thread overview: 61+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 01/12] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 19:11 ` Yosry Ahmed 2022-08-18 19:11 ` Yosry Ahmed 2022-08-18 14:31 ` [PATCH bpf-next v2 02/12] bpf: Introduce new helper bpf_map_put_memcg() Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 03/12] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 04/12] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 05/12] bpf: Save memcg in bpf_map_init_from_attr() Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 06/12] bpf: Use scoped-based charge in bpf_map_area_alloc Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 07/12] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 17:30 ` Andrii Nakryiko 2022-08-18 14:31 ` [PATCH bpf-next v2 08/12] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 09/12] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao 2022-08-18 14:31 ` Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao 2022-08-18 20:38 ` Shakeel Butt 2022-08-19 1:21 ` Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 11/12] bpf: Add return value for bpf_map_init_from_attr Yafang Shao 2022-08-18 14:31 ` [PATCH bpf-next v2 12/12] bpf: Introduce selectable memcg for bpf map Yafang Shao 2022-08-18 22:20 ` [PATCH bpf-next v2 00/12] " Tejun Heo 2022-08-18 22:20 ` Tejun Heo 2022-08-18 22:33 ` Tejun Heo 2022-08-19 1:09 ` Yafang Shao 2022-08-19 1:09 ` Yafang Shao 2022-08-19 17:06 ` Tejun Heo [this message] 2022-08-19 17:06 ` Tejun Heo 2022-08-20 2:25 ` Yafang Shao 2022-08-20 2:25 ` Yafang Shao 2022-08-22 11:29 ` [RFD RESEND] cgroup: Persistent memory usage tracking Tejun Heo 2022-08-22 11:29 ` Tejun Heo 2022-08-22 16:12 ` Shakeel Butt 2022-08-22 16:12 ` Shakeel Butt 2022-08-22 19:02 ` Mina Almasry 2022-08-22 19:02 ` Mina Almasry 2022-08-22 21:19 ` Johannes Weiner 2022-08-22 21:19 ` Johannes Weiner 2022-08-22 21:52 ` Mina Almasry 2022-08-22 21:52 ` Mina Almasry 2022-08-23 3:01 ` Roman Gushchin 2022-08-23 3:01 ` Roman Gushchin 2022-08-23 3:14 ` Tejun Heo 2022-08-23 3:14 ` Tejun Heo 2022-08-24 19:02 ` Mina Almasry 2022-08-24 19:02 ` Mina Almasry 2022-08-25 17:59 ` Tejun Heo 2022-08-25 17:59 ` Tejun Heo 2022-08-23 11:08 ` Yafang Shao 2022-08-23 11:08 ` Yafang Shao 2022-08-23 17:12 ` Tejun Heo 2022-08-23 17:12 ` Tejun Heo 2022-08-24 11:57 ` Yafang Shao 2022-08-24 11:57 ` Yafang Shao 2022-08-19 0:59 ` [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao 2022-08-19 16:45 ` Tejun Heo 2022-08-19 16:45 ` Tejun Heo
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=Yv/DK+AGlMeBGkF1@slm.duckdns.org \ --to=tj@kernel.org \ --cc=akpm@linux-foundation.org \ --cc=andrii@kernel.org \ --cc=ast@kernel.org \ --cc=bpf@vger.kernel.org \ --cc=cgroups@vger.kernel.org \ --cc=daniel@iogearbox.net \ --cc=hannes@cmpxchg.org \ --cc=haoluo@google.com \ --cc=john.fastabend@gmail.com \ --cc=jolsa@kernel.org \ --cc=kafai@fb.com \ --cc=kpsingh@kernel.org \ --cc=laoar.shao@gmail.com \ --cc=linux-mm@kvack.org \ --cc=lizefan.x@bytedance.com \ --cc=mhocko@kernel.org \ --cc=netdev@vger.kernel.org \ --cc=roman.gushchin@linux.dev \ --cc=sdf@google.com \ --cc=shakeelb@google.com \ --cc=songliubraving@fb.com \ --cc=songmuchun@bytedance.com \ --cc=yhs@fb.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.