All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tejun Heo <tj@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>, Martin Lau <kafai@fb.com>,
	Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
	john fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	jolsa@kernel.org, Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	lizefan.x@bytedance.com, Cgroups <cgroups@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
Date: Fri, 19 Aug 2022 07:06:51 -1000	[thread overview]
Message-ID: <Yv/DK+AGlMeBGkF1@slm.duckdns.org> (raw)
In-Reply-To: <CALOAHbDcrj1ifFsNMHBEih5-SXY2rWViig4rQHi9N07JY6CjXA@mail.gmail.com>

Hello,

On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > We have the exact same problem for any resources which span multiple
> > > instances of a service including page cache, tmpfs instances and any other
> > > thing which can persist longer than procss life time. My current opinion is
> >
> > To expand a bit more on this point, once we start including page cache and
> > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > not-yet-but-eventually CPU usage.
> 
> Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?

Introducing a new layer in cgroup2 doesn't mean that any specific resource
controller is enabled, so there is no runtime overhead difference. In terms
of logical complexity, introducing a localized layer seems a lot more
straightforward than building a whole separate tree.

Note that the same applies to cgroup1 where collapsed controller tree is
represented by simply not creating those layers in that particular
controller tree.

No matter how we cut the problem here, if we want to track these persistent
resources, we have to create a cgroup to host them somewhere. The discussion
we're having is mostly around where to put them. With your proposal, it can
be anywhere and you draw out an example where the persistent cgroups form
their own separate tree. What I'm saying is that the logical place to put it
is where the current resource consumption is and we just need to put the
persistent entity as the parent of the instances.

Flexibility, just like anything else, isn't free. Here, if we extrapolate
this approach, the cost is evidently hefty in that it doesn't generically
work with the basic resource control structure.

> > Once you start splitting the tree like
> > you're suggesting here, all those will break down and now we have to worry
> > about how to split resource accounting and control for the same entities
> > across two split branches of the tree, which doesn't really make any sense.
> 
> The k8s has already been broken thanks to the memcg accounting on  bpf memory.
> If you ignored it, I paste it below.
> [0]"1. The memory usage is not consistent between the first generation and
> new generations."
> 
> This issue will persist even if you introduce a new layer.

Please watch your tone.

Again, this isn't a problem specific to k8s. We have the same problem with
e.g. persistent tmpfs. One idea which I'm not against is allowing specific
resources to be charged to an ancestor. We gotta think carefully about how
such charges should be granted / denied but an approach like that jives well
with the existing hierarchical control structure and because introducing a
persistent layer does too, the combination of the two works well.

> > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > is a dead-end. Please ditch it.
> 
> It makes non-sensen to ditch it.
> Because, the hierarchy I described in the commit log is *one* use case
> of the selectable memcg, but not *the only one* use case of it. If you
> dislike that hierarchy, I will remove it to avoid misleading you.

But if you drop that, what'd be the rationale for adding what you're
proposing? Why would we want bpf memory charges to be attached any part of
the hierarchy?

> Even if you introduce a new layer, you still need the selectable memcg.
> For example, to avoid the issue I described in [0],  you still need to
> charge to the parent cgroup instead of the current cgroup.

As I wrote above, we've been discussing the above. Again, I'd be a lot more
amenable to such approach because it fits with how everything is structured.

> That's why I described in the commit log that the selectable memcg is flexible.

Hopefully, my point on this is clear by now.

Thanks.

-- 
tejun

WARNING: multiple messages have this Message-ID (diff)
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
To: Yafang Shao <laoar.shao-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Daniel Borkmann <daniel-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>,
	Andrii Nakryiko <andrii-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Martin Lau <kafai-b10kYP2dOMg@public.gmane.org>,
	Song Liu <songliubraving-b10kYP2dOMg@public.gmane.org>,
	Yonghong Song <yhs-b10kYP2dOMg@public.gmane.org>,
	john fastabend
	<john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	KP Singh <kpsingh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Stanislav Fomichev <sdf-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Hao Luo <haoluo-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	jolsa-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Roman Gushchin
	<roman.gushchin-fxUVXftIFDnyG1zEObXtfA@public.gmane.org>,
	Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org,
	Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	netdev <netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	bpf <bpf-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Linux MM <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>
Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
Date: Fri, 19 Aug 2022 07:06:51 -1000	[thread overview]
Message-ID: <Yv/DK+AGlMeBGkF1@slm.duckdns.org> (raw)
In-Reply-To: <CALOAHbDcrj1ifFsNMHBEih5-SXY2rWViig4rQHi9N07JY6CjXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hello,

On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >
> > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > We have the exact same problem for any resources which span multiple
> > > instances of a service including page cache, tmpfs instances and any other
> > > thing which can persist longer than procss life time. My current opinion is
> >
> > To expand a bit more on this point, once we start including page cache and
> > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > not-yet-but-eventually CPU usage.
> 
> Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?

Introducing a new layer in cgroup2 doesn't mean that any specific resource
controller is enabled, so there is no runtime overhead difference. In terms
of logical complexity, introducing a localized layer seems a lot more
straightforward than building a whole separate tree.

Note that the same applies to cgroup1 where collapsed controller tree is
represented by simply not creating those layers in that particular
controller tree.

No matter how we cut the problem here, if we want to track these persistent
resources, we have to create a cgroup to host them somewhere. The discussion
we're having is mostly around where to put them. With your proposal, it can
be anywhere and you draw out an example where the persistent cgroups form
their own separate tree. What I'm saying is that the logical place to put it
is where the current resource consumption is and we just need to put the
persistent entity as the parent of the instances.

Flexibility, just like anything else, isn't free. Here, if we extrapolate
this approach, the cost is evidently hefty in that it doesn't generically
work with the basic resource control structure.

> > Once you start splitting the tree like
> > you're suggesting here, all those will break down and now we have to worry
> > about how to split resource accounting and control for the same entities
> > across two split branches of the tree, which doesn't really make any sense.
> 
> The k8s has already been broken thanks to the memcg accounting on  bpf memory.
> If you ignored it, I paste it below.
> [0]"1. The memory usage is not consistent between the first generation and
> new generations."
> 
> This issue will persist even if you introduce a new layer.

Please watch your tone.

Again, this isn't a problem specific to k8s. We have the same problem with
e.g. persistent tmpfs. One idea which I'm not against is allowing specific
resources to be charged to an ancestor. We gotta think carefully about how
such charges should be granted / denied but an approach like that jives well
with the existing hierarchical control structure and because introducing a
persistent layer does too, the combination of the two works well.

> > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > is a dead-end. Please ditch it.
> 
> It makes non-sensen to ditch it.
> Because, the hierarchy I described in the commit log is *one* use case
> of the selectable memcg, but not *the only one* use case of it. If you
> dislike that hierarchy, I will remove it to avoid misleading you.

But if you drop that, what'd be the rationale for adding what you're
proposing? Why would we want bpf memory charges to be attached any part of
the hierarchy?

> Even if you introduce a new layer, you still need the selectable memcg.
> For example, to avoid the issue I described in [0],  you still need to
> charge to the parent cgroup instead of the current cgroup.

As I wrote above, we've been discussing the above. Again, I'd be a lot more
amenable to such approach because it fits with how everything is structured.

> That's why I described in the commit log that the selectable memcg is flexible.

Hopefully, my point on this is clear by now.

Thanks.

-- 
tejun

  reply	other threads:[~2022-08-19 17:45 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 01/12] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 19:11   ` Yosry Ahmed
2022-08-18 19:11     ` Yosry Ahmed
2022-08-18 14:31 ` [PATCH bpf-next v2 02/12] bpf: Introduce new helper bpf_map_put_memcg() Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 03/12] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 04/12] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 05/12] bpf: Save memcg in bpf_map_init_from_attr() Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 06/12] bpf: Use scoped-based charge in bpf_map_area_alloc Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 07/12] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 17:30   ` Andrii Nakryiko
2022-08-18 14:31 ` [PATCH bpf-next v2 08/12] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 09/12] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
2022-08-18 20:38   ` Shakeel Butt
2022-08-19  1:21     ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 11/12] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 12/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-08-18 22:20 ` [PATCH bpf-next v2 00/12] " Tejun Heo
2022-08-18 22:20   ` Tejun Heo
2022-08-18 22:33   ` Tejun Heo
2022-08-19  1:09     ` Yafang Shao
2022-08-19  1:09       ` Yafang Shao
2022-08-19 17:06       ` Tejun Heo [this message]
2022-08-19 17:06         ` Tejun Heo
2022-08-20  2:25         ` Yafang Shao
2022-08-20  2:25           ` Yafang Shao
2022-08-22 11:29           ` [RFD RESEND] cgroup: Persistent memory usage tracking Tejun Heo
2022-08-22 11:29             ` Tejun Heo
2022-08-22 16:12             ` Shakeel Butt
2022-08-22 16:12               ` Shakeel Butt
2022-08-22 19:02             ` Mina Almasry
2022-08-22 19:02               ` Mina Almasry
2022-08-22 21:19               ` Johannes Weiner
2022-08-22 21:19                 ` Johannes Weiner
2022-08-22 21:52                 ` Mina Almasry
2022-08-22 21:52                   ` Mina Almasry
2022-08-23  3:01                 ` Roman Gushchin
2022-08-23  3:01                   ` Roman Gushchin
2022-08-23  3:14                   ` Tejun Heo
2022-08-23  3:14                     ` Tejun Heo
2022-08-24 19:02                     ` Mina Almasry
2022-08-24 19:02                       ` Mina Almasry
2022-08-25 17:59                       ` Tejun Heo
2022-08-25 17:59                         ` Tejun Heo
2022-08-23 11:08             ` Yafang Shao
2022-08-23 11:08               ` Yafang Shao
2022-08-23 17:12               ` Tejun Heo
2022-08-23 17:12                 ` Tejun Heo
2022-08-24 11:57                 ` Yafang Shao
2022-08-24 11:57                   ` Yafang Shao
2022-08-19  0:59   ` [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-08-19 16:45     ` Tejun Heo
2022-08-19 16:45       ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yv/DK+AGlMeBGkF1@slm.duckdns.org \
    --to=tj@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=hannes@cmpxchg.org \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=laoar.shao@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=lizefan.x@bytedance.com \
    --cc=mhocko@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=roman.gushchin@linux.dev \
    --cc=sdf@google.com \
    --cc=shakeelb@google.com \
    --cc=songliubraving@fb.com \
    --cc=songmuchun@bytedance.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.