bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bui Quang Minh <minhquangbui99@gmail.com>
To: "Michal Koutný" <mkoutny@suse.com>, "Tejun Heo" <tj@kernel.org>
Cc: cgroups@vger.kernel.org, kernel test robot <lkp@intel.com>,
	Zefan Li <lizefan.x@bytedance.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	bpf@vger.kernel.org
Subject: Re: [PATCH v2] cgroup: Kill the parent controller when its last child is killed
Date: Tue, 5 Apr 2022 21:58:01 +0700	[thread overview]
Message-ID: <bdd4104d-390e-74c7-0de1-a275044831a5@gmail.com> (raw)
In-Reply-To: <20220405091158.GA13806@blackbody.suse.cz>

On 4/5/22 16:11, Michal Koutný wrote:
> On Mon, Apr 04, 2022 at 07:37:24AM -1000, Tejun Heo <tj@kernel.org> wrote:
>> And the suggested behavior doesn't make much sense to me. It doesn't
>> actually solve the underlying problem but instead always make css
>> destructions recursive which can lead to surprises for normal use cases.
> 
> I also don't like the nested special-case use percpu_ref_kill().

After thinking more carefully, I agree with your points. The recursive 
css destruction only does not fixup the previous parents' metadata 
correctly and it is not a desirable behavior too.

> I looked at this and my supposed solution turned out to be a revert of
> commit 3c606d35fe97 ("cgroup: prevent mount hang due to memory
> controller lifetime"). So at the unmount time it's necessary to distinguish
> children that are in the process of removal from children than are online or
> pinned indefinitely.
> 
> What about:
> 
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
>          struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> 
>          /*
> -        * If @root doesn't have any children, start killing it.
> +        * If @root doesn't have any children held by residual state (e.g.
> +        * memory controller), start killing it, flush workqueue to filter out
> +        * transiently offlined children.
>           * This prevents new mounts by disabling percpu_ref_tryget_live().
>           *
>           * And don't kill the default root.
>           */
> +       flush_workqueue(cgroup_destroy_wq);
>          if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
>              !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
>                  cgroup_bpf_offline(&root->cgrp);
> 
> (I suspect there's technically still possible a race between concurrent unmount
> and the last rmdir but the flush on kill_sb path should be affordable and it
> prevents unnecessarily conserved cgroup roots.)

Your proposed solution looks good to me. As with my example the flush 
will guarantee the rmdir and its deferred work has been executed before 
cleaning up in umount path.

But what do you think about

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f01ff231a484..5578ee76e789 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2215,6 +2215,7 @@ static void cgroup_kill_sb(struct super_block *sb)
                 cgroup_bpf_offline(&root->cgrp);
                 percpu_ref_kill(&root->cgrp.self.refcnt);
         }
+       root->cgrp.flags |= CGRP_UMOUNT;
         cgroup_put(&root->cgrp);
         kernfs_kill_sb(sb);
  }
@@ -5152,12 +5153,28 @@ static void css_release_work_fn(struct 
work_struct *work)
                 container_of(work, struct cgroup_subsys_state, 
destroy_work);
         struct cgroup_subsys *ss = css->ss;
         struct cgroup *cgrp = css->cgroup;
+       struct cgroup *parent = cgroup_parent(cgrp);

         mutex_lock(&cgroup_mutex);

         css->flags |= CSS_RELEASED;
         list_del_rcu(&css->sibling);

+       /*
+        * If parent doesn't have any children, start killing it.
+        * And don't kill the default root.
+        */
+       if (parent && list_empty(&parent->self.children) &&
+           parent->flags & CGRP_UMOUNT &&
+           parent != &cgrp_dfl_root.cgrp &&
+           !percpu_ref_is_dying(&parent->self.refcnt)) {
+#ifdef CONFIG_CGROUP_BPF
+               if (!percpu_ref_is_dying(&cgrp->bpf.refcnt))
+                       cgroup_bpf_offline(parent);
+#endif
+               percpu_ref_kill(&parent->self.refcnt);
+       }
+
         if (ss) {
                 /* css release path */
                 if (!list_empty(&css->rstat_css_node)) {

The idea is to set a flag in the umount path, in the rmdir it will 
destroy the css in case its direct parent is umounted, no recursive 
here. This is just an incomplete example, we may need to reset that flag 
when remounting.

Thanks,
Quang Minh.

  reply	other threads:[~2022-04-06  0:09 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-04 14:25 [PATCH v2] cgroup: Kill the parent controller when its last child is killed Bui Quang Minh
2022-04-04 17:37 ` Tejun Heo
2022-04-05  9:11   ` Michal Koutný
2022-04-05 14:58     ` Bui Quang Minh [this message]
2022-04-21 22:37       ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bdd4104d-390e-74c7-0de1-a275044831a5@gmail.com \
    --to=minhquangbui99@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=hannes@cmpxchg.org \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan.x@bytedance.com \
    --cc=lkp@intel.com \
    --cc=mkoutny@suse.com \
    --cc=netdev@vger.kernel.org \
    --cc=songliubraving@fb.com \
    --cc=tj@kernel.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).