Re: [RFC PATCH bpf-next v2 1/9] cgroup: Make operations on the cgroup root_list RCU safe

From: Yafang Shao <laoar.shao@gmail.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: ast@kernel.org, daniel@iogearbox.net, john.fastabend@gmail.com,
	 andrii@kernel.org, martin.lau@linux.dev, song@kernel.org,
	 yonghong.song@linux.dev, kpsingh@kernel.org, sdf@google.com,
	 haoluo@google.com, jolsa@kernel.org, tj@kernel.org,
	lizefan.x@bytedance.com,  hannes@cmpxchg.org,
	yosryahmed@google.com, sinquersw@gmail.com,
	 cgroups@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [RFC PATCH bpf-next v2 1/9] cgroup: Make operations on the cgroup root_list RCU safe
Date: Wed, 18 Oct 2023 10:51:47 +0800	[thread overview]
Message-ID: <CALOAHbCP8HVs2UjuegLHvZSRHbfJ2ONMNC58AQxwVaNGSDYzOg@mail.gmail.com> (raw)
In-Reply-To: <q7yaokzrcg5effyr2j7n6f6ljlez755uunlzfzpjgktfmrvhnb@t44uxkbj7k5k>

On Tue, Oct 17, 2023 at 9:20 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Oct 17, 2023 at 12:45:38PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> > Therefore, making it RCU-safe would be beneficial.
>
> Notice that whole cgroup_destroy_root() is actually an RCU callback (via
> css_free_rwork_fn()). So the list traversal under RCU should alreay be
> OK wrt its stability. Do you see a loophole in this argument?

css_free_rwork_fn() is designed for handling the cgroup's CSS. When
the RCU callback is executed, this specific CSS is not accessible to
other tasks. However, the cgroup root remains accessible to other
tasks. For instance, other tasks can still traverse the cgroup
root_list and access the cgroup root that is currently being
destroyed. Hence, we must take explicit measures to make access to the
cgroup root RCU-safe.

>
>
> >  /* iterate across the hierarchies */
> >  #define for_each_root(root)                                          \
> > -     list_for_each_entry((root), &cgroup_roots, root_list)
> > +     list_for_each_entry_rcu((root), &cgroup_roots, root_list,       \
> > +                             !lockdep_is_held(&cgroup_mutex))
>
> The extra condition should be constant false (regardless of
> cgroup_mutex). IOW, RCU read lock is always required.

IIUC, the RCU read lock is not required, while the cgroup_mutex is
required when we want to traverse the cgroup root_list, such as in the
case of cgroup_attach_task_all.

>
> > @@ -1386,13 +1386,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
> >               }
> >       }
> >
> > -     BUG_ON(!res_cgroup);
> > +     WARN_ON_ONCE(!res_cgroup && lockdep_is_held(&cgroup_mutex));
> >       return res_cgroup;
>
> Hm, this would benefit from a comment why !res_cgroup is conditionally
> acceptable.

The cgrp_cset_link is freed before we remove the cgroup root from the
root_list. Consequently, when accessing a cgroup root, the cset_link
may have already been freed, resulting in a NULL res_cgroup. However,
by holding the cgroup_mutex, we ensure that res_cgroup cannot be NULL.

Will add the comments in the next version.

>
> >  }
> >
> >  /*
> >   * look up cgroup associated with current task's cgroup namespace on the
> > - * specified hierarchy
> > + * specified hierarchy. Umount synchronization is ensured via VFS layer,
> > + * so we don't have to hold cgroup_mutex to prevent the root from being
> > + * destroyed.
>
> I tried the similar via explicit lockdep invocation (in a thread I
> linked to you previously) and VFS folks weren't ethusiastic about it.

Thanks for your information. will take a look at it.

>
> You cannot hide this synchronization assumption in a mere comment.

will think about how to address it.

--
Regards
Yafang