From: Hao Luo <haoluo@google.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	cgroups@vger.kernel.org, netdev@vger.kernel.org,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
	Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
	KP Singh <kpsingh@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Benjamin Tissoires <benjamin.tissoires@redhat.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Michal Koutny <mkoutny@suse.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	David Rientjes <rientjes@google.com>,
	Stanislav Fomichev <sdf@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Yosry Ahmed <yosryahmed@google.com>
Subject: Re: [PATCH bpf-next v7 4/8] bpf: Introduce cgroup iter
Date: Tue, 9 Aug 2022 11:38:32 -0700
Message-ID: <CA+khW7j0kzP+W_Qgsim52J+HeR27XJcyMk73Hq93tsmNzT7q6w@mail.gmail.com>
In-Reply-To: <20220809162325.hwgvys5n3rivuz7a@MacBook-Pro-3.local.dhcp.thefacebook.com>

On Tue, Aug 9, 2022 at 9:23 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 08, 2022 at 05:56:57PM -0700, Hao Luo wrote:
> > On Mon, Aug 8, 2022 at 5:19 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Aug 5, 2022 at 2:49 PM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
> > > >
> > > >  - walking a cgroup's descendants in pre-order.
> > > >  - walking a cgroup's descendants in post-order.
> > > >  - walking a cgroup's ancestors.
> > > >  - processing only the given cgroup.
> > > >
[...]
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 59a217ca2dfd..4d758b2e70d6 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -87,10 +87,37 @@ struct bpf_cgroup_storage_key {
> > > >         __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> > > >  };
> > > >
> > > > +enum bpf_iter_order {
> > > > +       BPF_ITER_ORDER_DEFAULT = 0,     /* default order. */
> > >
> > > why is this default order necessary? It just adds confusion (I had to
> > > look up source code to know what is default order). I might have
> > > missed some discussion, so if there is some very good reason, then
> > > please document this in commit message. But I'd rather not do some
> > > magical default order instead. We can set 0 to mean invalid and error
> > > out, or just do SELF as the very first value (and if user forgot to
> > > specify more fancy mode, they hopefully will quickly discover this in
> > > their testing).
> > >
> >
> > PRE/POST/UP are tree-specific orders. SELF applies on all iters and
> > yields only a single object. How does task_iter express a non-self
> > order? By non-self, I mean something like "I don't care about the
> > order, just scan _all_ the objects". And this "don't care" order, IMO,
> > may be the common case. I don't think everyone cares about walking
> > order for tasks. The DEFAULT is intentionally put at the first value,
> > so that if users don't care about order, they don't have to specify
> > this field.
> >
> > If that sounds valid, maybe using "UNSPEC" instead of "DEFAULT" is better?
>
> I agree with Andrii.
> This:
> +       if (order == BPF_ITER_ORDER_DEFAULT)
> +               order = BPF_ITER_DESCENDANTS_PRE;
>
> looks like an arbitrary choice.
> imo
> BPF_ITER_DESCENDANTS_PRE = 0,
> would have been more obvious. No need to dig into definition of "default".
>
> UNSPEC = 0
> is fine too if we want user to always be conscious about the order
> and the kernel will error if that field is not initialized.
> That would be my preference, since it will match the rest of uapi/bpf.h
>

Sounds good. In the next version, will use

enum bpf_iter_order {
        BPF_ITER_ORDER_UNSPEC = 0,
        BPF_ITER_SELF_ONLY,             /* process only a single object. */
        BPF_ITER_DESCENDANTS_PRE,       /* walk descendants in pre-order. */
        BPF_ITER_DESCENDANTS_POST,      /* walk descendants in post-order. */
        BPF_ITER_ANCESTORS_UP,          /* walk ancestors upward. */
};

and explicitly list the values accepted by cgroup_iter, erroring out if
UNSPEC is detected.
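
To illustrate, the check could look roughly like the sketch below. This
is not code from the patch: cgroup_iter_check_order() is a made-up
helper name, and returning -EINVAL is only my assumption about the
error code. The point is that a zero-initialized order field reads as
UNSPEC and is rejected, together with any value cgroup_iter does not
list.

```c
#include <assert.h>
#include <errno.h>

/* Enum as proposed in this mail. */
enum bpf_iter_order {
	BPF_ITER_ORDER_UNSPEC = 0,
	BPF_ITER_SELF_ONLY,		/* process only a single object. */
	BPF_ITER_DESCENDANTS_PRE,	/* walk descendants in pre-order. */
	BPF_ITER_DESCENDANTS_POST,	/* walk descendants in post-order. */
	BPF_ITER_ANCESTORS_UP,		/* walk ancestors upward. */
};

/* A field the user never set (all zeroes) must read as UNSPEC. */
static_assert(BPF_ITER_ORDER_UNSPEC == 0,
	      "zero-initialized order must be UNSPEC");

/*
 * Hypothetical helper (name and error code are assumptions):
 * accept only the orders cgroup_iter explicitly lists, reject
 * UNSPEC and anything unknown.
 */
static int cgroup_iter_check_order(unsigned int order)
{
	switch (order) {
	case BPF_ITER_SELF_ONLY:
	case BPF_ITER_DESCENDANTS_PRE:
	case BPF_ITER_DESCENDANTS_POST:
	case BPF_ITER_ANCESTORS_UP:
		return 0;
	default:
		/* Covers BPF_ITER_ORDER_UNSPEC and out-of-range values. */
		return -EINVAL;
	}
}
```

Listing the accepted values case by case, rather than range-checking
against the last enumerator, means an order added later for another
iter type is not silently accepted by cgroup_iter.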

Also, following Andrii's comments, will change BPF_ITER_SELF to
BPF_ITER_SELF_ONLY, which does seem a little more explicit in
comparison.

> I applied the first 3 patches to ease respin.

Thanks! This helps!

> Thanks!