Re: [PATCH bpf-next v6 1/9] bpf: implement getsockopt and setsockopt hooks

From: Stanislav Fomichev <sdf@fomichev.me>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Stanislav Fomichev <sdf@google.com>,
	netdev@vger.kernel.org, bpf@vger.kernel.org, davem@davemloft.net,
	ast@kernel.org, daniel@iogearbox.net, Martin Lau <kafai@fb.com>
Subject: Re: [PATCH bpf-next v6 1/9] bpf: implement getsockopt and setsockopt hooks
Date: Tue, 18 Jun 2019 11:09:44 -0700
Message-ID: <20190618180944.GJ9636@mini-arch>
In-Reply-To: <20190618164913.GI9636@mini-arch>

On 06/18, Stanislav Fomichev wrote:
> On 06/18, Alexei Starovoitov wrote:
> > On Mon, Jun 17, 2019 at 11:01:01AM -0700, Stanislav Fomichev wrote:
> > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > 
> > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > 
> > > The buffer memory is pre-allocated (because I don't think there is
> > > a precedent for working with __user memory from bpf). This might be
> > > slow to do for each {s,g}etsockopt call, that's why I've added
> > > __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
> > > attached to a cgroup. Note, however, that there is a race between
> > > __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
> > > program layout might have changed; this should not be a problem
> > > because in general there is a race between multiple calls to
> > > {s,g}etsockopt and a user adding/removing bpf progs from a cgroup.
> > > 
> > > The return code of the BPF program is handled as follows:
> > > * 0: EPERM
> > > * 1: success, execute kernel {s,g}etsockopt path after BPF prog exits
> > > * 2: success, do _not_ execute kernel {s,g}etsockopt path after BPF
> > >      prog exits
> > > 
> > > Note that if 0 or 2 is returned from BPF program, no further BPF program
> > > in the cgroup hierarchy is executed. This is in contrast with any existing
> > > per-cgroup BPF attach_type.
> > 
> > This is drastically different from all other cgroup-bpf progs.
> > I think all programs should be executed regardless of return code.
> > It seems to me that 1 vs 2 difference can be expressed via bpf program logic
> > instead of return code.
> > 
> > How about we do what all other cgroup-bpf progs do:
> > "any no is no. all yes is yes"
> > Meaning any ret=0 - EPERM back to user.
> > If all are ret=1 - kernel handles get/set.
> > 
> > I think the desire to differentiate 1 vs 2 came from ordering issue
> > on getsockopt.
> > How about for setsockopt all progs run first and then kernel.
> > For getsockopt kernel runs first and then all progs.
> > Then progs will have an ability to overwrite anything the kernel returns.
> Good idea, makes sense. For getsockopt we'd also need to pass the return
> value of the kernel getsockopt to let bpf programs override it, but seems
> doable. Let me play with it a bit; I'll send another version if nothing
> major comes up.
> 
> Thanks for another round of review!
One clarification: we'd still probably need to have 3 return codes for
setsockopt:
* any 0 - EPERM
* all 1 - continue with the kernel path (i.e. apply this sockopt as is)
* any 2 - return after all BPF hooks are executed (bypass kernel)
          (any 0 trumps any 2 -> EPERM)
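
For illustration only, the rules above could be combined roughly as
follows. This is a userspace sketch, not code from the patch: the enum,
its names, and combine_setsockopt_verdicts() are hypothetical, and it
assumes every attached program has already run and returned 0, 1, or 2.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical names for the three proposed setsockopt return codes. */
enum sockopt_verdict {
	SOCKOPT_DENY   = 0,	/* any 0: -EPERM back to the user */
	SOCKOPT_KERNEL = 1,	/* all 1: continue with the kernel path */
	SOCKOPT_BYPASS = 2,	/* any 2: skip the kernel path */
};

/*
 * Fold the return codes of all programs into one decision.
 * All programs are assumed to have run already; this only inspects
 * the collected results. Returns -EPERM, SOCKOPT_KERNEL, or
 * SOCKOPT_BYPASS.
 */
static int combine_setsockopt_verdicts(const int *rets, size_t n)
{
	int bypass = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (rets[i] == SOCKOPT_DENY)
			return -EPERM;	/* any 0 trumps any 2 */
		if (rets[i] == SOCKOPT_BYPASS)
			bypass = 1;
	}

	return bypass ? SOCKOPT_BYPASS : SOCKOPT_KERNEL;
}
```

With results {1, 2, 1} this yields the bypass verdict; with {2, 0, 1}
it yields -EPERM, because a single 0 wins over any 2.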

The context is read-only for setsockopt, so it shouldn't be an issue.
Let me know if you have a better idea of how to handle that.