Re: [PATCH bpf-next v6 1/9] bpf: implement getsockopt and setsockopt hooks

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Stanislav Fomichev <sdf@fomichev.me>
Cc: Stanislav Fomichev <sdf@google.com>,
	Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>, Martin Lau <kafai@fb.com>
Subject: Re: [PATCH bpf-next v6 1/9] bpf: implement getsockopt and setsockopt hooks
Date: Tue, 18 Jun 2019 11:41:11 -0700
Message-ID: <CAADnVQ+kymi+zJww+PfPd4WWhvNA67ynGVTd7oj6jiU+XFeguQ@mail.gmail.com>
In-Reply-To: <20190618180944.GJ9636@mini-arch>

On Tue, Jun 18, 2019 at 11:09 AM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 06/18, Stanislav Fomichev wrote:
> > On 06/18, Alexei Starovoitov wrote:
> > > On Mon, Jun 17, 2019 at 11:01:01AM -0700, Stanislav Fomichev wrote:
> > > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > >
> > > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > >
> > > > The buffer memory is pre-allocated (because I don't think there is
> > > > a precedent for working with __user memory from bpf). This might be
> > > > slow to do for each {s,g}etsockopt call, that's why I've added
> > > > __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
> > > > attached to a cgroup. Note, however, that there is a race between
> > > > __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
> > > > program layout might have changed; this should not be a problem
> > > > because in general there is a race between multiple calls to
> > > > {s,g}etsockopt and user adding/removing bpf progs from a cgroup.
> > > >
> > > > The return code of the BPF program is handled as follows:
> > > > * 0: EPERM
> > > > * 1: success, execute kernel {s,g}etsockopt path after BPF prog exits
> > > > * 2: success, do _not_ execute kernel {s,g}etsockopt path after BPF
> > > >      prog exits
> > > >
> > > > Note that if 0 or 2 is returned from BPF program, no further BPF program
> > > > in the cgroup hierarchy is executed. This is in contrast with any existing
> > > > per-cgroup BPF attach_type.
> > >
> > > This is drastically different from all other cgroup-bpf progs.
> > > I think all programs should be executed regardless of return code.
> > > It seems to me that 1 vs 2 difference can be expressed via bpf program logic
> > > instead of return code.
> > >
> > > How about we do what all other cgroup-bpf progs do:
> > > "any no is no. all yes is yes"
> > > Meaning any ret=0 - EPERM back to user.
> > > If all are ret=1 - kernel handles get/set.
> > >
> > > I think the desire to differentiate 1 vs 2 came from ordering issue
> > > on getsockopt.
> > > How about for setsockopt all progs run first and then kernel.
> > > For getsockopt kernel runs first and then all progs.
> > > Then progs will have an ability to overwrite anything the kernel returns.
> > Good idea, makes sense. For getsockopt we'd also need to pass the return
> > value of the kernel getsockopt to let bpf programs override it, but seems
> > doable. Let me play with it a bit; I'll send another version if nothing
> > major comes up.
> >
> > Thanks for another round of review!
> One clarification: we'd still probably need to have 3 return codes for
> setsockopt:
> * any 0 - EPERM
> * all 1 - continue with the kernel path (i.e. apply this sockopt as is)
> * any 2 - return after all BPF hooks are executed (bypass kernel)
>           (any 0 trumps any 2 -> EPERM)
>
> The context is readonly for setsockopt, so it shouldn't be an issue.
> Let me know if you have a better idea how to handle that.

I think we don't really need 2.
The progs can reduce optlen to zero (or optname to BPF_EMPTY_SOCKOPT)
and do ret=1.
Then the kernel can see that there is nothing to be done and return 0 to
user space.
Since the parent prog in the chain will be able to see that the child prog
set optlen to zero, it will be able to overwrite it if necessary.
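
A rough userspace sketch of that setsockopt flow, not kernel code. Every
name in it (struct sockopt_ctx, run_setsockopt, the prog_* functions, the
length 4 used by the parent) is hypothetical and only models the proposal:
all progs run, any ret=0 gives EPERM, and optlen == 0 with all ret=1 means
the kernel path is skipped. The array order (child first, parent last) is
an assumption taken from the wording above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the setsockopt hook context. */
struct sockopt_ctx {
	int level;
	int optname;
	int optlen;
};

/* A modelled prog returns 0 (deny) or 1 (allow) and may edit the ctx. */
typedef int (*sockopt_prog)(struct sockopt_ctx *ctx);

/* Stand-in for the kernel's own setsockopt path; only counts calls. */
static int kernel_setsockopt_calls;

static int kernel_setsockopt(const struct sockopt_ctx *ctx)
{
	(void)ctx;
	kernel_setsockopt_calls++;
	return 0;
}

/*
 * Run every prog regardless of earlier verdicts.
 * Any ret=0 -> -EPERM. All ret=1 and optlen == 0 -> nothing left to do,
 * return 0 without touching the kernel path. Otherwise the kernel runs.
 */
static int run_setsockopt(const sockopt_prog *progs, size_t nprogs,
			  struct sockopt_ctx *ctx)
{
	int allow = 1;
	size_t i;

	assert(progs != NULL || nprogs == 0);

	for (i = 0; i < nprogs; i++)
		if (progs[i](ctx) == 0)
			allow = 0;

	if (!allow)
		return -EPERM;
	if (ctx->optlen == 0)
		return 0;
	return kernel_setsockopt(ctx);
}

/* Example progs for the model. */
static int prog_allow(struct sockopt_ctx *ctx)
{
	(void)ctx;
	return 1;
}

static int prog_deny(struct sockopt_ctx *ctx)
{
	(void)ctx;
	return 0;
}

/* Child handles the option itself and asks the kernel to do nothing. */
static int prog_consume(struct sockopt_ctx *ctx)
{
	ctx->optlen = 0;
	return 1;
}

/* Parent sees optlen == 0 and overrides the child's decision. */
static int prog_parent_override(struct sockopt_ctx *ctx)
{
	if (ctx->optlen == 0)
		ctx->optlen = 4;
	return 1;
}
```

In this model, a chain of { prog_consume, prog_parent_override } ends up
in the kernel path again, while { prog_allow, prog_consume } returns 0
without calling it.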

getsockopt will be clean as well.
If all progs return 1, whatever the progs produced is returned to user space,
and the progs will be able to see what the kernel wanted to return because
the kernel's getsockopt logic ran first.
ret=2 doesn't have any meaning for getsockopt, so it is nice to keep
setsockopt symmetrical and not do it there either.
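
The getsockopt ordering can be modelled the same way. Again this is a
userspace sketch under assumptions, not the patch: struct getsockopt_ctx,
run_getsockopt, the retval field (following the idea quoted above of
passing the kernel's return value to the progs) and the 1500/1400 values
are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define OPT_BUF_MAX 16

/* Hypothetical, simplified stand-in for the getsockopt hook context. */
struct getsockopt_ctx {
	int optname;
	int optlen;
	unsigned char optval[OPT_BUF_MAX];
	int retval;	/* what the kernel path returned */
};

typedef int (*getsockopt_prog)(struct getsockopt_ctx *ctx);

/* Stand-in for the kernel's getsockopt logic: reports the value 1500. */
static int kernel_getsockopt(struct getsockopt_ctx *ctx)
{
	int value = 1500;

	memcpy(ctx->optval, &value, sizeof(value));
	ctx->optlen = (int)sizeof(value);
	return 0;
}

/*
 * Kernel runs first, then every prog sees (and may rewrite) its result.
 * Any ret=0 -> -EPERM. All ret=1 -> whatever the progs left in the ctx
 * goes back to user space.
 */
static int run_getsockopt(const getsockopt_prog *progs, size_t nprogs,
			  struct getsockopt_ctx *ctx)
{
	int allow = 1;
	size_t i;

	assert(progs != NULL || nprogs == 0);

	ctx->retval = kernel_getsockopt(ctx);

	for (i = 0; i < nprogs; i++)
		if (progs[i](ctx) == 0)
			allow = 0;

	return allow ? ctx->retval : -EPERM;
}

/* Helper for the model: read the option value back as an int. */
static int ctx_value(const struct getsockopt_ctx *ctx)
{
	int value;

	memcpy(&value, ctx->optval, sizeof(value));
	return value;
}

/* Example prog: cap the value the kernel reported at 1400. */
static int prog_clamp(struct getsockopt_ctx *ctx)
{
	int value;

	if (ctx->retval != 0 || ctx->optlen != (int)sizeof(value))
		return 1;

	value = ctx_value(ctx);
	if (value > 1400) {
		value = 1400;
		memcpy(ctx->optval, &value, sizeof(value));
	}
	return 1;
}

/* Example prog: refuse the call. */
static int prog_deny(struct getsockopt_ctx *ctx)
{
	(void)ctx;
	return 0;
}
```

With only prog_clamp attached, the model returns 0 and user space would
see the clamped value; adding prog_deny anywhere in the chain turns the
result into -EPERM while the other progs still run.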