From mboxrd@z Thu Jan  1 00:00:00 1970
From: Joe Stringer <joe@wand.net.nz>
Subject: Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
Date: Wed, 16 May 2018 11:55:37 -0700
Message-ID: <CAOftzPgWjrVfKVdkdWpReEp8d=btVZLxGNd3xAA-Xr_75YBNyg@mail.gmail.com>
References: <20180509210709.7201-1-joe@wand.net.nz> <20180509210709.7201-8-joe@wand.net.nz>
 <20180511045722.p7r4tbog66omohs6@kafai-mbp.dhcp.thefacebook.com>
 <CAOftzPg-2JdMOgvwTtubKijaF8mMO+s5w7CdYmFDuBDK3gAiog@mail.gmail.com>
 <20180511214111.hi6ax6qoe37al37q@kafai-mbp.dhcp.thefacebook.com>
 <CAOftzPinEyoaaokgsssA809LQ571x_u8hhA1rXw0HjN8a5O-7w@mail.gmail.com>
 <20180515031657.4wza3jsatn2ggsj5@ast-mbp> <20180515164801.lp5oyabhnukagczc@kafai-mbp.dhcp.thefacebook.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
        Joe Stringer <joe@wand.net.nz>, daniel@iogearbox.net,
        netdev <netdev@vger.kernel.org>, ast@kernel.org,
        john fastabend <john.fastabend@gmail.com>
To: Martin KaFai Lau <kafai@fb.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-qk0-f194.google.com ([209.85.220.194]:45579 "EHLO
        mail-qk0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750778AbeEPSz7 (ORCPT
        <rfc822;netdev@vger.kernel.org>); Wed, 16 May 2018 14:55:59 -0400
Received: by mail-qk0-f194.google.com with SMTP id a8-v6so1543386qkj.12
        for <netdev@vger.kernel.org>; Wed, 16 May 2018 11:55:59 -0700 (PDT)
In-Reply-To: <20180515164801.lp5oyabhnukagczc@kafai-mbp.dhcp.thefacebook.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 15 May 2018 at 09:48, Martin KaFai Lau <kafai@fb.com> wrote:
> On Mon, May 14, 2018 at 08:16:59PM -0700, Alexei Starovoitov wrote:
>> On Fri, May 11, 2018 at 05:54:33PM -0700, Joe Stringer wrote:
>> > On 11 May 2018 at 14:41, Martin KaFai Lau <kafai@fb.com> wrote:
>> > > On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
>> > >> On 10 May 2018 at 22:00, Martin KaFai Lau <kafai@fb.com> wrote:
>> > >> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
>> > >> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
>> > >> >> programs to find out if there is a socket listening on this host, and
>> > >> >> returns a socket pointer which the BPF program can then access to
>> > >> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
>> > >> >> takes a reference on the socket, so when a BPF program makes use of this
>> > >> >> function, it must subsequently pass the returned pointer into the newly
>> > >> >> added sk_release() to return the reference.
>> > >> >>
>> > >> >> By way of example, the following pseudocode would filter inbound
>> > >> >> connections at XDP if there is no corresponding service listening for
>> > >> >> the traffic:
>> > >> >>
>> > >> >>   struct bpf_sock_tuple tuple;
>> > >> >>   struct bpf_sock_ops *sk;
>> > >> >>
>> > >> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>> > >> >>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
>> > >> >>   if (!sk) {
>> > >> >>     // Couldn't find a socket listening for this traffic. Drop.
>> > >> >>     return TC_ACT_SHOT;
>> > >> >>   }
>> > >> >>   bpf_sk_release(sk, 0);
>> > >> >>   return TC_ACT_OK;
>> > >> >>
>> > >> >> Signed-off-by: Joe Stringer <joe@wand.net.nz>
>> > >> >> ---
>> > >>
>> > >> ...
>> > >>
>> > >> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>> > >> >>  };
>> > >> >>  #endif
>> > >> >>
>> > >> >> +struct sock *
>> > >> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
>> > >> > Would it be possible to have another version that
>> > >> > returns a sk without taking its refcnt?
>> > >> > It may have performance benefit.
>> > >>
>> > >> Not really. The sockets are not RCU-protected, and established sockets
>> > >> may be torn down without notice. If we don't take a reference, there's
>> > >> no guarantee that the socket will continue to exist for the duration
>> > >> of running the BPF program.
>> > >>
>> > >> From what I follow, the comment below has a hidden implication which
>> > >> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
>> > >> directly freed regardless of RCU.
>> > > Right, SOCK_RCU_FREE sk is the one I am concern about.
>> > > For example, TCP_LISTEN socket does not require taking a refcnt
>> > > now.  Doing a bpf_sk_lookup() may have a rather big
>> > > impact on handling TCP syn flood.  or the usual intention
>> > > is to redirect instead of passing it up to the stack?
>> >
>> > I see, if you're only interested in listen sockets then probably this
>> > series could be extended with a new flag, eg something like
>> > BPF_F_SK_FIND_LISTENERS which restricts the set of possible sockets
>> > found to only listen sockets, then the implementation would call into
>> > __inet_lookup_listener() instead of inet_lookup(). The presence of
>> > that flag in the relevant register during CALL instruction would show
>> > that the verifier should not reference-track the result, then there'd
>> > need to be a check on the release to ensure that this unreferenced
>> > socket is never released. Just a thought, completely untested and I
>> > could still be missing some detail..
>> >
>> > That said, I don't completely follow how you would expect to handle
>> > the traffic for sockets that are already established - the helper
>> > would no longer find those sockets, so you wouldn't know whether to
>> > pass the traffic up the stack for established traffic or not.
>>
>> I think Martin has a valid concern here that if somebody starts using
>> this helper on the rx traffic the bpf program (via these two new
>> helpers) will be doing refcnt++ and refcnt-- even for listener
>> sockets which will cause synflood to suffer.
>> One can argue that this is bpf author mistake, but without fixes
>> (and api changes) to the helper the programmer doesn't really have a way
>> of avoiding this situation.
>> Also udp sockets don't need refcnt at all.
>> How about we split this single helper into three:
>> - bpf_sk_lookup_tcp_established() that will return refcnt-ed socket
>> and has to be bpf_sk_release()d by the program.
>> - bpf_sk_lookup_tcp_listener() that doesn't refcnt, since progs
>> run in rcu.
>> - bpf_sk_lookup_udp() that also doesn't refcnt.
>> The logic you want to put into this helper can be easily
>> replicated with these three helpers and the whole thing will
>> be much faster.
>> Thoughts?
> Just came to my mind.
>
> or can we explore something like:
>
> On the bpf_sk_lookup() side, use __inet[6]_lookup()
> and __udp[46]_lib_lookup() instead.  That should
> only take refcnt if it has to.
>
> On the bpf_sk_release() side, it skips refcnt--
> if sk is SOCK_RCU_FREE.

Reflecting the discussion from IOVisor call:

I voiced a concern with the above proposal by Alexei that it leaks
kernel implementation detail (established sockets are refcnted) into
the BPF API.

Martin's proposal here addresses this concern. We can force all
sk_lookup()s to match with a bpf_sk_release(), then inside the
bpf_sk_release() we can deal with the details of whether any freeing
is actually required.

It's still useful to split the helpers out into bpf_sk_lookup_tcp()
and bpf_sk_lookup_udp() because then we don't need to deal with the
forward-compatibility concern of adding support for different socket
types (eg SCTP). That said, the TCP established/listener split does
not have an immediate user, so we don't need to split these at this
time. If there is a use case for only finding the listener sockets, we
can always add a flag to the bpf_sk_lookup_tcp() helper to only find
the listener sockets.

I'll respin, thanks for the feedback all.