Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper

From: Martin Lau <kafai@fb.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>, Kernel Team <Kernel-team@fb.com>,
	Lawrence Brakmo <brakmo@fb.com>
Subject: Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
Date: Thu, 7 Feb 2019 07:27:24 +0000
Message-ID: <20190207072716.qat44fx2t6wqvu4l@kafai-mbp.dhcp.thefacebook.com>
In-Reply-To: <20190205005029.yxowrbz4aiht7jhm@kafai-mbp>

On Mon, Feb 04, 2019 at 04:50:32PM -0800, Martin Lau wrote:
> On Mon, Feb 04, 2019 at 11:33:28PM +0100, Daniel Borkmann wrote:
> > Hi Martin,
> > 
> > On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> > > In the kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
> > > before accessing the fields in sock.  For example, in __netdev_pick_tx:
> > > 
> > > static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> > > 			    struct net_device *sb_dev)
> > > {
> > > 	/* ... */
> > > 
> > > 	struct sock *sk = skb->sk;
> > > 
> > > 		if (queue_index != new_index && sk &&
> > > 		    sk_fullsock(sk) &&
> > > 		    rcu_access_pointer(sk->sk_dst_cache))
> > > 			sk_tx_queue_set(sk, new_index);
> > > 
> > > 	/* ... */
> > > 
> > > 	return queue_index;
> > > }
> > > 
> > > This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff",
> > > where a few of the convert_ctx_access() functions in filter.c have
> > > already been accessing the skb->sk sock_common's fields,
> > > e.g. sock_ops_convert_ctx_access().
> > > 
> > > "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> > > Some of the fields in "bpf_sock" will not be directly
> > > accessible through the "__sk_buff->sk" pointer.  Access is limited
> > > by the new "bpf_sock_common_is_valid_access()".
> > > e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
> > >      are not allowed.
> > > 
> > > The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> > > can be used to get a sk with all accessible fields in "bpf_sock".
> > > This helper is added to both cg_skb and sched_(cls|act).
> > > 
> > > int cg_skb_foo(struct __sk_buff *skb) {
> > > 	struct bpf_sock *sk;
> > > 	__u32 family;
> > > 
> > > 	sk = skb->sk;
> > > 	if (!sk)
> > > 		return 1;
> > > 
> > > 	sk = bpf_sk_fullsock(sk);
> > > 	if (!sk)
> > > 		return 1;
> > > 
> > > 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> > > 		return 1;
> > > 
> > > 	/* some_traffic_shaping(); */
> > > 
> > > 	return 1;
> > > }
> > > 
> > > (1) The sk is read only
> > > 
> > > (2) There is no new "struct bpf_sock_common" introduced.
> > > 
> > > (3) Future kernel sock's members could be added to bpf_sock only
> > >     instead of repeatedly adding at multiple places like currently
> > >     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> > > 
> > > (4) After "sk = skb->sk", the reg holding sk is in type
> > >     PTR_TO_SOCK_COMMON_OR_NULL.
> > > 
> > > (5) After bpf_sk_fullsock(), the return type will be in type
> > >     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
> > >     bpf_sk_lookup_xxx().
> > > 
> > >     However, bpf_sk_fullsock() does not take a refcnt.
> > >     acquire_reference_state() currently depends only on the return
> > >     type.  To avoid acquiring a reference here, a new
> > >     is_acquire_function() is checked before calling
> > >     acquire_reference_state().
> > 
> > Bit unfortunate that a helper like bpf_sk_fullsock() would be needed, after
> > all this is more of an implementation detail which we would expose here to
> > the developer.
> > 
> > Is there a specific reason why fetching skb->sk couldn't already be of the
> > type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
> > needed and most logic we have today could already be reused (modulo refcnt
> > avoidance)?
> Not every running context has a fullsock (PTR_TO_SOCKET_OR_NULL).
> 
> Based on how sk_to_full_sk() is used (e.g. in bpf_get_socket_uid()),
> not sure a sk (e.g. tw sock) can always be traced back to a full sk.
> 
> In terms of the patch implementation, there is not much difference.  It is
> a bit simpler without bpf_sk_fullsock() and PTR_TO_SOCK_COMMON(_OR_NULL)
> but not by a lot.  I have tried both.
> 
> The "fullsock" has already been exposed in another form.
> e.g. In sock_ops, the tcp_sock fields are not read if it is not a fullsock
> while other sock_common fields will still be available.  The bpf_prog
> can test sock_ops->is_fullsock to decide what to do.
> 
> > 
> > In particular, do you need the skb->sk without the full-sk part somewhere
> > (e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
> > helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
> > parent where you could also access remaining bpf_sock fields?
> I am thinking more about the case where the bpf_prog only needs the
> fields from sock_common (e.g. the src/dst ip/port) and skb already has
> the other needed info (e.g. protocol/mark/priority).
> Enforcing that skb->sk must be a fullsock will unnecessarily prevent
> those bpf_prog from seeing all skb.
> 
> A "struct bpf_common_sock" could be added instead vs a bpf_sk_fullsock()
> tester.  I think having one "struct bpf_sock" is better and less confusing.
> 
> Later, for the running context that is sure to have a fullsock,
> skb->sk can directly have PTR_TO_SOCKET_OR_NULL instead of
> PTR_TO_SOCK_COMMON_OR_NULL.
> 
Following up on the discussion in the iovisor conf call.

One of the discussion points was:
other than tw, can __sk_buff->sk always return a
fullsock (PTR_TO_SOCKET_OR_NULL)?  In the request_sock case,
it is doable because it can be traced back to the listener sock.

However, that will go back to the sock_common accessing question.
In particular, how to access the sock_common's fields of the
request_sock itself?  Those fields in the request_sock are different
from its listener sock.  e.g. the skc_daddr and skc_dport.
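To make the request_sock point concrete, here is a rough userspace
sketch (not kernel code).  All rq_* names are made up for illustration
and the structs are heavily simplified; it only shows that tracing a
request sock back to its listener loses the per-connection
skc_daddr/skc_dport values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model, not the kernel's struct sock_common. */
struct rq_sock_common {
	uint32_t skc_daddr;	/* remote address */
	uint16_t skc_dport;	/* remote port */
};

/* Simplified listener: only carries its own sock_common part. */
struct rq_listener {
	struct rq_sock_common common;
};

/* Simplified request sock: its own sock_common plus a parent pointer. */
struct rq_request_sock {
	struct rq_sock_common common;		/* per-connection values */
	const struct rq_listener *listener;	/* parent listening sock */
};

/* "Always return a fullsock": follow the pointer back to the listener. */
static inline const struct rq_sock_common *
rq_to_listener_common(const struct rq_request_sock *req)
{
	return (req && req->listener) ? &req->listener->common : NULL;
}

/* What a prog interested in the peer needs: the request sock's own fields. */
static inline const struct rq_sock_common *
rq_own_common(const struct rq_request_sock *req)
{
	return req ? &req->common : NULL;
}
```

With a listener whose skc_dport is 0 and a request sock whose skc_dport
is 40000, rq_to_listener_common() yields 0 while rq_own_common() yields
40000, so the two views are not interchangeable.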

Also, if the sock_common fields of tw are needed, it will become weird
because likely a new "struct bpf_tw_sock" is needed, which is OK,
but all sock_common fields would need to be copied from bpf_sock
to bpf_tw_sock.

I think reading a sk from a ctx should return the
most basic type PTR_TO_SOCK_COMMON_OR_NULL (unless the running
ctx can guarantee that it always has a fullsock).
Currently, it is __sk_buff->sk.  Later, sock_ops->sk...etc.
Keep one single 'struct bpf_sock' and limit fullsock field access
at verification time.  The bpf_prog then moves down the chain
based on what it needs.  It could be fullsock, tcp_sock...etc.
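The "move down the chain" idea can be modelled in plain userspace C as
below.  This is only a sketch under my own assumptions: the model_*
names, the three states and the field set are invented for illustration
and are not the uapi.  model_sk_fullsock() stands in for
bpf_sk_fullsock(): it returns NULL for request/timewait minisocks and
the full sock otherwise, the same idea as sk_fullsock() in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical states; only the full vs. mini sock distinction matters. */
enum model_sk_state {
	MODEL_ESTABLISHED,
	MODEL_NEW_SYN_RECV,	/* request sock */
	MODEL_TIME_WAIT,	/* timewait sock */
};

/* Fields every kind of sock has (the "sock_common" level). */
struct model_sock_common {
	uint32_t daddr;
	uint16_t dport;
	enum model_sk_state state;
};

/* A full sock starts with the common part and adds more fields. */
struct model_sock {
	struct model_sock_common common;
	uint32_t mark;
	uint32_t priority;
};

static inline bool model_is_fullsock(const struct model_sock_common *skc)
{
	return skc->state != MODEL_NEW_SYN_RECV &&
	       skc->state != MODEL_TIME_WAIT;
}

/* Narrow a common pointer to a full sock, or return NULL if it is not one. */
static inline const struct model_sock *
model_sk_fullsock(const struct model_sock_common *skc)
{
	if (!skc || !model_is_fullsock(skc))
		return NULL;
	/* Valid only because 'common' is the first member of model_sock. */
	return (const struct model_sock *)skc;
}

/*
 * Returns 0 if there is no sk, 1 if only the common fields could be read,
 * 2 if the fullsock fields could be read as well.
 */
static inline int model_prog(const struct model_sock_common *skc,
			     uint16_t *dport, uint32_t *mark)
{
	const struct model_sock *sk;

	if (!skc)
		return 0;
	*dport = skc->dport;		/* always readable */

	sk = model_sk_fullsock(skc);
	if (!sk)
		return 1;
	*mark = sk->mark;		/* fullsock only */
	return 2;
}
```

A prog that only needs dport still works on a timewait sock (return
value 1), and only the prog that wants mark has to pass the fullsock
check first.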

I think that will be the most flexible way to write bpf_prog
while also avoiding duplicate fields in different
bpf structs in uapi.