bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Martin Lau <kafai@fb.com>
To: unlisted-recipients:; (no To-header on input)
Cc: John Fastabend <john.fastabend@gmail.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"kernel-team@cloudflare.com" <kernel-team@cloudflare.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
Date: Thu, 12 Dec 2019 19:23:58 +0000	[thread overview]
Message-ID: <20191212192354.umerwea5z4fpwbkq@kafai-mbp> (raw)
In-Reply-To: <87pngtg4x4.fsf@cloudflare.com>

On Thu, Dec 12, 2019 at 12:27:19PM +0100, Jakub Sitnicki wrote:
> On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
> > On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
> >> John, Martin,
> >>
> >> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
> >> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
> >> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> >> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> >> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> >> >>> > [ ... ]
> >> >>> >
> >> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >> >>> >>  			sk->sk_prot = psock->sk_proto;
> >> >>> >>  		psock->sk_proto = NULL;
> >> >>> >>  	}
> >> >>> >> +
> >> >>> >> +	if (psock->icsk_af_ops) {
> >> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> >> >>> >> +		psock->icsk_af_ops = NULL;
> >> >>> >> +	}
> >> >>> >>  }
> >> >>> >
> >> >>> > [ ... ]
> >> >>> >
> >> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> >> >>> >> +					  struct sk_buff *skb,
> >> >>> >> +					  struct request_sock *req,
> >> >>> >> +					  struct dst_entry *dst,
> >> >>> >> +					  struct request_sock *req_unhash,
> >> >>> >> +					  bool *own_req)
> >> >>> >> +{
> >> >>> >> +	const struct inet_connection_sock_af_ops *ops;
> >> >>> >> +	void (*write_space)(struct sock *sk);
> >> >>> >> +	struct sk_psock *psock;
> >> >>> >> +	struct proto *proto;
> >> >>> >> +	struct sock *child;
> >> >>> >> +
> >> >>> >> +	rcu_read_lock();
> >> >>> >> +	psock = sk_psock(sk);
> >> >>> >> +	if (likely(psock)) {
> >> >>> >> +		proto = psock->sk_proto;
> >> >>> >> +		write_space = psock->saved_write_space;
> >> >>> >> +		ops = psock->icsk_af_ops;
> >> >>> > It is not immediately clear to me what ensure
> >> >>> > ops is not NULL here.
> >> >>> >
> >> >>> > It is likely I missed something.  A short comment would
> >> >>> > be very useful here.
> >> >>>
> >> >>> I can see the readability problem. Looking at it now, perhaps it should
> >> >>> be rewritten, to the same effect, as:
> >> >>>
> >> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
> >> >>> {
> >> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
> >> >>>         ...
> >> >>>
> >> >>>         rcu_read_lock();
> >> >>> 	psock = sk_psock(sk);
> >> >>> 	if (likely(psock)) {
> >> >>> 		proto = psock->sk_proto;
> >> >>> 		write_space = psock->saved_write_space;
> >> >>> 		ops = psock->icsk_af_ops;
> >> >>> 	}
> >> >>> 	rcu_read_unlock();
> >> >>>
> >> >>>         if (!ops)
> >> >>> 		ops = inet_csk(sk)->icsk_af_ops;
> >> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> >> >>>
> >> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> >> >>> properly. To double check what happens here:
> >> >> I did not mean the init path.  The init path is fine since it init
> >> >> eveything on psock before publishing the sk to the sock_map.
> >> >>
> >> >> I was thinking the delete path (e.g. sock_map_delete_elem).  It is not clear
> >> >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
> >> >> psock->icsk_af_ops to NULL from running in parallel with
> >> >> tcp_bpf_syn_recv_sock()?  An explanation would be useful.
> >> >
> >> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
> >> >
> >> > Setting psock->icsk_af_ops to null on restore and not checking for it
> >> > here was a bad move on my side.  Also I need to revisit what to do about
> >> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
> >> >
> >> > This race should be easy enough to trigger. Will give it a shot.
> >>
> >> I've convinced myself that this approach is racy beyond repair.
> >>
> >> Once syn_recv_sock() has returned it is too late to reset the child
> >> sk_user_data and restore its callbacks. It has been already inserted
> >> into ehash and ingress path can invoke its callbacks.
> >>
> >> The race can be triggered with with a reproducer where:
> >>
> >> thread-1:
> >>
> >>         p = accept(s, ...);
> >>         close(p);
> >>
> >> thread-2:
> >>
> >> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
> >> 	bpf_map_delete_elem(mapfd, &key);
> >>
> >> This a dead-end because we can't have the parent and the child share the
> >> psock state. Even though psock itself is refcounted, and potentially we
> >> could grab a reference before cloning the parent, link into the map that
> >> psock holds is not.
> >>
> >> Two ways out come to mind. Both involve touching TCP code, which I was
> >> hoping to avoid:
> >>
> >> 1) reset sk_user_data when initializing the child
> >>
> >>    This is problematic because tcp_bpf callbacks are not designed to
> >>    handle sockets with no psock _and_ with overridden sk_prot
> >>    callbacks. (Although, I think they could if the fallback was directly
> >>    on {tcp,tcpv6}_prot based on socket domain.)
> >>
> >>    Also, there are other sk_user_data users like DRBD which rely on
> >>    sharing the sk_user_data pointer between parent and child, if I read
> >>    the code correctly [0]. If anything, clearing the sk_user_data on
> >>    clone would have to be guarded by a flag.
> > Can the copy/not-to-copy sk_user_data decision be made in
> > sk_clone_lock()?
> 
> Yes, this could be pushed down to sk_clone_lock(), where we do similar
> work (reset sk_reuseport_cb and clone bpf_sk_storage):
aha.  I missed your eariler "clearing the sk_user_data on clone would have
to be guarded by a flag..." part.  It turns out we were talking the same
thing on (1).  sock_flag works better if there is still bit left (and it
seems there is one),  although I was thinking more like adding
something (e.g. a func ptr) to 'struct proto' to mangle sk_user_data
before returning newsk....but not sure this kind of logic
belongs to 'struct proto'

> 
> 	/* User data can hold reference. Child must not
> 	 * inherit the pointer without acquiring a reference.
> 	 */
> 	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
> 		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
> 		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
> 	}
> 
> I belive this would still need to be guarded by a flag.  Do you see
> value in clearing child sk_user_data on clone as opposed to dealying
> that work until accept() time?
It seems to me clearing things up front at the very beginning is more
straight forward, such that it does not have to worry about the
sk_user_data may be used in a wrong way before it gets a chance
to be cleared in accept().

Just something to consider, if it is obvious that there is no hole in
clearing it in accept(), it is fine too.

> >>
> >> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
> >>
> >>    The simpler way out. tcp_bpf callbacks never get invoked on the child
> >>    socket so the copied psock reference is no longer a problem. We can
> >>    clear the pointer on accept().
> >>
> >>    So far I wasn't able poke any holes in it and it comes down to
> >>    patching tcp_create_openreq_child() with:
> >>
> >> 	/* sk_msg and ULP frameworks can override the callbacks into
> >> 	 * protocol. We don't assume they are intended to be inherited
> >> 	 * by the child. Frameworks can re-install the callbacks on
> >> 	 * accept() if needed.
> >> 	 */
> >> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
> >>
> >>    That's what I'm going with for v2.
> >>
> >> Open to suggestions.
> >>
> >> Thanks,
> >> Jakub
> >>
> >> BTW. Reading into kTLS code, I noticed it has been limited down to just
> >> established sockets due to the same problem I'm struggling with here:
> >>
> >> static int tls_init(struct sock *sk)
> >> {
> >> ...
> >> 	/* The TLS ulp is currently supported only for TCP sockets
> >> 	 * in ESTABLISHED state.
> >> 	 * Supporting sockets in LISTEN state will require us
> >> 	 * to modify the accept implementation to clone rather then
> >> 	 * share the ulp context.
> >> 	 */
> >> 	if (sk->sk_state != TCP_ESTABLISHED)
> >> 		return -ENOTCONN;
> >>
> >> [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e=

  reply	other threads:[~2019-12-12 19:24 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2019-11-24  5:32   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2019-11-24  5:35   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
2019-11-24  5:38   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
2019-11-24  5:56   ` John Fastabend
2019-11-25 22:38   ` Martin Lau
2019-11-26 15:54     ` Jakub Sitnicki
2019-11-26 17:16       ` Martin Lau
2019-11-26 18:36         ` Jakub Sitnicki
     [not found]           ` <87sglsfdda.fsf@cloudflare.com>
2019-12-11 17:20             ` Martin Lau
2019-12-12 11:27               ` Jakub Sitnicki
2019-12-12 19:23                 ` Martin Lau [this message]
2019-12-17 15:06                   ` Jakub Sitnicki
2019-11-26 18:43         ` John Fastabend
2019-11-27 22:18           ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-25  1:24   ` Alexei Starovoitov
2019-11-25  4:17     ` John Fastabend
2019-11-25 10:40       ` Jakub Sitnicki
2019-11-25 22:07         ` Martin Lau
2019-11-26 14:30           ` Jakub Sitnicki
2019-11-26 19:03             ` Martin Lau
2019-11-27 21:34               ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2019-11-24  6:00   ` John Fastabend
2019-11-25 22:30   ` Martin Lau
2019-11-26 14:32     ` Jakub Sitnicki
2019-12-12 10:30     ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2019-11-24  6:04   ` John Fastabend
2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
2019-11-25  9:22   ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191212192354.umerwea5z4fpwbkq@kafai-mbp \
    --to=kafai@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kernel-team@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --subject='Re: [PATCH bpf-next 4/8] bpf, sockmap: Don'\''t let child socket inherit psock or its ops on copy' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).