bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Martin Lau <kafai@fb.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"kernel-team@cloudflare.com" <kernel-team@cloudflare.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
Date: Wed, 11 Dec 2019 17:20:54 +0000	[thread overview]
Message-ID: <20191211172051.clnwh5n5vdeovayy@kafai-mbp> (raw)
In-Reply-To: <87sglsfdda.fsf@cloudflare.com>

On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
> John, Martin,
> 
> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> >>> > [ ... ]
> >>> >
> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >>> >>  			sk->sk_prot = psock->sk_proto;
> >>> >>  		psock->sk_proto = NULL;
> >>> >>  	}
> >>> >> +
> >>> >> +	if (psock->icsk_af_ops) {
> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> >>> >> +		psock->icsk_af_ops = NULL;
> >>> >> +	}
> >>> >>  }
> >>> >
> >>> > [ ... ]
> >>> >
> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> >>> >> +					  struct sk_buff *skb,
> >>> >> +					  struct request_sock *req,
> >>> >> +					  struct dst_entry *dst,
> >>> >> +					  struct request_sock *req_unhash,
> >>> >> +					  bool *own_req)
> >>> >> +{
> >>> >> +	const struct inet_connection_sock_af_ops *ops;
> >>> >> +	void (*write_space)(struct sock *sk);
> >>> >> +	struct sk_psock *psock;
> >>> >> +	struct proto *proto;
> >>> >> +	struct sock *child;
> >>> >> +
> >>> >> +	rcu_read_lock();
> >>> >> +	psock = sk_psock(sk);
> >>> >> +	if (likely(psock)) {
> >>> >> +		proto = psock->sk_proto;
> >>> >> +		write_space = psock->saved_write_space;
> >>> >> +		ops = psock->icsk_af_ops;
> >>> > It is not immediately clear to me what ensure
> >>> > ops is not NULL here.
> >>> >
> >>> > It is likely I missed something.  A short comment would
> >>> > be very useful here.
> >>>
> >>> I can see the readability problem. Looking at it now, perhaps it should
> >>> be rewritten, to the same effect, as:
> >>>
> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
> >>> {
> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
> >>>         ...
> >>>
> >>>         rcu_read_lock();
> >>> 	psock = sk_psock(sk);
> >>> 	if (likely(psock)) {
> >>> 		proto = psock->sk_proto;
> >>> 		write_space = psock->saved_write_space;
> >>> 		ops = psock->icsk_af_ops;
> >>> 	}
> >>> 	rcu_read_unlock();
> >>>
> >>>         if (!ops)
> >>> 		ops = inet_csk(sk)->icsk_af_ops;
> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> >>>
> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> >>> properly. To double check what happens here:
> >> I did not mean the init path.  The init path is fine since it init
> >> eveything on psock before publishing the sk to the sock_map.
> >>
> >> I was thinking the delete path (e.g. sock_map_delete_elem).  It is not clear
> >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
> >> psock->icsk_af_ops to NULL from running in parallel with
> >> tcp_bpf_syn_recv_sock()?  An explanation would be useful.
> >
> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
> >
> > Setting psock->icsk_af_ops to null on restore and not checking for it
> > here was a bad move on my side.  Also I need to revisit what to do about
> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
> >
> > This race should be easy enough to trigger. Will give it a shot.
> 
> I've convinced myself that this approach is racy beyond repair.
> 
> Once syn_recv_sock() has returned it is too late to reset the child
> sk_user_data and restore its callbacks. It has been already inserted
> into ehash and ingress path can invoke its callbacks.
> 
> The race can be triggered with with a reproducer where:
> 
> thread-1:
> 
>         p = accept(s, ...);
>         close(p);
> 
> thread-2:
> 
> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
> 	bpf_map_delete_elem(mapfd, &key);
> 
> This a dead-end because we can't have the parent and the child share the
> psock state. Even though psock itself is refcounted, and potentially we
> could grab a reference before cloning the parent, link into the map that
> psock holds is not.
> 
> Two ways out come to mind. Both involve touching TCP code, which I was
> hoping to avoid:
> 
> 1) reset sk_user_data when initializing the child
> 
>    This is problematic because tcp_bpf callbacks are not designed to
>    handle sockets with no psock _and_ with overridden sk_prot
>    callbacks. (Although, I think they could if the fallback was directly
>    on {tcp,tcpv6}_prot based on socket domain.)
> 
>    Also, there are other sk_user_data users like DRBD which rely on
>    sharing the sk_user_data pointer between parent and child, if I read
>    the code correctly [0]. If anything, clearing the sk_user_data on
>    clone would have to be guarded by a flag.
Can the copy/not-to-copy sk_user_data decision be made in
sk_clone_lock()?

> 
> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
> 
>    The simpler way out. tcp_bpf callbacks never get invoked on the child
>    socket so the copied psock reference is no longer a problem. We can
>    clear the pointer on accept().
> 
>    So far I wasn't able poke any holes in it and it comes down to
>    patching tcp_create_openreq_child() with:
> 
> 	/* sk_msg and ULP frameworks can override the callbacks into
> 	 * protocol. We don't assume they are intended to be inherited
> 	 * by the child. Frameworks can re-install the callbacks on
> 	 * accept() if needed.
> 	 */
> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
> 
>    That's what I'm going with for v2.
> 
> Open to suggestions.
> 
> Thanks,
> Jakub
> 
> BTW. Reading into kTLS code, I noticed it has been limited down to just
> established sockets due to the same problem I'm struggling with here:
> 
> static int tls_init(struct sock *sk)
> {
> ...
> 	/* The TLS ulp is currently supported only for TCP sockets
> 	 * in ESTABLISHED state.
> 	 * Supporting sockets in LISTEN state will require us
> 	 * to modify the accept implementation to clone rather then
> 	 * share the ulp context.
> 	 */
> 	if (sk->sk_state != TCP_ESTABLISHED)
> 		return -ENOTCONN;
> 
> [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e= 

  parent reply	other threads:[~2019-12-11 17:21 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2019-11-24  5:32   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2019-11-24  5:35   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
2019-11-24  5:38   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
2019-11-24  5:56   ` John Fastabend
2019-11-25 22:38   ` Martin Lau
2019-11-26 15:54     ` Jakub Sitnicki
2019-11-26 17:16       ` Martin Lau
2019-11-26 18:36         ` Jakub Sitnicki
     [not found]           ` <87sglsfdda.fsf@cloudflare.com>
2019-12-11 17:20             ` Martin Lau [this message]
2019-12-12 11:27               ` Jakub Sitnicki
2019-12-12 19:23                 ` Martin Lau
2019-12-17 15:06                   ` Jakub Sitnicki
2019-11-26 18:43         ` John Fastabend
2019-11-27 22:18           ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-25  1:24   ` Alexei Starovoitov
2019-11-25  4:17     ` John Fastabend
2019-11-25 10:40       ` Jakub Sitnicki
2019-11-25 22:07         ` Martin Lau
2019-11-26 14:30           ` Jakub Sitnicki
2019-11-26 19:03             ` Martin Lau
2019-11-27 21:34               ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2019-11-24  6:00   ` John Fastabend
2019-11-25 22:30   ` Martin Lau
2019-11-26 14:32     ` Jakub Sitnicki
2019-12-12 10:30     ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2019-11-24  6:04   ` John Fastabend
2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
2019-11-25  9:22   ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191211172051.clnwh5n5vdeovayy@kafai-mbp \
    --to=kafai@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=john.fastabend@gmail.com \
    --cc=kernel-team@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).