Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Martin Lau <kafai@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
	"bpf\@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev\@vger.kernel.org" <netdev@vger.kernel.org>,
	"kernel-team\@cloudflare.com" <kernel-team@cloudflare.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
Date: Thu, 12 Dec 2019 12:27:19 +0100	[thread overview]
Message-ID: <87pngtg4x4.fsf@cloudflare.com> (raw)
In-Reply-To: <20191211172051.clnwh5n5vdeovayy@kafai-mbp>

On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
> On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
>> John, Martin,
>>
>> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
>> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
>> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> >>> > [ ... ]
>> >>> >
>> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> >>> >>  			sk->sk_prot = psock->sk_proto;
>> >>> >>  		psock->sk_proto = NULL;
>> >>> >>  	}
>> >>> >> +
>> >>> >> +	if (psock->icsk_af_ops) {
>> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> >>> >> +		psock->icsk_af_ops = NULL;
>> >>> >> +	}
>> >>> >>  }
>> >>> >
>> >>> > [ ... ]
>> >>> >
>> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> >>> >> +					  struct sk_buff *skb,
>> >>> >> +					  struct request_sock *req,
>> >>> >> +					  struct dst_entry *dst,
>> >>> >> +					  struct request_sock *req_unhash,
>> >>> >> +					  bool *own_req)
>> >>> >> +{
>> >>> >> +	const struct inet_connection_sock_af_ops *ops;
>> >>> >> +	void (*write_space)(struct sock *sk);
>> >>> >> +	struct sk_psock *psock;
>> >>> >> +	struct proto *proto;
>> >>> >> +	struct sock *child;
>> >>> >> +
>> >>> >> +	rcu_read_lock();
>> >>> >> +	psock = sk_psock(sk);
>> >>> >> +	if (likely(psock)) {
>> >>> >> +		proto = psock->sk_proto;
>> >>> >> +		write_space = psock->saved_write_space;
>> >>> >> +		ops = psock->icsk_af_ops;
>> >>> > It is not immediately clear to me what ensure
>> >>> > ops is not NULL here.
>> >>> >
>> >>> > It is likely I missed something.  A short comment would
>> >>> > be very useful here.
>> >>>
>> >>> I can see the readability problem. Looking at it now, perhaps it should
>> >>> be rewritten, to the same effect, as:
>> >>>
>> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
>> >>> {
>> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
>> >>>         ...
>> >>>
>> >>>         rcu_read_lock();
>> >>> 	psock = sk_psock(sk);
>> >>> 	if (likely(psock)) {
>> >>> 		proto = psock->sk_proto;
>> >>> 		write_space = psock->saved_write_space;
>> >>> 		ops = psock->icsk_af_ops;
>> >>> 	}
>> >>> 	rcu_read_unlock();
>> >>>
>> >>>         if (!ops)
>> >>> 		ops = inet_csk(sk)->icsk_af_ops;
>> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>> >>>
>> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> >>> properly. To double check what happens here:
>> >> I did not mean the init path.  The init path is fine since it init
>> >> eveything on psock before publishing the sk to the sock_map.
>> >>
>> >> I was thinking the delete path (e.g. sock_map_delete_elem).  It is not clear
>> >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
>> >> psock->icsk_af_ops to NULL from running in parallel with
>> >> tcp_bpf_syn_recv_sock()?  An explanation would be useful.
>> >
>> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
>> >
>> > Setting psock->icsk_af_ops to null on restore and not checking for it
>> > here was a bad move on my side.  Also I need to revisit what to do about
>> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
>> >
>> > This race should be easy enough to trigger. Will give it a shot.
>>
>> I've convinced myself that this approach is racy beyond repair.
>>
>> Once syn_recv_sock() has returned it is too late to reset the child
>> sk_user_data and restore its callbacks. It has been already inserted
>> into ehash and ingress path can invoke its callbacks.
>>
>> The race can be triggered with with a reproducer where:
>>
>> thread-1:
>>
>>         p = accept(s, ...);
>>         close(p);
>>
>> thread-2:
>>
>> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
>> 	bpf_map_delete_elem(mapfd, &key);
>>
>> This a dead-end because we can't have the parent and the child share the
>> psock state. Even though psock itself is refcounted, and potentially we
>> could grab a reference before cloning the parent, link into the map that
>> psock holds is not.
>>
>> Two ways out come to mind. Both involve touching TCP code, which I was
>> hoping to avoid:
>>
>> 1) reset sk_user_data when initializing the child
>>
>>    This is problematic because tcp_bpf callbacks are not designed to
>>    handle sockets with no psock _and_ with overridden sk_prot
>>    callbacks. (Although, I think they could if the fallback was directly
>>    on {tcp,tcpv6}_prot based on socket domain.)
>>
>>    Also, there are other sk_user_data users like DRBD which rely on
>>    sharing the sk_user_data pointer between parent and child, if I read
>>    the code correctly [0]. If anything, clearing the sk_user_data on
>>    clone would have to be guarded by a flag.
> Can the copy/not-to-copy sk_user_data decision be made in
> sk_clone_lock()?

Yes, this could be pushed down to sk_clone_lock(), where we do similar
work (reset sk_reuseport_cb and clone bpf_sk_storage):

	/* User data can hold reference. Child must not
	 * inherit the pointer without acquiring a reference.
	 */
	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
	}

I belive this would still need to be guarded by a flag.  Do you see
value in clearing child sk_user_data on clone as opposed to dealying
that work until accept() time?

-Jakub

>
>>
>> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
>>
>>    The simpler way out. tcp_bpf callbacks never get invoked on the child
>>    socket so the copied psock reference is no longer a problem. We can
>>    clear the pointer on accept().
>>
>>    So far I wasn't able poke any holes in it and it comes down to
>>    patching tcp_create_openreq_child() with:
>>
>> 	/* sk_msg and ULP frameworks can override the callbacks into
>> 	 * protocol. We don't assume they are intended to be inherited
>> 	 * by the child. Frameworks can re-install the callbacks on
>> 	 * accept() if needed.
>> 	 */
>> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
>>
>>    That's what I'm going with for v2.
>>
>> Open to suggestions.
>>
>> Thanks,
>> Jakub
>>
>> BTW. Reading into kTLS code, I noticed it has been limited down to just
>> established sockets due to the same problem I'm struggling with here:
>>
>> static int tls_init(struct sock *sk)
>> {
>> ...
>> 	/* The TLS ulp is currently supported only for TCP sockets
>> 	 * in ESTABLISHED state.
>> 	 * Supporting sockets in LISTEN state will require us
>> 	 * to modify the accept implementation to clone rather then
>> 	 * share the ulp context.
>> 	 */
>> 	if (sk->sk_state != TCP_ESTABLISHED)
>> 		return -ENOTCONN;
>>
>> [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e=