bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Martin Lau <kafai@fb.com>, Jakub Sitnicki <jakub@cloudflare.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"kernel-team@cloudflare.com" <kernel-team@cloudflare.com>,
	John Fastabend <john.fastabend@gmail.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
Date: Tue, 26 Nov 2019 10:43:50 -0800	[thread overview]
Message-ID: <5ddd7266c36aa_671a2b0b882605c04a@john-XPS-13-9370.notmuch> (raw)
In-Reply-To: <20191126171607.pzrg5qhbavh7enwh@kafai-mbp.dhcp.thefacebook.com>

Martin Lau wrote:
> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> > > [ ... ]
> > >
> > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> > >>  			sk->sk_prot = psock->sk_proto;
> > >>  		psock->sk_proto = NULL;
> > >>  	}
> > >> +
> > >> +	if (psock->icsk_af_ops) {
> > >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> > >> +		psock->icsk_af_ops = NULL;
> > >> +	}
> > >>  }
> > >
> > > [ ... ]
> > >
> > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> > >> +					  struct sk_buff *skb,
> > >> +					  struct request_sock *req,
> > >> +					  struct dst_entry *dst,
> > >> +					  struct request_sock *req_unhash,
> > >> +					  bool *own_req)
> > >> +{
> > >> +	const struct inet_connection_sock_af_ops *ops;
> > >> +	void (*write_space)(struct sock *sk);
> > >> +	struct sk_psock *psock;
> > >> +	struct proto *proto;
> > >> +	struct sock *child;
> > >> +
> > >> +	rcu_read_lock();
> > >> +	psock = sk_psock(sk);
> > >> +	if (likely(psock)) {
> > >> +		proto = psock->sk_proto;
> > >> +		write_space = psock->saved_write_space;
> > >> +		ops = psock->icsk_af_ops;
> > > It is not immediately clear to me what ensure
> > > ops is not NULL here.
> > >
> > > It is likely I missed something.  A short comment would
> > > be very useful here.
> > 
> > I can see the readability problem. Looking at it now, perhaps it should
> > be rewritten, to the same effect, as:
> > 
> > static struct sock *tcp_bpf_syn_recv_sock(...)
> > {
> > 	const struct inet_connection_sock_af_ops *ops = NULL;
> >         ...
> > 
> >     rcu_read_lock();
> > 	psock = sk_psock(sk);
> > 	if (likely(psock)) {
> > 		proto = psock->sk_proto;
> > 		write_space = psock->saved_write_space;
> > 		ops = psock->icsk_af_ops;
> > 	}
> > 	rcu_read_unlock();
> > 
> >         if (!ops)
> > 		ops = inet_csk(sk)->icsk_af_ops;
> >         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> > 
> > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> > properly. To double check what happens here:
> I did not mean the init path.  The init path is fine since it init
> eveything on psock before publishing the sk to the sock_map.
> 
> I was thinking the delete path (e.g. sock_map_delete_elem).  It is not clear
> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
> psock->icsk_af_ops to NULL from running in parallel with
> tcp_bpf_syn_recv_sock()?  An explanation would be useful.
> 

I'll answer. Updates are protected via sk_callback_lock so we don't have
parrallel updates in-flight causing write_space and sk_proto to be out
of sync. However access should be OK because its a pointer write we
never update the pointer in place, e.g.

static inline void sk_psock_restore_proto(struct sock *sk,
					  struct sk_psock *psock)
{
+       struct inet_connection_sock *icsk = inet_csk(sk);
+
	sk->sk_write_space = psock->saved_write_space;

	if (psock->sk_proto) {
		struct inet_connection_sock *icsk = inet_csk(sk);
		bool has_ulp = !!icsk->icsk_ulp_data;

		if (has_ulp)
			tcp_update_ulp(sk, psock->sk_proto);
		else
			sk->sk_prot = psock->sk_proto;
		psock->sk_proto = NULL;
	}

+
+       if (psock->icsk_af_ops) {
+               icsk->icsk_af_ops = psock->icsk_af_ops;
+               psock->icsk_af_ops = NULL;
+       }
}

In restore case either psock->icsk_af_ops is null or not. If its
null below code catches it. If its not null (from init path) then
we have a valid pointer.

        rcu_read_lock();
	psock = sk_psock(sk);
 	if (likely(psock)) {
 		proto = psock->sk_proto;
 		write_space = psock->saved_write_space;
 		ops = psock->icsk_af_ops;
 	}
 	rcu_read_unlock();
 
        if (!ops)
		ops = inet_csk(sk)->icsk_af_ops;
        child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);


We should do this with proper READ_ONCE/WRITE_ONCE to make it clear
what is going on and to stop compiler from breaking these assumptions. I
was going to generate that patch after this series but can do it before
as well. I didn't mention it here because it seems a bit out of scope
for this series because its mostly a fix to older code.

Also I started to think that write_space might be out of sync with ops but
it seems we never actually remove psock_write_space until after
rcu grace period so that should be OK as well and always point to the
previous write_space.

Finally I wondered if we could remove the ops and then add it back
quickly which seems at least in theory possible, but that would get
hit with a grace period because we can't have conflicting psock
definitions on the same sock. So expanding the rcu block to include
the ops = inet_csk(sk)->icsk_af_ops would fix that case.

So in summary I think we should expand the rcu lock here to include the
ops = inet_csk(sk)->icsk_af_ops to ensure we dont race with tear
down and create. I'll push the necessary update with WRITE_ONCE and
READ_ONCE to fix that up. Seeing we have to wait until the merge
window opens most likely anyways I'll send those out sooner rather
then later and this series can add the proper annotations as well.

  parent reply	other threads:[~2019-11-26 18:44 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2019-11-24  5:32   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2019-11-24  5:35   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
2019-11-24  5:38   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
2019-11-24  5:56   ` John Fastabend
2019-11-25 22:38   ` Martin Lau
2019-11-26 15:54     ` Jakub Sitnicki
2019-11-26 17:16       ` Martin Lau
2019-11-26 18:36         ` Jakub Sitnicki
     [not found]           ` <87sglsfdda.fsf@cloudflare.com>
2019-12-11 17:20             ` Martin Lau
2019-12-12 11:27               ` Jakub Sitnicki
2019-12-12 19:23                 ` Martin Lau
2019-12-17 15:06                   ` Jakub Sitnicki
2019-11-26 18:43         ` John Fastabend [this message]
2019-11-27 22:18           ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-25  1:24   ` Alexei Starovoitov
2019-11-25  4:17     ` John Fastabend
2019-11-25 10:40       ` Jakub Sitnicki
2019-11-25 22:07         ` Martin Lau
2019-11-26 14:30           ` Jakub Sitnicki
2019-11-26 19:03             ` Martin Lau
2019-11-27 21:34               ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2019-11-24  6:00   ` John Fastabend
2019-11-25 22:30   ` Martin Lau
2019-11-26 14:32     ` Jakub Sitnicki
2019-12-12 10:30     ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2019-11-24  6:04   ` John Fastabend
2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
2019-11-25  9:22   ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5ddd7266c36aa_671a2b0b882605c04a@john-XPS-13-9370.notmuch \
    --to=john.fastabend@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --subject='Re: [PATCH bpf-next 4/8] bpf, sockmap: Don'\''t let child socket inherit psock or its ops on copy' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).