netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Martin Lau <kafai@fb.com>
Cc: "bpf\@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev\@vger.kernel.org" <netdev@vger.kernel.org>,
	"kernel-team\@cloudflare.com" <kernel-team@cloudflare.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"John Fastabend" <john.fastabend@gmail.com>,
	Lorenz Bauer <lmb@cloudflare.com>
Subject: Re: [PATCH bpf-next v3 10/12] net: Generate reuseport group ID on group creation
Date: Thu, 23 Jan 2020 11:59:46 +0100	[thread overview]
Message-ID: <875zh230bh.fsf@cloudflare.com> (raw)
In-Reply-To: <20200122225351.hajnt4u7au24mj5g@kafai-mbp.dhcp.thefacebook.com>

On Wed, Jan 22, 2020 at 11:53 PM CET, Martin Lau wrote:
> On Wed, Jan 22, 2020 at 02:05:47PM +0100, Jakub Sitnicki wrote:
>> Commit 736b46027eb4 ("net: Add ID (if needed) to sock_reuseport and expose
>> reuseport_lock") has introduced lazy generation of reuseport group IDs that
>> survive group resize.
>> 
>> By comparing the identifier we check if BPF reuseport program is not trying
>> to select a socket from a BPF map that belongs to a different reuseport
>> group than the one the packet is for.
>> 
>> Because SOCKARRAY used to be the only BPF map type that can be used with
>> reuseport BPF, it was possible to delay the generation of reuseport group
>> ID until a socket from the group was inserted into BPF map for the first
>> time.
>> 
>> Now that SOCKMAP can be used with reuseport BPF we have two options, either
>> generate the reuseport ID on map update, like SOCKARRAY does, or allocate
>> an ID from the start when reuseport group gets created.
>> 
>> This patch goes the latter approach to keep SOCKMAP free of calls into
>> reuseport code. This streamlines the reuseport_id access as its lifetime
>> now matches the longevity of reuseport object.
>> 
>> The cost of this simplification, however, is that we allocate reuseport IDs
>> for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
>> setups. With the way identifiers are currently generated, we can have at
>> most S32_MAX reuseport groups, which hopefully is sufficient.
> Not sure if it would be a concern.  I think it is good as is.
> For TCP, that would mean billion different ip:port listening socks
> in inet_hashinfo.
>
> If it came to that, another idea is to use a 64bit reuseport_id which
> practically won't wrap around.  It could use the very first sk->sk_cookie
> as the reuseport_id.  All the ida logic will go away also in the expense
> of +4 bytes.

Thanks for the idea. I'll add it to the patch description.

>
>> 
>> Another change is that we now always call into SOCKARRAY logic to unlink
>> the socket from the map when unhashing or closing the socket. Previously we
>> did it only when at least one socket from the group was in a BPF map.
>> 
>> It is worth noting that this doesn't conflict with SOCKMAP tear-down in
>> case a socket is in a SOCKMAP and belongs to a reuseport group. SOCKMAP
>> tear-down happens first:
>> 
>>   prot->unhash
>>   `- tcp_bpf_unhash
>>      |- tcp_bpf_remove
>>      |  `- while (sk_psock_link_pop(psock))
>>      |     `- sk_psock_unlink
>>      |        `- sock_map_delete_from_link
>>      |           `- __sock_map_delete
>>      |              `- sock_map_unref
>>      |                 `- sk_psock_put
>>      |                    `- sk_psock_drop
>>      |                       `- rcu_assign_sk_user_data(sk, NULL)
>>      `- inet_unhash
>>         `- reuseport_detach_sock
>>            `- bpf_sk_reuseport_detach
>>               `- WRITE_ONCE(sk->sk_user_data, NULL)
> Thanks for the details.
>
> [ ... ]
>
>> @@ -200,12 +189,10 @@ void reuseport_detach_sock(struct sock *sk)
>>  	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
>>  					  lockdep_is_held(&reuseport_lock));
>>  
>> -	/* At least one of the sk in this reuseport group is added to
>> -	 * a bpf map.  Notify the bpf side.  The bpf map logic will
>> -	 * remove the sk if it is indeed added to a bpf map.
>> +	/* Notify the bpf side. The sk may be added to bpf map. The
>> +	 * bpf map logic will remove the sk from the map if indeed.
> s/indeed/needed/ ?
>
> I think it will be good to have a few words here like, that is needed
> by sockarray but not necessary for sockmap which has its own ->unhash
> to remove itself from the map.


Yeah, I didn't do it justice. Will expand the comment.

[...]

  reply	other threads:[~2020-01-23 10:59 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-22 13:05 [PATCH bpf-next v3 00/12] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 01/12] bpf, sk_msg: Don't clear saved sock proto on restore Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 02/12] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
2020-01-22 22:57   ` Martin Lau
2020-01-22 13:05 ` [PATCH bpf-next v3 03/12] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 04/12] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
2020-01-22 20:35   ` Martin Lau
2020-01-23 10:34     ` Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 05/12] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
2020-01-22 20:52   ` Martin Lau
2020-01-23 10:41     ` Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 06/12] bpf, sockmap: Don't set up sockmap progs for listening sockets Jakub Sitnicki
2020-01-22 16:24   ` Lorenz Bauer
2020-01-22 18:07     ` Jakub Sitnicki
2020-01-22 23:11   ` Martin Lau
2020-01-22 13:05 ` [PATCH bpf-next v3 07/12] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 08/12] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2020-01-22 23:02   ` Martin Lau
2020-01-22 13:05 ` [PATCH bpf-next v3 09/12] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2020-01-22 23:08   ` Martin Lau
2020-01-22 13:05 ` [PATCH bpf-next v3 10/12] net: Generate reuseport group ID on group creation Jakub Sitnicki
2020-01-22 22:53   ` Martin Lau
2020-01-23 10:59     ` Jakub Sitnicki [this message]
2020-01-22 13:05 ` [PATCH bpf-next v3 11/12] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2020-01-22 13:05 ` [PATCH bpf-next v3 12/12] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875zh230bh.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@cloudflare.com \
    --cc=lmb@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).