* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-27 16:47 Christoph Paasch
From: Christoph Paasch @ 2020-05-27 16:47 UTC (permalink / raw)
  To: mptcp


On 05/27/20 - 13:00, Paolo Abeni wrote:
> On Tue, 2020-05-26 at 09:23 -0700, Christoph Paasch wrote:
> > On 05/25/20 - 12:42, Paolo Abeni wrote:
> > > On Fri, 2020-05-22 at 18:10 -0700, Mat Martineau wrote:
> > > > On Fri, 22 May 2020, Paolo Abeni wrote:
> > > > 
> > > > > On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> > > > > > > Replace the radix tree with a hash table allocated
> > > > > > > at boot time. The radix tree has some shortcomings:
> > > > > > > a single lock is contended by all the MPTCP operations,
> > > > > > > the lookup currently uses that lock, and traversing
> > > > > > > all the items would require the lock, too.
> > > > > > >
> > > > > > > With a hash table we instead trade a little memory to
> > > > > > > address all of the above - a per-bucket lock is used.
> > > > > > >
> > > > > > > Additionally, refactor the token creation code to:
> > > > > > >
> > > > > > > - limit the number of consecutive attempts to a fixed
> > > > > > > maximum. Hitting a hash bucket with a long chain is
> > > > > > > considered a failed attempt
> > > > > > >
> > > > > > > - accept() can no longer fail due to token management.
> > > > > > >
> > > > > > > - if token creation fails at connect() time, we now
> > > > > > > fall back to TCP (before, the connection was closed)
> > > > > > > 
> > > > > > > Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> > > > > > > ---
> > > > > > >  net/mptcp/protocol.c |  16 +--
> > > > > > >  net/mptcp/protocol.h |   5 +-
> > > > > > >  net/mptcp/subflow.c  |  10 +-
> > > > > > >  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> > > > > > >  4 files changed, 184 insertions(+), 93 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > > > > > index 16ca39ae314a..09152cb80e05 100644
> > > > > > > --- a/net/mptcp/protocol.c
> > > > > > > +++ b/net/mptcp/protocol.c
> > > > > > > @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> > > > > > >  	msk->token = subflow_req->token;
> > > > > > >  	msk->subflow = NULL;
> > > > > > > 
> > > > > > > -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> > > > > > > -		nsk->sk_state = TCP_CLOSE;
> > > > > > > -		bh_unlock_sock(nsk);
> > > > > > > -
> > > > > > > -		/* we can't call into mptcp_close() here - possible BH context
> > > > > > > -		 * free the sock directly.
> > > > > > > -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> > > > > > > -		 * too.
> > > > > > > -		 */
> > > > > > > -		sk_common_release(nsk);
> > > > > > > -		sk_free(nsk);
> > > > > > > -		return NULL;
> > > > > > > -	}
> > > > > > > +	mptcp_token_accept(subflow_req, msk);
> > > > > > > 
> > > > > > >  	msk->write_seq = subflow_req->idsn + 1;
> > > > > > >  	atomic64_set(&msk->snd_una, msk->write_seq);
> > > > > > > @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> > > > > > >  	 */
> > > > > > >  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> > > > > > >  	WRITE_ONCE(msk->local_key, subflow->local_key);
> > > > > > > -	WRITE_ONCE(msk->token, subflow->token);
> > > > > > >  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> > > > > > >  	WRITE_ONCE(msk->ack_seq, ack_seq);
> > > > > > >  	WRITE_ONCE(msk->can_ack, 1);
> > > > > > > @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> > > > > > >  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> > > > > > >  	.sysctl_mem	= sysctl_tcp_mem,
> > > > > > >  	.obj_size	= sizeof(struct mptcp_sock),
> > > > > > > +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> > > > > > 
> > > > > > I wonder if you now need to be careful when allocating and zeroing the
> > > > > > socket, the same way it is done in sock_copy()?
> > > > > >
> > > > > > In the out-of-tree kernel I had to take care of that by bringing back the
> > > > > > clear_sk callback in struct proto, which is then called from sk_prot_alloc():
> > > > > > 
> > > > > > https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
> > > > > > 
> > > > > > https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
> > > > > 
> > > > > Indeed! I felt like I was missing some relevant point!
> > > > > 
> > > > > re-adding the clear_sk callback for mptcp's sake does not look 110%
> > > > > nice. There are a few alternatives, which sound equally suboptimal to
> > > > > me:
> > > > >
> > > > > 1) use plain RCU (with kfree_rcu() and all the relevant memory
> > > > > overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
> > > > > do that for the subflow contexts (~200 bytes). This is simple, but the
> > > > > memory overhead could be significant ?!?
> > > > >
> > > > > 	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
> > > > > guessing we can use 'struct sock' as the base type and add mptcp custom
> > > > > fields there ?!?
> > > > > 
> > > > > 2) use ULP for msk, too. Move the token there (and possibly all mptcp-
> > > > > specific data), and use plain RCU to handle the context. As a downside
> > > > > we will need 2 allocations per accept() (the msk socket and the msk ulp
> > > > > context)
> > > > > 
> > > > > Any other better option?!?
> > > > > 
> > > > 
> > > > I lean toward option 1 so we don't have to change around ULP to not expose 
> > > > MPTCP stuff (we rely on a kernel sock check to keep the subflow ULP from 
> > > > being exposed to userspace). I'd also like to keep ULP available for 
> > > > possible TLS support someday.
> > > 
> > > It just occurred to me - a likely crazy 3rd alternative. What if we define
> > > a new list_nulls variant with 'nulls' values having the least
> > > significant bit zeroed? (the current implementation requires the
> > > opposite).
> > >
> > > Something like:
> > >
> > > struct list_0_node {
> > > 	struct list_0_node *next;	/* low bit set for real nodes */
> > > };
> > >
> > > /* strip the tag bit to get a dereferenceable pointer */
> > > struct list_0_node *list_0_deref(struct list_0_node *node)
> > > {
> > > 	return (struct list_0_node *)((unsigned long)node & ~1UL);
> > > }
> > >
> > > struct list_0_node *list_0_next(struct list_0_node *node)
> > > {
> > > 	return list_0_deref(node)->next;
> > > }
> > >
> > > bool list_0_is_null(struct list_0_node *node)
> > > {
> > > 	return !((unsigned long)node & 1);
> > > }
> > >
> > > [...]
> > >
> > > This way, memsetting a struct to 0 will preserve the NULL value and we
> > > should not need any additional care. Am I missing something relevant ?!?
> > 
> > Meaning, when the bytes get zeroed, the nulls value becomes 0 and we always "goto again"?
> > 
> > That's a neat idea IMO. I think it could work.
> 
> That was the idea ;)
> 
> > But it means that a 0 nulls-value is invalid, right? (which is easy to take
> > care of)
> 
> Why? (I like the idea that NULL is a null ;) I don't see why it can't
> be ?!?)

I was wondering how, in the lookup, you handle the end case when you hit a
NULL. Would you be able to tell whether the token is not present or the
socket has been zeroed?

> Anyhow I opted for a different, hopefully less invasive solution -
> using the currently unused msk->sk_node to insert the msk into the
> token hash - please see v2 :)

Nice, that solution is even cleaner!


Christoph


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-27 11:00 Paolo Abeni
From: Paolo Abeni @ 2020-05-27 11:00 UTC (permalink / raw)
  To: mptcp


On Tue, 2020-05-26 at 09:23 -0700, Christoph Paasch wrote:
> On 05/25/20 - 12:42, Paolo Abeni wrote:
> > On Fri, 2020-05-22 at 18:10 -0700, Mat Martineau wrote:
> > > On Fri, 22 May 2020, Paolo Abeni wrote:
> > > 
> > > > On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
> > > > > Hello,
> > > > > 
> > > > > On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> > > > > > Replace the radix tree with a hash table allocated
> > > > > > at boot time. The radix tree has some shortcomings:
> > > > > > a single lock is contended by all the MPTCP operations,
> > > > > > the lookup currently uses that lock, and traversing
> > > > > > all the items would require the lock, too.
> > > > > >
> > > > > > With a hash table we instead trade a little memory to
> > > > > > address all of the above - a per-bucket lock is used.
> > > > > >
> > > > > > Additionally, refactor the token creation code to:
> > > > > >
> > > > > > - limit the number of consecutive attempts to a fixed
> > > > > > maximum. Hitting a hash bucket with a long chain is
> > > > > > considered a failed attempt
> > > > > >
> > > > > > - accept() can no longer fail due to token management.
> > > > > >
> > > > > > - if token creation fails at connect() time, we now
> > > > > > fall back to TCP (before, the connection was closed)
> > > > > > 
> > > > > > Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> > > > > > ---
> > > > > >  net/mptcp/protocol.c |  16 +--
> > > > > >  net/mptcp/protocol.h |   5 +-
> > > > > >  net/mptcp/subflow.c  |  10 +-
> > > > > >  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> > > > > >  4 files changed, 184 insertions(+), 93 deletions(-)
> > > > > > 
> > > > > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > > > > index 16ca39ae314a..09152cb80e05 100644
> > > > > > --- a/net/mptcp/protocol.c
> > > > > > +++ b/net/mptcp/protocol.c
> > > > > > @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> > > > > >  	msk->token = subflow_req->token;
> > > > > >  	msk->subflow = NULL;
> > > > > > 
> > > > > > -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> > > > > > -		nsk->sk_state = TCP_CLOSE;
> > > > > > -		bh_unlock_sock(nsk);
> > > > > > -
> > > > > > -		/* we can't call into mptcp_close() here - possible BH context
> > > > > > -		 * free the sock directly.
> > > > > > -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> > > > > > -		 * too.
> > > > > > -		 */
> > > > > > -		sk_common_release(nsk);
> > > > > > -		sk_free(nsk);
> > > > > > -		return NULL;
> > > > > > -	}
> > > > > > +	mptcp_token_accept(subflow_req, msk);
> > > > > > 
> > > > > >  	msk->write_seq = subflow_req->idsn + 1;
> > > > > >  	atomic64_set(&msk->snd_una, msk->write_seq);
> > > > > > @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> > > > > >  	 */
> > > > > >  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> > > > > >  	WRITE_ONCE(msk->local_key, subflow->local_key);
> > > > > > -	WRITE_ONCE(msk->token, subflow->token);
> > > > > >  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> > > > > >  	WRITE_ONCE(msk->ack_seq, ack_seq);
> > > > > >  	WRITE_ONCE(msk->can_ack, 1);
> > > > > > @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> > > > > >  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> > > > > >  	.sysctl_mem	= sysctl_tcp_mem,
> > > > > >  	.obj_size	= sizeof(struct mptcp_sock),
> > > > > > +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> > > > > 
> > > > > I wonder if you now need to be careful when allocating and zeroing the
> > > > > socket, the same way it is done in sock_copy()?
> > > > >
> > > > > In the out-of-tree kernel I had to take care of that by bringing back the
> > > > > clear_sk callback in struct proto, which is then called from sk_prot_alloc():
> > > > > 
> > > > > https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
> > > > > 
> > > > > https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
> > > > 
> > > > Indeed! I felt like I was missing some relevant point!
> > > > 
> > > > re-adding the clear_sk callback for mptcp's sake does not look 110%
> > > > nice. There are a few alternatives, which sound equally suboptimal to
> > > > me:
> > > >
> > > > 1) use plain RCU (with kfree_rcu() and all the relevant memory
> > > > overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
> > > > do that for the subflow contexts (~200 bytes). This is simple, but the
> > > > memory overhead could be significant ?!?
> > > >
> > > > 	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
> > > > guessing we can use 'struct sock' as the base type and add mptcp custom
> > > > fields there ?!?
> > > > 
> > > > 2) use ULP for msk, too. Move the token there (and possibly all mptcp-
> > > > specific data), and use plain RCU to handle the context. As a downside
> > > > we will need 2 allocations per accept() (the msk socket and the msk ulp
> > > > context)
> > > > 
> > > > Any other better option?!?
> > > > 
> > > 
> > > I lean toward option 1 so we don't have to change around ULP to not expose 
> > > MPTCP stuff (we rely on a kernel sock check to keep the subflow ULP from 
> > > being exposed to userspace). I'd also like to keep ULP available for 
> > > possible TLS support someday.
> > 
> > It just occurred to me - a likely crazy 3rd alternative. What if we define
> > a new list_nulls variant with 'nulls' values having the least
> > significant bit zeroed? (the current implementation requires the
> > opposite).
> >
> > Something like:
> >
> > struct list_0_node {
> > 	struct list_0_node *next;	/* low bit set for real nodes */
> > };
> >
> > /* strip the tag bit to get a dereferenceable pointer */
> > struct list_0_node *list_0_deref(struct list_0_node *node)
> > {
> > 	return (struct list_0_node *)((unsigned long)node & ~1UL);
> > }
> >
> > struct list_0_node *list_0_next(struct list_0_node *node)
> > {
> > 	return list_0_deref(node)->next;
> > }
> >
> > bool list_0_is_null(struct list_0_node *node)
> > {
> > 	return !((unsigned long)node & 1);
> > }
> >
> > [...]
> >
> > This way, memsetting a struct to 0 will preserve the NULL value and we
> > should not need any additional care. Am I missing something relevant ?!?
> 
> Meaning, when the bytes get zeroed, the nulls value becomes 0 and we always "goto again"?
> 
> That's a neat idea IMO. I think it could work.

That was the idea ;)

> But it means that a 0 nulls-value is invalid, right? (which is easy to take
> care of)

Why? (I like the idea that NULL is a null ;) I don't see why it can't
be ?!?)

Anyhow I opted for a different, hopefully less invasive solution -
using the currently unused msk->sk_node to insert the msk into the
token hash - please see v2 :)
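
Roughly, a sketch of the idea (the actual v2 code may differ; an
assumption of mine is that the nulls flavour of the node, sk_nulls_node,
which shares storage with sk_node, is the one used; token_bucket() etc.
are the helpers from this patch):

/* reuse the node already embedded in struct sock - unused for msk
 * sockets - instead of adding a dedicated token_node field
 */
static void token_insert_msk(struct mptcp_sock *msk, u32 token)
{
	struct token_bucket *bucket = token_bucket(token);

	spin_lock_bh(&bucket->lock);
	__sk_nulls_add_node_rcu((struct sock *)msk, &bucket->msk_chain);
	bucket->chain_len++;
	spin_unlock_bh(&bucket->lock);
}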

Thanks,

Paolo


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-26 16:28 Mat Martineau
From: Mat Martineau @ 2020-05-26 16:28 UTC (permalink / raw)
  To: mptcp


On Mon, 25 May 2020, Paolo Abeni wrote:

> On Fri, 2020-05-22 at 17:34 -0700, Mat Martineau wrote:
>> On Fri, 22 May 2020, Paolo Abeni wrote:
>>> -static RADIX_TREE(token_tree, GFP_ATOMIC);
>>> -static RADIX_TREE(token_req_tree, GFP_ATOMIC);
>>> -static DEFINE_SPINLOCK(token_tree_lock);
>>> -static int token_used __read_mostly;
>>> +#define TOKEN_MAX_RETRIES	4
>>> +#define TOKEN_MAX_CHAIN_LEN	4
>>> +
>>> +struct token_bucket {
>>> +	spinlock_t		lock;
>>> +	int			chain_len;
>>> +	struct hlist_nulls_head	req_chain;
>>> +	struct hlist_nulls_head	msk_chain;
>>> +};
>>
>> I remember a discussion a while ago about how we didn't really need two
>> radix trees for req & msk, since there's only one "token space".
>> Similarly, seems like we only need one linked list per bucket if we can
>> define a common structure to use in both sock types, like:
>>
>> struct mptcp_token_node {
>>  	struct hlist_nulls_node node;
>>  	u32 value;
>>  	bool is_req;
>> };
>>
>> It would use a little more space per mptcp_sock/req_sock and would require
>> extra code in the lookup to return the appropriate container pointer (req
>> or mptcp_sock), but would shrink the hash table or give us more buckets in
>> the same memory, which seems like a win.
>
> In v2 I had to drop the above idea because it would not fit with the
> current solution for the list corruption issue.
>

Yes, the list corruption issue is more important than the above 
optimization!

> Anyhow I have an additional doubt: how can the above fit the list
> traversal helper? Don't we need to allocate 'mptcp_token_node'
> separately and keep a back ptr to the main socket/req???
>
> Otherwise 'mptcp_token_node' should have the same offset inside 'struct
> mptcp_sock' and inside 'struct mptcp_subflow_request_sock' to fit
> hlist_nulls_entry(), right?
>

The idea was to wrap up the information the list traversal helper needed 
in mptcp_token_node. Then use is_req to determine which structure type 
(mptcp_sock or mptcp_subflow_request_sock) to use with container_of().
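
A minimal sketch of that lookup step, assuming 'token_node' is the
member name in both containers (hypothetical, for illustration only):

static void *mptcp_token_node_owner(struct mptcp_token_node *tn)
{
	if (tn->is_req)
		return container_of(tn, struct mptcp_subflow_request_sock,
				    token_node);
	return container_of(tn, struct mptcp_sock, token_node);
}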

--
Mat Martineau
Intel


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-26 16:23 Christoph Paasch
From: Christoph Paasch @ 2020-05-26 16:23 UTC (permalink / raw)
  To: mptcp


On 05/25/20 - 12:42, Paolo Abeni wrote:
> On Fri, 2020-05-22 at 18:10 -0700, Mat Martineau wrote:
> > On Fri, 22 May 2020, Paolo Abeni wrote:
> > 
> > > On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
> > > > Hello,
> > > > 
> > > > On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> > > > > Replace the radix tree with a hash table allocated
> > > > > at boot time. The radix tree has some shortcomings:
> > > > > a single lock is contended by all the MPTCP operations,
> > > > > the lookup currently uses that lock, and traversing
> > > > > all the items would require the lock, too.
> > > > >
> > > > > With a hash table we instead trade a little memory to
> > > > > address all of the above - a per-bucket lock is used.
> > > > >
> > > > > Additionally, refactor the token creation code to:
> > > > >
> > > > > - limit the number of consecutive attempts to a fixed
> > > > > maximum. Hitting a hash bucket with a long chain is
> > > > > considered a failed attempt
> > > > >
> > > > > - accept() can no longer fail due to token management.
> > > > >
> > > > > - if token creation fails at connect() time, we now
> > > > > fall back to TCP (before, the connection was closed)
> > > > > 
> > > > > Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> > > > > ---
> > > > >  net/mptcp/protocol.c |  16 +--
> > > > >  net/mptcp/protocol.h |   5 +-
> > > > >  net/mptcp/subflow.c  |  10 +-
> > > > >  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> > > > >  4 files changed, 184 insertions(+), 93 deletions(-)
> > > > > 
> > > > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > > > index 16ca39ae314a..09152cb80e05 100644
> > > > > --- a/net/mptcp/protocol.c
> > > > > +++ b/net/mptcp/protocol.c
> > > > > @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> > > > >  	msk->token = subflow_req->token;
> > > > >  	msk->subflow = NULL;
> > > > > 
> > > > > -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> > > > > -		nsk->sk_state = TCP_CLOSE;
> > > > > -		bh_unlock_sock(nsk);
> > > > > -
> > > > > -		/* we can't call into mptcp_close() here - possible BH context
> > > > > -		 * free the sock directly.
> > > > > -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> > > > > -		 * too.
> > > > > -		 */
> > > > > -		sk_common_release(nsk);
> > > > > -		sk_free(nsk);
> > > > > -		return NULL;
> > > > > -	}
> > > > > +	mptcp_token_accept(subflow_req, msk);
> > > > > 
> > > > >  	msk->write_seq = subflow_req->idsn + 1;
> > > > >  	atomic64_set(&msk->snd_una, msk->write_seq);
> > > > > @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> > > > >  	 */
> > > > >  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> > > > >  	WRITE_ONCE(msk->local_key, subflow->local_key);
> > > > > -	WRITE_ONCE(msk->token, subflow->token);
> > > > >  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> > > > >  	WRITE_ONCE(msk->ack_seq, ack_seq);
> > > > >  	WRITE_ONCE(msk->can_ack, 1);
> > > > > @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> > > > >  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> > > > >  	.sysctl_mem	= sysctl_tcp_mem,
> > > > >  	.obj_size	= sizeof(struct mptcp_sock),
> > > > > +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> > > > 
> > > > I wonder if you now need to be careful when allocating and zeroing the
> > > > socket, the same way it is done in sock_copy()?
> > > >
> > > > In the out-of-tree kernel I had to take care of that by bringing back the
> > > > clear_sk callback in struct proto, which is then called from sk_prot_alloc():
> > > > 
> > > > https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
> > > > 
> > > > https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
> > > 
> > > Indeed! I felt like I was missing some relevant point!
> > > 
> > > re-adding the clear_sk callback for mptcp's sake does not look 110%
> > > nice. There are a few alternatives, which sound equally suboptimal to
> > > me:
> > >
> > > 1) use plain RCU (with kfree_rcu() and all the relevant memory
> > > overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
> > > do that for the subflow contexts (~200 bytes). This is simple, but the
> > > memory overhead could be significant ?!?
> > >
> > > 	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
> > > guessing we can use 'struct sock' as the base type and add mptcp custom
> > > fields there ?!?
> > > 
> > > 2) use ULP for msk, too. Move the token there (and possibly all mptcp-
> > > specific data), and use plain RCU to handle the context. As a downside
> > > we will need 2 allocations per accept() (the msk socket and the msk ulp
> > > context)
> > > 
> > > Any other better option?!?
> > > 
> > 
> > I lean toward option 1 so we don't have to change around ULP to not expose 
> > MPTCP stuff (we rely on a kernel sock check to keep the subflow ULP from 
> > being exposed to userspace). I'd also like to keep ULP available for 
> > possible TLS support someday.
> 
> It just occurred to me - a likely crazy 3rd alternative. What if we define
> a new list_nulls variant with 'nulls' values having the least
> significant bit zeroed? (the current implementation requires the
> opposite).
>
> Something like:
>
> struct list_0_node {
> 	struct list_0_node *next;	/* low bit set for real nodes */
> };
>
> /* strip the tag bit to get a dereferenceable pointer */
> struct list_0_node *list_0_deref(struct list_0_node *node)
> {
> 	return (struct list_0_node *)((unsigned long)node & ~1UL);
> }
>
> struct list_0_node *list_0_next(struct list_0_node *node)
> {
> 	return list_0_deref(node)->next;
> }
>
> bool list_0_is_null(struct list_0_node *node)
> {
> 	return !((unsigned long)node & 1);
> }
>
> [...]
>
> This way, memsetting a struct to 0 will preserve the NULL value and we
> should not need any additional care. Am I missing something relevant ?!?

Meaning, when the bytes get zeroed, the nulls value becomes 0 and we always "goto again"?

That's a neat idea IMO. I think it could work.

But it means that a 0 nulls-value is invalid, right? (which is easy to take
care of)



Christoph


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-25 17:41 Paolo Abeni
From: Paolo Abeni @ 2020-05-25 17:41 UTC (permalink / raw)
  To: mptcp


On Fri, 2020-05-22 at 17:34 -0700, Mat Martineau wrote:
> On Fri, 22 May 2020, Paolo Abeni wrote:
> > -static RADIX_TREE(token_tree, GFP_ATOMIC);
> > -static RADIX_TREE(token_req_tree, GFP_ATOMIC);
> > -static DEFINE_SPINLOCK(token_tree_lock);
> > -static int token_used __read_mostly;
> > +#define TOKEN_MAX_RETRIES	4
> > +#define TOKEN_MAX_CHAIN_LEN	4
> > +
> > +struct token_bucket {
> > +	spinlock_t		lock;
> > +	int			chain_len;
> > +	struct hlist_nulls_head	req_chain;
> > +	struct hlist_nulls_head	msk_chain;
> > +};
> 
> I remember a discussion a while ago about how we didn't really need two 
> radix trees for req & msk, since there's only one "token space". 
> Similarly, seems like we only need one linked list per bucket if we can 
> define a common structure to use in both sock types, like:
> 
> struct mptcp_token_node {
>  	struct hlist_nulls_node node;
>  	u32 value;
>  	bool is_req;
> };
> 
> It would use a little more space per mptcp_sock/req_sock and would require 
> extra code in the lookup to return the appropriate container pointer (req 
> or mptcp_sock), but would shrink the hash table or give us more buckets in 
> the same memory, which seems like a win.

In v2 I had to drop the above idea because it would not fit with the
current solution for the list corruption issue.

Anyhow I have an additional doubt: how can the above fit the list
traversal helper? Don't we need to allocate 'mptcp_token_node'
separately and keep a back ptr to the main socket/req???

Otherwise 'mptcp_token_node' should have the same offset inside 'struct
mptcp_sock' and inside 'struct mptcp_subflow_request_sock' to fit
hlist_nulls_entry(), right? 
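
For reference, hlist_nulls_entry() is just container_of() in disguise
(include/linux/list_nulls.h), hence the same-offset requirement:

#define hlist_nulls_entry(ptr, type, member) container_of(ptr, type, member)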

Thanks,

Paolo


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-25 10:42 Paolo Abeni
From: Paolo Abeni @ 2020-05-25 10:42 UTC (permalink / raw)
  To: mptcp


On Fri, 2020-05-22 at 18:10 -0700, Mat Martineau wrote:
> On Fri, 22 May 2020, Paolo Abeni wrote:
> 
> > On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
> > > Hello,
> > > 
> > > On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> > > > Replace the radix tree with a hash table allocated
> > > > at boot time. The radix tree has some shortcomings:
> > > > a single lock is contended by all the MPTCP operations,
> > > > the lookup currently uses that lock, and traversing
> > > > all the items would require the lock, too.
> > > >
> > > > With a hash table we instead trade a little memory to
> > > > address all of the above - a per-bucket lock is used.
> > > >
> > > > Additionally, refactor the token creation code to:
> > > >
> > > > - limit the number of consecutive attempts to a fixed
> > > > maximum. Hitting a hash bucket with a long chain is
> > > > considered a failed attempt
> > > >
> > > > - accept() can no longer fail due to token management.
> > > >
> > > > - if token creation fails at connect() time, we now
> > > > fall back to TCP (before, the connection was closed)
> > > > 
> > > > Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> > > > ---
> > > >  net/mptcp/protocol.c |  16 +--
> > > >  net/mptcp/protocol.h |   5 +-
> > > >  net/mptcp/subflow.c  |  10 +-
> > > >  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> > > >  4 files changed, 184 insertions(+), 93 deletions(-)
> > > > 
> > > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > > index 16ca39ae314a..09152cb80e05 100644
> > > > --- a/net/mptcp/protocol.c
> > > > +++ b/net/mptcp/protocol.c
> > > > @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> > > >  	msk->token = subflow_req->token;
> > > >  	msk->subflow = NULL;
> > > > 
> > > > -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> > > > -		nsk->sk_state = TCP_CLOSE;
> > > > -		bh_unlock_sock(nsk);
> > > > -
> > > > -		/* we can't call into mptcp_close() here - possible BH context
> > > > -		 * free the sock directly.
> > > > -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> > > > -		 * too.
> > > > -		 */
> > > > -		sk_common_release(nsk);
> > > > -		sk_free(nsk);
> > > > -		return NULL;
> > > > -	}
> > > > +	mptcp_token_accept(subflow_req, msk);
> > > > 
> > > >  	msk->write_seq = subflow_req->idsn + 1;
> > > >  	atomic64_set(&msk->snd_una, msk->write_seq);
> > > > @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> > > >  	 */
> > > >  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> > > >  	WRITE_ONCE(msk->local_key, subflow->local_key);
> > > > -	WRITE_ONCE(msk->token, subflow->token);
> > > >  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> > > >  	WRITE_ONCE(msk->ack_seq, ack_seq);
> > > >  	WRITE_ONCE(msk->can_ack, 1);
> > > > @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> > > >  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> > > >  	.sysctl_mem	= sysctl_tcp_mem,
> > > >  	.obj_size	= sizeof(struct mptcp_sock),
> > > > +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> > > 
> > > I wonder if you now need to be careful when allocating and zeroing the
> > > socket, the same way it is done in sock_copy()?
> > >
> > > In the out-of-tree kernel I had to take care of that by bringing back the
> > > clear_sk callback in struct proto, which is then called from sk_prot_alloc():
> > > 
> > > https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
> > > 
> > > https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
> > 
> > Indeed! I felt like I was missing some relevant point!
> > 
> > re-adding the clear_sk callback for mptcp's sake does not look 110%
> > nice. There are a few alternatives, which sound equally suboptimal to
> > me:
> >
> > 1) use plain RCU (with kfree_rcu() and all the relevant memory
> > overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
> > do that for the subflow contexts (~200 bytes). This is simple, but the
> > memory overhead could be significant ?!?
> >
> > 	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
> > guessing we can use 'struct sock' as the base type and add mptcp custom
> > fields there ?!?
> > 
> > 2) use ULP for msk, too. Move the token there (and possibly all mptcp-
> > specific data), and use plain RCU to handle the context. As a downside
> > we will need 2 allocations per accept() (the msk socket and the msk ulp
> > context)
> > 
> > Any other better option?!?
> > 
> 
> I lean toward option 1 so we don't have to change around ULP to not expose 
> MPTCP stuff (we rely on a kernel sock check to keep the subflow ULP from 
> being exposed to userspace). I'd also like to keep ULP available for 
> possible TLS support someday.

It just occurred to me - a likely crazy 3rd alternative. What if we define
a new list_nulls variant with 'nulls' values having the least
significant bit zeroed? (the current implementation requires the
opposite).

Something like:

struct list_0_node {
	struct list_0_node *next;	/* low bit set for real nodes */
};

/* strip the tag bit to get a dereferenceable pointer */
struct list_0_node *list_0_deref(struct list_0_node *node)
{
	return (struct list_0_node *)((unsigned long)node & ~1UL);
}

struct list_0_node *list_0_next(struct list_0_node *node)
{
	return list_0_deref(node)->next;
}

bool list_0_is_null(struct list_0_node *node)
{
	return !((unsigned long)node & 1);
}

[...]

This way, memsetting a struct to 0 will preserve the NULL value and we
should not need any additional care. Am I missing something relevant ?!?
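
A hypothetical lockless walk with this encoding would then look like
(sketch only - 'cur' carries tagged pointers, so a zeroed/reused object
reads as a nulls marker and simply ends the walk, letting the caller
re-validate and retry):

void list_0_walk(struct list_0_node *first)
{
	struct list_0_node *cur;

	for (cur = first; !list_0_is_null(cur); cur = list_0_next(cur)) {
		/* the entry is list_0_deref(cur); validate it before use */
	}
}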

Thanks,

Paolo




* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-25  8:10 Paolo Abeni
From: Paolo Abeni @ 2020-05-25  8:10 UTC (permalink / raw)
  To: mptcp


Hi,

On Fri, 2020-05-22 at 17:34 -0700, Mat Martineau wrote:
> > +struct token_bucket {
> > +	spinlock_t		lock;
> > +	int			chain_len;
> > +	struct hlist_nulls_head	req_chain;
> > +	struct hlist_nulls_head	msk_chain;
> > +};
> 
> I remember a discussion a while ago about how we didn't really need two 
> radix trees for req & msk, since there's only one "token space". 
> Similarly, seems like we only need one linked list per bucket if we can 
> define a common structure to use in both sock types, like:
> 
> struct mptcp_token_node {
>  	struct hlist_nulls_node node;
>  	u32 value;
>  	bool is_req;
> };
> 
> It would use a little more space per mptcp_sock/req_sock and would require 
> extra code in the lookup to return the appropriate container pointer (req 
> or mptcp_sock), but would shrink the hash table or give us more buckets in 
> the same memory, which seems like a win.

LGTM, I'll try that.

[...]
> > struct mptcp_sock *mptcp_token_get_sock(u32 token)
> > {
> > -	struct sock *conn;
> > -
> > -	spin_lock_bh(&token_tree_lock);
> > -	conn = radix_tree_lookup(&token_tree, token);
> > -	if (conn) {
> > -		/* token still reserved? */
> > -		if (conn == (struct sock *)&token_used)
> > -			conn = NULL;
> > -		else
> > -			sock_hold(conn);
> > +	struct hlist_nulls_node *pos;
> > +	struct token_bucket *bucket;
> > +	struct mptcp_sock *msk;
> > +
> > +	rcu_read_lock();
> > +	bucket = token_bucket(token);
> > +
> > +again:
> > +	hlist_nulls_for_each_entry_rcu(msk, pos, &bucket->msk_chain,
> > +				       token_node) {
> > +		if (READ_ONCE(msk->token) != token)
> > +			continue;
> > +		if (!refcount_inc_not_zero(&((struct sock *)msk)->sk_refcnt))
> > +			goto not_found;
> > +		if (READ_ONCE(msk->token) != token)
> > +			goto again;
> 
> Checking msk->token twice here, but it doesn't look like it ever gets 
> changed after it's set?

AFAICT, this is the common list_nulls usage pattern. The problem is
that an msk can be freed and re-allocated concurrently by another CPU.
Since this lookup is lockless, the token value can possibly change
'under the hood' until we get a reference to the msk - which forbids
msk de-allocation.
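
To illustrate the race the second check guards against (a sketch,
assuming SLAB_TYPESAFE_BY_RCU reuse of the msk slab object):

/*
 *   CPU0 (lockless lookup)             CPU1
 *
 *   msk->token == token -> match
 *                                      frees msk (refcnt hits 0)
 *                                      slab reuses the same object,
 *                                      writes a different token
 *   refcount_inc_not_zero() succeeds   (object is alive again)
 *   msk->token != token -> goto again
 *
 * Without the re-check after taking the reference, we could return an
 * msk that now belongs to a different connection.
 */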

Cheers,

Paolo


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-23  1:10 Mat Martineau
From: Mat Martineau @ 2020-05-23  1:10 UTC (permalink / raw)
  To: mptcp


On Fri, 22 May 2020, Paolo Abeni wrote:

> On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
>> Hello,
>>
>> On 22/05/20 - 12:10:11, Paolo Abeni wrote:
>>> Replace the radix tree with a hash table allocated
>>> at boot time. The radix tree has some shortcomings:
>>> a single lock is contended by all the MPTCP operations,
>>> the lookup currently uses that lock, and traversing
>>> all the items would require the lock, too.
>>>
>>> With a hash table we instead trade a little memory to
>>> address all of the above - a per-bucket lock is used.
>>>
>>> Additionally, refactor the token creation code to:
>>>
>>> - limit the number of consecutive attempts to a fixed
>>> maximum. Hitting a hash bucket with a long chain is
>>> considered a failed attempt
>>>
>>> - accept() can no longer fail due to token management.
>>>
>>> - if token creation fails at connect() time, we now
>>> fall back to TCP (before, the connection was closed)
>>>
>>> Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
>>> ---
>>>  net/mptcp/protocol.c |  16 +--
>>>  net/mptcp/protocol.h |   5 +-
>>>  net/mptcp/subflow.c  |  10 +-
>>>  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
>>>  4 files changed, 184 insertions(+), 93 deletions(-)
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index 16ca39ae314a..09152cb80e05 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
>>>  	msk->token = subflow_req->token;
>>>  	msk->subflow = NULL;
>>>
>>> -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
>>> -		nsk->sk_state = TCP_CLOSE;
>>> -		bh_unlock_sock(nsk);
>>> -
>>> -		/* we can't call into mptcp_close() here - possible BH context
>>> -		 * free the sock directly.
>>> -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
>>> -		 * too.
>>> -		 */
>>> -		sk_common_release(nsk);
>>> -		sk_free(nsk);
>>> -		return NULL;
>>> -	}
>>> +	mptcp_token_accept(subflow_req, msk);
>>>
>>>  	msk->write_seq = subflow_req->idsn + 1;
>>>  	atomic64_set(&msk->snd_una, msk->write_seq);
>>> @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
>>>  	 */
>>>  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
>>>  	WRITE_ONCE(msk->local_key, subflow->local_key);
>>> -	WRITE_ONCE(msk->token, subflow->token);
>>>  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
>>>  	WRITE_ONCE(msk->ack_seq, ack_seq);
>>>  	WRITE_ONCE(msk->can_ack, 1);
>>> @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
>>>  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
>>>  	.sysctl_mem	= sysctl_tcp_mem,
>>>  	.obj_size	= sizeof(struct mptcp_sock),
>>> +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
>>
>> I wonder if you now need to be careful when allocating and zeroing the
>> socket, the same way it is done in sock_copy()?
>>
>> In the out-of-tree kernel I had to take care of that by bringing back the
>> clear_sk callback in struct proto, which is then called from sk_prot_alloc():
>>
>> https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
>>
>> https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
>
> Indeed! I felt like I was missing some relevant point!
>
> re-adding the clear_sk callback for mptcp's sake does not look 110%
> nice. There are a few alternatives, which sound equally suboptimal to
> me:
>
> 1) use plain RCU (with kfree_rcu() and all the relevant memory
> overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
> do that for the subflow contexts (~200 bytes). This is simple, but the
> memory overhead could be significant ?!?
>
> 	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
> guessing we can use 'struct sock' as the base type and add mptcp custom
> fields there ?!?
>
> 2) use ULP for msk, too. Move the token there (and possibly all mptcp-
> specific data), and use plain RCU to handle the context. As a downside
> we will need 2 allocations per accept() (the msk socket and the msk ulp
> context)
>
> Any other better option?!?
>

I lean toward option 1 so we don't have to change around ULP to not expose 
MPTCP stuff (we rely on a kernel sock check to keep the subflow ULP from 
being exposed to userspace). I'd also like to keep ULP available for 
possible TLS support someday.

--
Mat Martineau
Intel


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-23  0:34 Mat Martineau
From: Mat Martineau @ 2020-05-23  0:34 UTC (permalink / raw)
  To: mptcp



On Fri, 22 May 2020, Paolo Abeni wrote:

> Replace the radix tree with a hash table allocated
> at boot time. The radix tree has some shortcomings:
> a single lock is contended by all the MPTCP operations,
> the lookup currently uses that lock, and traversing
> all the items would require the lock, too.
>
> With a hash table we instead trade a little memory to
> address all of the above - a per-bucket lock is used.
>
> Additionally, refactor the token creation code to:
>
> - limit the number of consecutive attempts to a fixed
> maximum. Hitting a hash bucket with a long chain is
> considered a failed attempt
>
> - accept() can no longer fail due to token management.
>
> - if token creation fails at connect() time, we now
> fall back to TCP (before, the connection was closed)
>
> Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> ---
> net/mptcp/protocol.c |  16 +--
> net/mptcp/protocol.h |   5 +-
> net/mptcp/subflow.c  |  10 +-
> net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> 4 files changed, 184 insertions(+), 93 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 16ca39ae314a..09152cb80e05 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> 	msk->token = subflow_req->token;
> 	msk->subflow = NULL;
>
> -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> -		nsk->sk_state = TCP_CLOSE;
> -		bh_unlock_sock(nsk);
> -
> -		/* we can't call into mptcp_close() here - possible BH context
> -		 * free the sock directly.
> -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> -		 * too.
> -		 */
> -		sk_common_release(nsk);
> -		sk_free(nsk);
> -		return NULL;
> -	}
> +	mptcp_token_accept(subflow_req, msk);
>
> 	msk->write_seq = subflow_req->idsn + 1;
> 	atomic64_set(&msk->snd_una, msk->write_seq);
> @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> 	 */
> 	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> 	WRITE_ONCE(msk->local_key, subflow->local_key);
> -	WRITE_ONCE(msk->token, subflow->token);
> 	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> 	WRITE_ONCE(msk->ack_seq, ack_seq);
> 	WRITE_ONCE(msk->can_ack, 1);
> @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> 	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> 	.sysctl_mem	= sysctl_tcp_mem,
> 	.obj_size	= sizeof(struct mptcp_sock),
> +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> 	.no_autobind	= true,
> };
>
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 038c0237cca0..6f18ad2db7a1 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -211,6 +211,7 @@ struct mptcp_sock {
> 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
> 	struct sock	*first;
> 	struct mptcp_pm_data	pm;
> +	struct hlist_nulls_node token_node;
> };
>
> #define mptcp_for_each_subflow(__msk, __subflow)			\
> @@ -255,6 +256,7 @@ struct mptcp_subflow_request_sock {
> 	u64	thmac;
> 	u32	local_nonce;
> 	u32	remote_nonce;
> +	struct hlist_nulls_node token_node;
> };
>
> static inline struct mptcp_subflow_request_sock *
> @@ -386,7 +388,8 @@ void __init mptcp_token_init(void);
> int mptcp_token_new_request(struct request_sock *req);
> void mptcp_token_destroy_request(u32 token);
> int mptcp_token_new_connect(struct sock *sk);
> -int mptcp_token_new_accept(u32 token, struct sock *conn);
> +void mptcp_token_accept(struct mptcp_subflow_request_sock *r,
> +			struct mptcp_sock *msk);
> struct mptcp_sock *mptcp_token_get_sock(u32 token);
> void mptcp_token_destroy(u32 token);
>
> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> index 6f5b43afd5fd..88195510acf9 100644
> --- a/net/mptcp/subflow.c
> +++ b/net/mptcp/subflow.c
> @@ -31,11 +31,14 @@ static void SUBFLOW_REQ_INC_STATS(struct request_sock *req,
> static int subflow_rebuild_header(struct sock *sk)
> {
> 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
> -	int local_id, err = 0;
> +	int local_id;
>
> 	if (subflow->request_mptcp && !subflow->token) {
> 		pr_debug("subflow=%p", sk);
> -		err = mptcp_token_new_connect(sk);
> +		if (mptcp_token_new_connect(sk)) {
> +			subflow->mp_capable = 0;
> +			goto out;
> +		}
> 	} else if (subflow->request_join && !subflow->local_nonce) {
> 		struct mptcp_sock *msk = (struct mptcp_sock *)subflow->conn;
>
> @@ -56,9 +59,6 @@ static int subflow_rebuild_header(struct sock *sk)
> 	}
>
> out:
> -	if (err)
> -		return err;
> -
> 	return subflow->icsk_af_ops->rebuild_header(sk);
> }
>
> diff --git a/net/mptcp/token.c b/net/mptcp/token.c
> index 33352dd99d4d..d4a4df3c7c42 100644
> --- a/net/mptcp/token.c
> +++ b/net/mptcp/token.c
> @@ -24,7 +24,7 @@
>
> #include <linux/kernel.h>
> #include <linux/module.h>
> -#include <linux/radix-tree.h>
> +#include <linux/memblock.h>
> #include <linux/ip.h>
> #include <linux/tcp.h>
> #include <net/sock.h>
> @@ -33,10 +33,53 @@
> #include <net/mptcp.h>
> #include "protocol.h"
>
> -static RADIX_TREE(token_tree, GFP_ATOMIC);
> -static RADIX_TREE(token_req_tree, GFP_ATOMIC);
> -static DEFINE_SPINLOCK(token_tree_lock);
> -static int token_used __read_mostly;
> +#define TOKEN_MAX_RETRIES	4
> +#define TOKEN_MAX_CHAIN_LEN	4
> +
> +struct token_bucket {
> +	spinlock_t		lock;
> +	int			chain_len;
> +	struct hlist_nulls_head	req_chain;
> +	struct hlist_nulls_head	msk_chain;
> +};

I remember a discussion a while ago about how we didn't really need two 
radix trees for req & msk, since there's only one "token space". 
Similarly, seems like we only need one linked list per bucket if we can 
define a common structure to use in both sock types, like:

struct mptcp_token_node {
 	struct hlist_nulls_node node;
 	u32 value;
 	bool is_req;
};

It would use a little more space per mptcp_sock/req_sock and would require 
extra code in the lookup to return the appropriate container pointer (req 
or mptcp_sock), but would shrink the hash table or give us more buckets in 
the same memory, which seems like a win.
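
A sketch of the bucket this would enable (hypothetical - a single
chain replacing the req/msk pair):

struct token_bucket {
	spinlock_t		lock;
	int			chain_len;
	struct hlist_nulls_head	chain;	/* both req and msk entries */
};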

> +
> +static struct token_bucket *token_hash __read_mostly;
> +static unsigned int token_mask __read_mostly;
> +
> +static struct token_bucket *token_bucket(u32 token)
> +{
> +	return &token_hash[token & token_mask];
> +}
> +
> +static struct mptcp_subflow_request_sock *
> +__token_lookup_req(struct token_bucket *t, u32 token)
> +{
> +	struct mptcp_subflow_request_sock *req;
> +	struct hlist_nulls_node *pos;
> +
> +	hlist_nulls_for_each_entry_rcu(req, pos, &t->req_chain, token_node)
> +		if (req->token == token)
> +			return req;
> +	return NULL;
> +}
> +
> +static struct mptcp_sock *
> +__token_lookup_msk(struct token_bucket *t, u32 token)
> +{
> +	struct hlist_nulls_node *pos;
> +	struct mptcp_sock *msk;
> +
> +	hlist_nulls_for_each_entry_rcu(msk, pos, &t->msk_chain, token_node)
> +		if (msk->token == token)
> +			return msk;
> +	return NULL;
> +}
> +
> +bool __token_bucket_busy(struct token_bucket *t, u32 token)

static?

> +{
> +	return !token || t->chain_len >= TOKEN_MAX_CHAIN_LEN ||
> +	       __token_lookup_req(t, token) || __token_lookup_msk(t, token);
> +}
>
> /**
>  * mptcp_token_new_request - create new key/idsn/token for subflow_request
> @@ -52,30 +95,32 @@ static int token_used __read_mostly;
> int mptcp_token_new_request(struct request_sock *req)
> {
> 	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
> -	int err;
> -
> -	while (1) {
> -		u32 token;
> -
> -		mptcp_crypto_key_gen_sha(&subflow_req->local_key,
> -					 &subflow_req->token,
> -					 &subflow_req->idsn);
> -		pr_debug("req=%p local_key=%llu, token=%u, idsn=%llu\n",
> -			 req, subflow_req->local_key, subflow_req->token,
> -			 subflow_req->idsn);
> -
> -		token = subflow_req->token;
> -		spin_lock_bh(&token_tree_lock);
> -		if (!radix_tree_lookup(&token_req_tree, token) &&
> -		    !radix_tree_lookup(&token_tree, token))
> -			break;
> -		spin_unlock_bh(&token_tree_lock);
> +	int retries = TOKEN_MAX_RETRIES;
> +	struct token_bucket *bucket;
> +	u32 token;
> +
> +again:
> +	mptcp_crypto_key_gen_sha(&subflow_req->local_key,
> +				 &subflow_req->token,
> +				 &subflow_req->idsn);
> +	pr_debug("req=%p local_key=%llu, token=%u, idsn=%llu\n",
> +		 req, subflow_req->local_key, subflow_req->token,
> +		 subflow_req->idsn);
> +
> +	token = subflow_req->token;
> +	bucket = token_bucket(token);
> +	spin_lock_bh(&bucket->lock);
> +	if (__token_bucket_busy(bucket, token)) {
> +		spin_unlock_bh(&bucket->lock);
> +		if (!--retries)
> +			return -EBUSY;
> +		goto again;
> 	}
>
> -	err = radix_tree_insert(&token_req_tree,
> -				subflow_req->token, &token_used);
> -	spin_unlock_bh(&token_tree_lock);
> -	return err;
> +	hlist_nulls_add_tail_rcu(&subflow_req->token_node, &bucket->req_chain);
> +	bucket->chain_len++;
> +	spin_unlock_bh(&bucket->lock);
> +	return 0;
> }
>
> /**
> @@ -97,48 +142,54 @@ int mptcp_token_new_request(struct request_sock *req)
> int mptcp_token_new_connect(struct sock *sk)
> {
> 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
> -	struct sock *mptcp_sock = subflow->conn;
> -	int err;
> +	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
> +	int retries = TOKEN_MAX_RETRIES;
> +	struct token_bucket *bucket;
>
> -	while (1) {
> -		u32 token;
> +	pr_debug("ssk=%p, local_key=%llu, token=%u, idsn=%llu\n",
> +		 sk, subflow->local_key, subflow->token, subflow->idsn);
>
> -		mptcp_crypto_key_gen_sha(&subflow->local_key, &subflow->token,
> -					 &subflow->idsn);
> +again:
> +	mptcp_crypto_key_gen_sha(&subflow->local_key, &subflow->token,
> +				 &subflow->idsn);
>
> -		pr_debug("ssk=%p, local_key=%llu, token=%u, idsn=%llu\n",
> -			 sk, subflow->local_key, subflow->token, subflow->idsn);
> -
> -		token = subflow->token;
> -		spin_lock_bh(&token_tree_lock);
> -		if (!radix_tree_lookup(&token_req_tree, token) &&
> -		    !radix_tree_lookup(&token_tree, token))
> -			break;
> -		spin_unlock_bh(&token_tree_lock);
> +	bucket = token_bucket(subflow->token);
> +	spin_lock_bh(&bucket->lock);
> +	if (__token_bucket_busy(bucket, subflow->token)) {
> +		spin_unlock_bh(&bucket->lock);
> +		if (!--retries)
> +			return -EBUSY;
> +		goto again;
> 	}
> -	err = radix_tree_insert(&token_tree, subflow->token, mptcp_sock);
> -	spin_unlock_bh(&token_tree_lock);
>
> -	return err;
> +	WRITE_ONCE(msk->token, subflow->token);
> +	hlist_nulls_add_tail_rcu(&msk->token_node, &bucket->msk_chain);
> +	bucket->chain_len++;
> +	spin_unlock_bh(&bucket->lock);
> +	return 0;
> }
>
> /**
> - * mptcp_token_new_accept - insert token for later processing
> - * @token: the token to insert to the tree
> - * @conn: the just cloned socket linked to the new connection
> + * mptcp_token_accept - replace a req sk with full sock in token hash
> + * @req: the request socket to be removed
> + * @msk: the just cloned socket linked to the new connection
>  *
>  * Called when a SYN packet creates a new logical connection, i.e.
>  * is not a join request.
>  */
> -int mptcp_token_new_accept(u32 token, struct sock *conn)
> +void mptcp_token_accept(struct mptcp_subflow_request_sock *req,
> +			struct mptcp_sock *msk)
> {
> -	int err;
> -
> -	spin_lock_bh(&token_tree_lock);
> -	err = radix_tree_insert(&token_tree, token, conn);
> -	spin_unlock_bh(&token_tree_lock);
> +	struct mptcp_subflow_request_sock *pos;
> +	struct token_bucket *bucket;
>
> -	return err;
> +	bucket = token_bucket(req->token);
> +	spin_lock_bh(&bucket->lock);
> +	pos = __token_lookup_req(bucket, req->token);
> +	if (!WARN_ON_ONCE(pos != req))
> +		hlist_nulls_del_rcu(&req->token_node);
> +	hlist_nulls_add_tail_rcu(&msk->token_node, &bucket->msk_chain);
> +	spin_unlock_bh(&bucket->lock);
> }
>
> /**
> @@ -152,20 +203,33 @@ int mptcp_token_new_accept(u32 token, struct sock *conn)
>  */
> struct mptcp_sock *mptcp_token_get_sock(u32 token)
> {
> -	struct sock *conn;
> -
> -	spin_lock_bh(&token_tree_lock);
> -	conn = radix_tree_lookup(&token_tree, token);
> -	if (conn) {
> -		/* token still reserved? */
> -		if (conn == (struct sock *)&token_used)
> -			conn = NULL;
> -		else
> -			sock_hold(conn);
> +	struct hlist_nulls_node *pos;
> +	struct token_bucket *bucket;
> +	struct mptcp_sock *msk;
> +
> +	rcu_read_lock();
> +	bucket = token_bucket(token);
> +
> +again:
> +	hlist_nulls_for_each_entry_rcu(msk, pos, &bucket->msk_chain,
> +				       token_node) {
> +		if (READ_ONCE(msk->token) != token)
> +			continue;
> +		if (!refcount_inc_not_zero(&((struct sock *)msk)->sk_refcnt))
> +			goto not_found;
> +		if (READ_ONCE(msk->token) != token)
> +			goto again;

Checking msk->token twice here, but it doesn't look like it ever gets 
changed after it's set?


Thanks,

Mat


> +		goto found;
> 	}
> -	spin_unlock_bh(&token_tree_lock);
> +	if (get_nulls_value(pos) != (token & token_mask))
> +		goto again;
>
> -	return mptcp_sk(conn);
> +not_found:
> +	msk = NULL;
> +
> +found:
> +	rcu_read_unlock();
> +	return msk;
> }
>
> /**
> @@ -177,9 +241,17 @@ struct mptcp_sock *mptcp_token_get_sock(u32 token)
>  */
> void mptcp_token_destroy_request(u32 token)
> {
> -	spin_lock_bh(&token_tree_lock);
> -	radix_tree_delete(&token_req_tree, token);
> -	spin_unlock_bh(&token_tree_lock);
> +	struct mptcp_subflow_request_sock *pos;
> +	struct token_bucket *bucket;
> +
> +	bucket = token_bucket(token);
> +	spin_lock_bh(&bucket->lock);
> +	pos = __token_lookup_req(bucket, token);
> +	if (pos) {
> +		hlist_nulls_del_rcu(&pos->token_node);
> +		bucket->chain_len--;
> +	}
> +	spin_unlock_bh(&bucket->lock);
> }
>
> /**
> @@ -190,7 +262,35 @@ void mptcp_token_destroy_request(u32 token)
>  */
> void mptcp_token_destroy(u32 token)
> {
> -	spin_lock_bh(&token_tree_lock);
> -	radix_tree_delete(&token_tree, token);
> -	spin_unlock_bh(&token_tree_lock);
> +	struct token_bucket *bucket;
> +	struct mptcp_sock *pos;
> +
> +	bucket = token_bucket(token);
> +	spin_lock_bh(&bucket->lock);
> +	pos = __token_lookup_msk(bucket, token);
> +	if (pos) {
> +		hlist_nulls_del_rcu(&pos->token_node);
> +		bucket->chain_len--;
> +	}
> +	spin_unlock_bh(&bucket->lock);
> +}
> +
> +void __init mptcp_token_init(void)
> +{
> +	int i;
> +
> +	token_hash = alloc_large_system_hash("MPTCP token",
> +					     sizeof(struct token_bucket),
> +					     0,
> +					     20,/* one slot per 1MB of memory */
> +					     0,
> +					     NULL,
> +					     &token_mask,
> +					     0,
> +					     64 * 1024);
> +	for (i = 0; i < token_mask + 1; ++i) {
> +		INIT_HLIST_NULLS_HEAD(&token_hash[i].req_chain, i);
> +		INIT_HLIST_NULLS_HEAD(&token_hash[i].msk_chain, i);
> +		spin_lock_init(&token_hash[i].lock);
> +	}
> }
> -- 
> 2.21.3

--
Mat Martineau
Intel


* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-22 19:06 Christoph Paasch
From: Christoph Paasch @ 2020-05-22 19:06 UTC (permalink / raw)
  To: mptcp


On 22/05/20 - 19:37:05, Paolo Abeni wrote:
> On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
> > Hello,
> > 
> > On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> > > Replace the radix tree with a hash table allocated
> > > at boot time. The radix tree has some shortcomings:
> > > a single lock is contended by all the MPTCP operations,
> > > the lookup currently uses that lock, and traversing
> > > all the items would require the lock, too.
> > >
> > > With a hash table we instead trade a little memory to
> > > address all of the above - a per-bucket lock is used.
> > >
> > > Additionally, refactor the token creation code to:
> > >
> > > - limit the number of consecutive attempts to a fixed
> > > maximum. Hitting a hash bucket with a long chain is
> > > considered a failed attempt
> > >
> > > - accept() can no longer fail due to token management.
> > >
> > > - if token creation fails at connect() time, we now
> > > fall back to TCP (before, the connection was closed)
> > > 
> > > Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> > > ---
> > >  net/mptcp/protocol.c |  16 +--
> > >  net/mptcp/protocol.h |   5 +-
> > >  net/mptcp/subflow.c  |  10 +-
> > >  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> > >  4 files changed, 184 insertions(+), 93 deletions(-)
> > > 
> > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > index 16ca39ae314a..09152cb80e05 100644
> > > --- a/net/mptcp/protocol.c
> > > +++ b/net/mptcp/protocol.c
> > > @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> > >  	msk->token = subflow_req->token;
> > >  	msk->subflow = NULL;
> > >  
> > > -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> > > -		nsk->sk_state = TCP_CLOSE;
> > > -		bh_unlock_sock(nsk);
> > > -
> > > -		/* we can't call into mptcp_close() here - possible BH context
> > > -		 * free the sock directly.
> > > -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> > > -		 * too.
> > > -		 */
> > > -		sk_common_release(nsk);
> > > -		sk_free(nsk);
> > > -		return NULL;
> > > -	}
> > > +	mptcp_token_accept(subflow_req, msk);
> > >  
> > >  	msk->write_seq = subflow_req->idsn + 1;
> > >  	atomic64_set(&msk->snd_una, msk->write_seq);
> > > @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> > >  	 */
> > >  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> > >  	WRITE_ONCE(msk->local_key, subflow->local_key);
> > > -	WRITE_ONCE(msk->token, subflow->token);
> > >  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> > >  	WRITE_ONCE(msk->ack_seq, ack_seq);
> > >  	WRITE_ONCE(msk->can_ack, 1);
> > > @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> > >  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> > >  	.sysctl_mem	= sysctl_tcp_mem,
> > >  	.obj_size	= sizeof(struct mptcp_sock),
> > > +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> > 
> > I wonder if you now need to be careful when allocating and zeroing the
> > socket, the same way it is done in sock_copy()?
> >
> > In the out-of-tree kernel I had to take care of that by bringing back the
> > clear_sk callback in struct proto, which is then called from sk_prot_alloc():
> > 
> > https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
> > 
> > https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
> 
> Indeed! I felt like I was missing some relevant point!
> 
> re-adding the clear_sk callback for mptcp's sake does not look 110%
> nice. There are a few alternatives, which sound equally suboptimal to
> me:
>
> 1) use plain RCU (with kfree_rcu() and all the relevant memory
> overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
> do that for the subflow contexts (~200 bytes). This is simple, but the
> memory overhead could be significant ?!?
>
> 	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
> guessing we can use 'struct sock' as the base type and add mptcp custom
> fields there ?!?
> 
> 2) use ULP for msk, too. Move the token there (and possibly all mptcp-
> specific data), and use plain RCU to handle the context. As a downside
> we will need 2 allocations per accept() (the msk socket and the msk ulp
> context)
> 
> Any other better option?!?

If we exclude clear_sk, I can't see an option other than the ones you
described. Both have the problem that on a busy node we get the memory
overhead.
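
FWIW, for 1) I guess the mechanics could be as simple as setting
SOCK_RCU_FREE on the msk at clone time - untested, and assuming
sk_destruct() still defers __sk_destruct() through call_rcu() when that
flag is set:

	/* untested: defer the final __sk_destruct() by one RCU grace
	 * period, so that token lookups under rcu_read_lock() can never
	 * see a recycled msk; the cost is that every closed msk (~1600
	 * bytes for IPv4) lingers until the grace period expires
	 */
	sock_set_flag(nsk, SOCK_RCU_FREE);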


Christoph

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-22 17:37 Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2020-05-22 17:37 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 4186 bytes --]

On Fri, 2020-05-22 at 09:06 -0700, Christoph Paasch wrote:
> Hello,
> 
> On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> > Replace the radix tree with a hash table allocated
> > at boot time. The radix tree has some shortcomings:
> > a single lock is contended by all MPTCP operations,
> > the lookup currently uses that lock, and traversing
> > all the items would require the lock, too.
> > 
> > With a hash table instead we trade a little memory to
> > address all the above - a per-bucket lock is used.
> > 
> > Additionally refactor the token creation code to:
> > 
> > - limit the number of consecutive attempts to a fixed
> > maximum. Hitting a hash bucket with a long chain is
> > considered a failed attempt
> > 
> > - accept() can no longer fail due to token management.
> > 
> > - if token creation fails at connect() time, we fall
> > back to TCP (before, the connection was closed)
> > 
> > Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> > ---
> >  net/mptcp/protocol.c |  16 +--
> >  net/mptcp/protocol.h |   5 +-
> >  net/mptcp/subflow.c  |  10 +-
> >  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
> >  4 files changed, 184 insertions(+), 93 deletions(-)
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 16ca39ae314a..09152cb80e05 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> >  	msk->token = subflow_req->token;
> >  	msk->subflow = NULL;
> >  
> > -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> > -		nsk->sk_state = TCP_CLOSE;
> > -		bh_unlock_sock(nsk);
> > -
> > -		/* we can't call into mptcp_close() here - possible BH context
> > -		 * free the sock directly.
> > -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> > -		 * too.
> > -		 */
> > -		sk_common_release(nsk);
> > -		sk_free(nsk);
> > -		return NULL;
> > -	}
> > +	mptcp_token_accept(subflow_req, msk);
> >  
> >  	msk->write_seq = subflow_req->idsn + 1;
> >  	atomic64_set(&msk->snd_una, msk->write_seq);
> > @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
> >  	 */
> >  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
> >  	WRITE_ONCE(msk->local_key, subflow->local_key);
> > -	WRITE_ONCE(msk->token, subflow->token);
> >  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
> >  	WRITE_ONCE(msk->ack_seq, ack_seq);
> >  	WRITE_ONCE(msk->can_ack, 1);
> > @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
> >  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
> >  	.sysctl_mem	= sysctl_tcp_mem,
> >  	.obj_size	= sizeof(struct mptcp_sock),
> > +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
> 
> I wonder if you now need to be careful when allocating and zeroing the
> socket, the same way it is done in sock_copy()?
> 
> In the out-of-tree kernel I had to take care of that by bringing back the
> clear_sk callback in struct proto, which is then called from sk_prot_alloc():
> 
> https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867
> 
> https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486

Indeed! I felt like I was missing some relevant point!

Re-adding the clear_sk callback for MPTCP's sake does not look 110%
nice. There are a few alternatives, which sound equally suboptimal to
me:

1) use plain RCU (with kfree_rcu() and all the relevant memory
overhead) for the whole msk sockets (~1600 bytes for IPv4). We already
do that for the subflow contexts (~200 bytes). This is simple, but the
memory overhead could be relevant ?!?

	1.a) additionally shrink the mptcp_sock structure, e.g. I'm wild
guessing we can use 'struct sock' as the base type and add mptcp custom
fields there ?!?

2) use ULP for the msk, too. Move the token there (and possibly all
mptcp-specific data), and use plain RCU to handle the context. As a
downside we will need 2 allocations per accept() (the msk socket and
the msk ulp context); see the rough sketch below.
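
For 2), an untested sketch of the idea - the struct and the field names
are hypothetical, just to make the two-allocations trade-off concrete:

	/* hypothetical: the mptcp-specific lookup state lives outside the
	 * socket proper, so it can be freed with plain kfree_rcu() and the
	 * msk slab would not need SLAB_TYPESAFE_BY_RCU at all
	 */
	struct mptcp_token_ctx {
		u32			token;
		struct sock		*msk;	/* refcounted on lookup */
		struct hlist_nulls_node	token_node;
		struct rcu_head		rcu;	/* for kfree_rcu() on release */
	};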

Any other better option?!?

Thanks!

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container.
@ 2020-05-22 16:06 Christoph Paasch
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Paasch @ 2020-05-22 16:06 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 12956 bytes --]

Hello,

On 22/05/20 - 12:10:11, Paolo Abeni wrote:
> Replace the radix tree with a hash table allocated
> at boot time. The radix tree has some shortcomings:
> a single lock is contended by all MPTCP operations,
> the lookup currently uses that lock, and traversing
> all the items would require the lock, too.
> 
> With a hash table instead we trade a little memory to
> address all the above - a per-bucket lock is used.
> 
> Additionally refactor the token creation code to:
> 
> - limit the number of consecutive attempts to a fixed
> maximum. Hitting a hash bucket with a long chain is
> considered a failed attempt
> 
> - accept() can no longer fail due to token management.
> 
> - if token creation fails at connect() time, we fall
> back to TCP (before, the connection was closed)
> 
> Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> ---
>  net/mptcp/protocol.c |  16 +--
>  net/mptcp/protocol.h |   5 +-
>  net/mptcp/subflow.c  |  10 +-
>  net/mptcp/token.c    | 246 ++++++++++++++++++++++++++++++-------------
>  4 files changed, 184 insertions(+), 93 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 16ca39ae314a..09152cb80e05 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1424,19 +1424,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
>  	msk->token = subflow_req->token;
>  	msk->subflow = NULL;
>  
> -	if (unlikely(mptcp_token_new_accept(subflow_req->token, nsk))) {
> -		nsk->sk_state = TCP_CLOSE;
> -		bh_unlock_sock(nsk);
> -
> -		/* we can't call into mptcp_close() here - possible BH context
> -		 * free the sock directly.
> -		 * sk_clone_lock() sets nsk refcnt to two, hence call sk_free()
> -		 * too.
> -		 */
> -		sk_common_release(nsk);
> -		sk_free(nsk);
> -		return NULL;
> -	}
> +	mptcp_token_accept(subflow_req, msk);
>  
>  	msk->write_seq = subflow_req->idsn + 1;
>  	atomic64_set(&msk->snd_una, msk->write_seq);
> @@ -1654,7 +1642,6 @@ void mptcp_finish_connect(struct sock *ssk)
>  	 */
>  	WRITE_ONCE(msk->remote_key, subflow->remote_key);
>  	WRITE_ONCE(msk->local_key, subflow->local_key);
> -	WRITE_ONCE(msk->token, subflow->token);
>  	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
>  	WRITE_ONCE(msk->ack_seq, ack_seq);
>  	WRITE_ONCE(msk->can_ack, 1);
> @@ -1738,6 +1725,7 @@ static struct proto mptcp_prot = {
>  	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
>  	.sysctl_mem	= sysctl_tcp_mem,
>  	.obj_size	= sizeof(struct mptcp_sock),
> +	.slab_flags	= SLAB_TYPESAFE_BY_RCU,

I wonder if you now need to be careful when allocating and zeroing the
socket, the same way it is done in sock_copy()?

In the out-of-tree kernel I had to take care of that by bringing back the
clear_sk callback in struct proto, which is then called from sk_prot_alloc():

https://github.com/multipath-tcp/mptcp/blob/b86461666ede4e6da195431dcf26cd454bc547fe/net/mptcp/mptcp_ctrl.c#L2867

https://github.com/multipath-tcp/mptcp/blob/f04a56b142b1cb209338392d563102837db4e2d4/net/core/sock.c#L1486
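
For MPTCP, an untested sketch of what such a hook could look like -
hypothetical, modeled on sk_prot_clear_nulls(): it zeroes the recycled
object while keeping the nulls 'next' pointers intact, so that lockless
walkers of the token bucket chain still reach a valid nulls marker:

	static void mptcp_clear_sk(struct sock *sk, int size)
	{
		struct mptcp_sock *msk = mptcp_sk(sk);

		/* preserve sk_node.next and token_node.next across the
		 * memsets; concurrent RCU walkers may still follow them
		 */
		memset(sk, 0, offsetof(struct sock, sk_node.next));
		memset(&sk->sk_node.pprev, 0,
		       offsetof(struct mptcp_sock, token_node.next) -
		       offsetof(struct sock, sk_node.pprev));
		memset(&msk->token_node.pprev, 0,
		       size - offsetof(struct mptcp_sock, token_node.pprev));
	}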

>  	.no_autobind	= true,
>  };
>  
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 038c0237cca0..6f18ad2db7a1 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -211,6 +211,7 @@ struct mptcp_sock {
>  	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
>  	struct sock	*first;
>  	struct mptcp_pm_data	pm;
> +	struct hlist_nulls_node token_node;
>  };
>  
>  #define mptcp_for_each_subflow(__msk, __subflow)			\
> @@ -255,6 +256,7 @@ struct mptcp_subflow_request_sock {
>  	u64	thmac;
>  	u32	local_nonce;
>  	u32	remote_nonce;
> +	struct hlist_nulls_node token_node;
>  };
>  
>  static inline struct mptcp_subflow_request_sock *
> @@ -386,7 +388,8 @@ void __init mptcp_token_init(void);
>  int mptcp_token_new_request(struct request_sock *req);
>  void mptcp_token_destroy_request(u32 token);
>  int mptcp_token_new_connect(struct sock *sk);
> -int mptcp_token_new_accept(u32 token, struct sock *conn);
> +void mptcp_token_accept(struct mptcp_subflow_request_sock *r,
> +			struct mptcp_sock *msk);
>  struct mptcp_sock *mptcp_token_get_sock(u32 token);
>  void mptcp_token_destroy(u32 token);
>  
> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> index 6f5b43afd5fd..88195510acf9 100644
> --- a/net/mptcp/subflow.c
> +++ b/net/mptcp/subflow.c
> @@ -31,11 +31,14 @@ static void SUBFLOW_REQ_INC_STATS(struct request_sock *req,
>  static int subflow_rebuild_header(struct sock *sk)
>  {
>  	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
> -	int local_id, err = 0;
> +	int local_id;
>  
>  	if (subflow->request_mptcp && !subflow->token) {
>  		pr_debug("subflow=%p", sk);
> -		err = mptcp_token_new_connect(sk);
> +		if (mptcp_token_new_connect(sk)) {
> +			subflow->mp_capable = 0;
> +			goto out;
> +		}
>  	} else if (subflow->request_join && !subflow->local_nonce) {
>  		struct mptcp_sock *msk = (struct mptcp_sock *)subflow->conn;
>  
> @@ -56,9 +59,6 @@ static int subflow_rebuild_header(struct sock *sk)
>  	}
>  
>  out:
> -	if (err)
> -		return err;
> -
>  	return subflow->icsk_af_ops->rebuild_header(sk);
>  }
>  
> diff --git a/net/mptcp/token.c b/net/mptcp/token.c
> index 33352dd99d4d..d4a4df3c7c42 100644
> --- a/net/mptcp/token.c
> +++ b/net/mptcp/token.c
> @@ -24,7 +24,7 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> -#include <linux/radix-tree.h>
> +#include <linux/memblock.h>
>  #include <linux/ip.h>
>  #include <linux/tcp.h>
>  #include <net/sock.h>
> @@ -33,10 +33,53 @@
>  #include <net/mptcp.h>
>  #include "protocol.h"
>  
> -static RADIX_TREE(token_tree, GFP_ATOMIC);
> -static RADIX_TREE(token_req_tree, GFP_ATOMIC);
> -static DEFINE_SPINLOCK(token_tree_lock);
> -static int token_used __read_mostly;
> +#define TOKEN_MAX_RETRIES	4
> +#define TOKEN_MAX_CHAIN_LEN	4
> +
> +struct token_bucket {
> +	spinlock_t		lock;
> +	int			chain_len;
> +	struct hlist_nulls_head	req_chain;
> +	struct hlist_nulls_head	msk_chain;
> +};
> +
> +static struct token_bucket *token_hash __read_mostly;
> +static unsigned int token_mask __read_mostly;
> +
> +static struct token_bucket *token_bucket(u32 token)
> +{
> +	return &token_hash[token & token_mask];
> +}
> +
> +static struct mptcp_subflow_request_sock *
> +__token_lookup_req(struct token_bucket *t, u32 token)
> +{
> +	struct mptcp_subflow_request_sock *req;
> +	struct hlist_nulls_node *pos;
> +
> +	hlist_nulls_for_each_entry_rcu(req, pos, &t->req_chain, token_node)
> +		if (req->token == token)
> +			return req;
> +	return NULL;
> +}
> +
> +static struct mptcp_sock *
> +__token_lookup_msk(struct token_bucket *t, u32 token)
> +{
> +	struct hlist_nulls_node *pos;
> +	struct mptcp_sock *msk;
> +
> +	hlist_nulls_for_each_entry_rcu(msk, pos, &t->msk_chain, token_node)
> +		if (msk->token == token)
> +			return msk;
> +	return NULL;
> +}
> +
> +bool __token_bucket_busy(struct token_bucket *t, u32 token)
> +{
> +	return !token || t->chain_len >= TOKEN_MAX_CHAIN_LEN ||
> +	       __token_lookup_req(t, token) || __token_lookup_msk(t, token);
> +}
>  
>  /**
>   * mptcp_token_new_request - create new key/idsn/token for subflow_request
> @@ -52,30 +95,32 @@ static int token_used __read_mostly;
>  int mptcp_token_new_request(struct request_sock *req)
>  {
>  	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
> -	int err;
> -
> -	while (1) {
> -		u32 token;
> -
> -		mptcp_crypto_key_gen_sha(&subflow_req->local_key,
> -					 &subflow_req->token,
> -					 &subflow_req->idsn);
> -		pr_debug("req=%p local_key=%llu, token=%u, idsn=%llu\n",
> -			 req, subflow_req->local_key, subflow_req->token,
> -			 subflow_req->idsn);
> -
> -		token = subflow_req->token;
> -		spin_lock_bh(&token_tree_lock);
> -		if (!radix_tree_lookup(&token_req_tree, token) &&
> -		    !radix_tree_lookup(&token_tree, token))
> -			break;
> -		spin_unlock_bh(&token_tree_lock);
> +	int retries = TOKEN_MAX_RETRIES;
> +	struct token_bucket *bucket;
> +	u32 token;
> +
> +again:
> +	mptcp_crypto_key_gen_sha(&subflow_req->local_key,
> +				 &subflow_req->token,
> +				 &subflow_req->idsn);
> +	pr_debug("req=%p local_key=%llu, token=%u, idsn=%llu\n",
> +		 req, subflow_req->local_key, subflow_req->token,
> +		 subflow_req->idsn);
> +
> +	token = subflow_req->token;
> +	bucket = token_bucket(token);
> +	spin_lock_bh(&bucket->lock);
> +	if (__token_bucket_busy(bucket, token)) {
> +		spin_unlock_bh(&bucket->lock);
> +		if (!--retries)
> +			return -EBUSY;
> +		goto again;
>  	}
>  
> -	err = radix_tree_insert(&token_req_tree,
> -				subflow_req->token, &token_used);
> -	spin_unlock_bh(&token_tree_lock);
> -	return err;
> +	hlist_nulls_add_tail_rcu(&subflow_req->token_node, &bucket->req_chain);
> +	bucket->chain_len++;
> +	spin_unlock_bh(&bucket->lock);
> +	return 0;
>  }
>  
>  /**
> @@ -97,48 +142,54 @@ int mptcp_token_new_request(struct request_sock *req)
>  int mptcp_token_new_connect(struct sock *sk)
>  {
>  	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
> -	struct sock *mptcp_sock = subflow->conn;
> -	int err;
> +	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
> +	int retries = TOKEN_MAX_RETRIES;
> +	struct token_bucket *bucket;
>  
> -	while (1) {
> -		u32 token;
> +	pr_debug("ssk=%p, local_key=%llu, token=%u, idsn=%llu\n",
> +		 sk, subflow->local_key, subflow->token, subflow->idsn);
>  
> -		mptcp_crypto_key_gen_sha(&subflow->local_key, &subflow->token,
> -					 &subflow->idsn);
> +again:
> +	mptcp_crypto_key_gen_sha(&subflow->local_key, &subflow->token,
> +				 &subflow->idsn);
>  
> -		pr_debug("ssk=%p, local_key=%llu, token=%u, idsn=%llu\n",
> -			 sk, subflow->local_key, subflow->token, subflow->idsn);
> -
> -		token = subflow->token;
> -		spin_lock_bh(&token_tree_lock);
> -		if (!radix_tree_lookup(&token_req_tree, token) &&
> -		    !radix_tree_lookup(&token_tree, token))
> -			break;
> -		spin_unlock_bh(&token_tree_lock);
> +	bucket = token_bucket(subflow->token);
> +	spin_lock_bh(&bucket->lock);
> +	if (__token_bucket_busy(bucket, subflow->token)) {
> +		spin_unlock_bh(&bucket->lock);
> +		if (!--retries)
> +			return -EBUSY;
> +		goto again;
>  	}
> -	err = radix_tree_insert(&token_tree, subflow->token, mptcp_sock);
> -	spin_unlock_bh(&token_tree_lock);
>  
> -	return err;
> +	WRITE_ONCE(msk->token, subflow->token);
> +	hlist_nulls_add_tail_rcu(&msk->token_node, &bucket->msk_chain);
> +	bucket->chain_len++;
> +	spin_unlock_bh(&bucket->lock);
> +	return 0;
>  }
>  
>  /**
> - * mptcp_token_new_accept - insert token for later processing
> - * @token: the token to insert to the tree
> - * @conn: the just cloned socket linked to the new connection
> + * mptcp_token_accept - replace a req sk with full sock in token hash
> + * @req: the request socket to be removed
> + * @msk: the just cloned socket linked to the new connection
>   *
>   * Called when a SYN packet creates a new logical connection, i.e.
>   * is not a join request.
>   */
> -int mptcp_token_new_accept(u32 token, struct sock *conn)
> +void mptcp_token_accept(struct mptcp_subflow_request_sock *req,
> +			struct mptcp_sock *msk)
>  {
> -	int err;
> -
> -	spin_lock_bh(&token_tree_lock);
> -	err = radix_tree_insert(&token_tree, token, conn);
> -	spin_unlock_bh(&token_tree_lock);
> +	struct mptcp_subflow_request_sock *pos;
> +	struct token_bucket *bucket;
>  
> -	return err;
> +	bucket = token_bucket(req->token);
> +	spin_lock_bh(&bucket->lock);
> +	pos = __token_lookup_req(bucket, req->token);
> +	if (!WARN_ON_ONCE(pos != req))
> +		hlist_nulls_del_rcu(&req->token_node);
> +	hlist_nulls_add_tail_rcu(&msk->token_node, &bucket->msk_chain);
> +	spin_unlock_bh(&bucket->lock);
>  }
>  
>  /**
> @@ -152,20 +203,33 @@ int mptcp_token_new_accept(u32 token, struct sock *conn)
>   */
>  struct mptcp_sock *mptcp_token_get_sock(u32 token)
>  {
> -	struct sock *conn;
> -
> -	spin_lock_bh(&token_tree_lock);
> -	conn = radix_tree_lookup(&token_tree, token);
> -	if (conn) {
> -		/* token still reserved? */
> -		if (conn == (struct sock *)&token_used)
> -			conn = NULL;
> -		else
> -			sock_hold(conn);
> +	struct hlist_nulls_node *pos;
> +	struct token_bucket *bucket;
> +	struct mptcp_sock *msk;
> +
> +	rcu_read_lock();
> +	bucket = token_bucket(token);
> +
> +again:
> +	hlist_nulls_for_each_entry_rcu(msk, pos, &bucket->msk_chain,
> +				       token_node) {
> +		if (READ_ONCE(msk->token) != token)
> +			continue;
> +		if (!refcount_inc_not_zero(&((struct sock *)msk)->sk_refcnt))
> +			goto not_found;
> +		if (READ_ONCE(msk->token) != token)
> +			goto again;

I think you need to drop the refcount when jumping back to "again",
otherwise the reference is leaked.
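
Something like this, I'd guess (untested):

		if (!refcount_inc_not_zero(&((struct sock *)msk)->sk_refcnt))
			goto not_found;
		if (READ_ONCE(msk->token) != token) {
			/* the msk was recycled for another connection while
			 * we were taking the reference: release it before
			 * retrying, otherwise the socket is leaked
			 */
			sock_put((struct sock *)msk);
			goto again;
		}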


Christoph

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-05-27 16:47 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-27 16:47 [MPTCP] Re: [PATCH 2/4] mptcp: refactor token container Christoph Paasch
  -- strict thread matches above, loose matches on Subject: below --
2020-05-27 11:00 Paolo Abeni
2020-05-26 16:28 Mat Martineau
2020-05-26 16:23 Christoph Paasch
2020-05-25 17:41 Paolo Abeni
2020-05-25 10:42 Paolo Abeni
2020-05-25  8:10 Paolo Abeni
2020-05-23  1:10 Mat Martineau
2020-05-23  0:34 Mat Martineau
2020-05-22 19:06 Christoph Paasch
2020-05-22 17:37 Paolo Abeni
2020-05-22 16:06 Christoph Paasch
