All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
To: <eric.dumazet@gmail.com>
Cc: <andrii@kernel.org>, <ast@kernel.org>, <benh@amazon.com>,
	<bpf@vger.kernel.org>, <daniel@iogearbox.net>,
	<davem@davemloft.net>, <edumazet@google.com>, <kafai@fb.com>,
	<kuba@kernel.org>, <kuni1840@gmail.com>, <kuniyu@amazon.co.jp>,
	<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>
Subject: Re: [PATCH v7 bpf-next 03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group.
Date: Fri, 11 Jun 2021 07:37:30 +0900	[thread overview]
Message-ID: <20210610223730.97716-1-kuniyu@amazon.co.jp> (raw)
In-Reply-To: <c89d2cec-4cf2-1972-354b-5f5ca1330d82@gmail.com>

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu, 10 Jun 2021 19:59:15 +0200
> On 5/21/21 8:20 PM, Kuniyuki Iwashima wrote:
> > When we close a listening socket, to migrate its connections to another
> > listener in the same reuseport group, we have to handle two kinds of child
> > sockets. One is that a listening socket has a reference to, and the other
> > is not.
> > 
> > The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
> > accept queue of their listening socket. So we can pop them out and push
> > them into another listener's queue at close() or shutdown() syscalls. On
> > the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
> > three-way handshake and not in the accept queue. Thus, we cannot access
> > such sockets at close() or shutdown() syscalls. Accordingly, we have to
> > migrate immature sockets after their listening socket has been closed.
> > 
> > Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
> > sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
> > that time, if we could select a new listener from the same reuseport group,
> > no connection would be aborted. However, we cannot do that because
> > reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
> > the reuseport group from closed sockets.
> > 
> > This patch allows TCP_CLOSE sockets to remain in the reuseport group and
> > access it while any child socket references them. The point is that
> > reuseport_detach_sock() was called twice from inet_unhash() and
> > sk_destruct(). This patch replaces the first reuseport_detach_sock() with
> > reuseport_stop_listen_sock(), which checks if the reuseport group is
> > capable of migration. If capable, it decrements num_socks, moves the socket
> > backwards in socks[] and increments num_closed_socks. When all connections
> > are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
> > socket from socks[], decrement num_closed_socks, and set NULL to
> > sk_reuseport_cb.
> > 
> > By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
> > Consequently, calling listen() after shutdown() can cause EADDRINUSE or
> > EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
> > such sockets not to have the reuseport group. Therefore, this patch also
> > loosens such validation rules so that a socket can listen again if it has a
> > reuseport group with num_closed_socks more than 0.
> > 
> > When such sockets listen again, we handle them in reuseport_resurrect(). If
> > there is an existing reuseport group (reuseport_add_sock() path), we move
> > the socket from the old group to the new one and free the old one if
> > necessary. If there is no existing group (reuseport_alloc() path), we
> > allocate a new reuseport group, detach sk from the old one, and free it if
> > necessary, not to break the current shutdown behaviour:
> > 
> >   - we cannot carry over the eBPF prog of shutdowned sockets
> >   - we cannot attach/detach an eBPF prog to/from listening sockets via
> >     shutdowned sockets
> > 
> > Note that when the number of sockets gets over U16_MAX, we try to detach a
> > closed socket randomly to make room for the new listening socket in
> > reuseport_grow().
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/net/sock_reuseport.h    |   1 +
> >  net/core/sock_reuseport.c       | 184 ++++++++++++++++++++++++++++++--
> >  net/ipv4/inet_connection_sock.c |  12 ++-
> >  net/ipv4/inet_hashtables.c      |   2 +-
> >  4 files changed, 188 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > index 0e558ca7afbf..1333d0cddfbc 100644
> > --- a/include/net/sock_reuseport.h
> > +++ b/include/net/sock_reuseport.h
> > @@ -32,6 +32,7 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> >  			      bool bind_inany);
> >  extern void reuseport_detach_sock(struct sock *sk);
> > +void reuseport_stop_listen_sock(struct sock *sk);
> >  extern struct sock *reuseport_select_sock(struct sock *sk,
> >  					  u32 hash,
> >  					  struct sk_buff *skb,
> > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > index 079bd1aca0e7..ea0e900d3e97 100644
> > --- a/net/core/sock_reuseport.c
> > +++ b/net/core/sock_reuseport.c
> > @@ -17,6 +17,8 @@
> >  DEFINE_SPINLOCK(reuseport_lock);
> >  
> >  static DEFINE_IDA(reuseport_ida);
> > +static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
> > +			       struct sock_reuseport *reuse, bool bind_inany);
> >  
> >  static int reuseport_sock_index(struct sock *sk,
> >  				struct sock_reuseport *reuse,
> > @@ -61,6 +63,29 @@ static bool __reuseport_detach_sock(struct sock *sk,
> >  	return true;
> >  }
> >  
> > +static void __reuseport_add_closed_sock(struct sock *sk,
> > +					struct sock_reuseport *reuse)
> > +{
> > +	reuse->socks[reuse->max_socks - reuse->num_closed_socks - 1] = sk;
> > +	/* paired with READ_ONCE() in inet_csk_bind_conflict() */
> > +	WRITE_ONCE(reuse->num_closed_socks, reuse->num_closed_socks + 1);
> > +}
> > +
> > +static bool __reuseport_detach_closed_sock(struct sock *sk,
> > +					   struct sock_reuseport *reuse)
> > +{
> > +	int i = reuseport_sock_index(sk, reuse, true);
> > +
> > +	if (i == -1)
> > +		return false;
> > +
> > +	reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
> > +	/* paired with READ_ONCE() in inet_csk_bind_conflict() */
> > +	WRITE_ONCE(reuse->num_closed_socks, reuse->num_closed_socks - 1);
> > +
> > +	return true;
> > +}
> > +
> >  static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
> >  {
> >  	unsigned int size = sizeof(struct sock_reuseport) +
> > @@ -92,6 +117,14 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
> >  	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
> >  					  lockdep_is_held(&reuseport_lock));
> >  	if (reuse) {
> > +		if (reuse->num_closed_socks) {
> > +			/* sk was shutdown()ed before */
> > +			int err = reuseport_resurrect(sk, reuse, NULL, bind_inany);
> > +
> > +			spin_unlock_bh(&reuseport_lock);
> > +			return err;
> 
> It seems coding style in this function would rather do
> 			ret = reuseport_resurrect(sk, reuse, NULL, bind_inany);
> 			goto out;
> 
> Overall, changes in this commit are a bit scarry.

I will change the style with ret and goto.

Thank you.

  reply	other threads:[~2021-06-10 22:37 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-21 18:20 [PATCH v7 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
2021-05-21 18:20 ` [PATCH v7 bpf-next 01/11] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
2021-06-10 17:24   ` Eric Dumazet
2021-06-10 22:31     ` Kuniyuki Iwashima
2021-05-21 18:20 ` [PATCH v7 bpf-next 02/11] tcp: Add num_closed_socks to struct sock_reuseport Kuniyuki Iwashima
2021-06-10 17:38   ` Eric Dumazet
2021-06-10 22:33     ` Kuniyuki Iwashima
2021-05-21 18:20 ` [PATCH v7 bpf-next 03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2021-06-10 17:59   ` Eric Dumazet
2021-06-10 22:37     ` Kuniyuki Iwashima [this message]
2021-05-21 18:20 ` [PATCH v7 bpf-next 04/11] tcp: Add reuseport_migrate_sock() to select a new listener Kuniyuki Iwashima
2021-06-10 18:09   ` Eric Dumazet
2021-06-10 22:39     ` Kuniyuki Iwashima
2021-05-21 18:20 ` [PATCH v7 bpf-next 05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2021-06-10 18:20   ` Eric Dumazet
2021-06-10 22:45     ` Kuniyuki Iwashima
2021-05-21 18:20 ` [PATCH v7 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs Kuniyuki Iwashima
2021-06-10 20:21   ` Eric Dumazet
2021-06-10 22:52     ` Kuniyuki Iwashima
2021-05-21 18:21 ` [PATCH v7 bpf-next 07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK Kuniyuki Iwashima
2021-06-10 20:36   ` Eric Dumazet
2021-06-10 22:56     ` Kuniyuki Iwashima
2021-05-21 18:21 ` [PATCH v7 bpf-next 08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2021-05-21 18:21 ` [PATCH v7 bpf-next 09/11] bpf: Support socket migration by eBPF Kuniyuki Iwashima
2021-05-21 18:21 ` [PATCH v7 bpf-next 10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2021-05-21 18:21 ` [PATCH v7 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
2021-05-26  6:42 ` [PATCH v7 bpf-next 00/11] Socket migration for SO_REUSEPORT Daniel Borkmann
2021-06-08  3:13   ` Alexei Starovoitov
2021-06-08 17:48   ` Yuchung Cheng
2021-06-08 23:03     ` Kuniyuki Iwashima
2021-06-08 23:47       ` Yuchung Cheng
2021-06-09  0:34         ` Kuniyuki Iwashima
2021-06-09 17:04           ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210610223730.97716-1-kuniyu@amazon.co.jp \
    --to=kuniyu@amazon.co.jp \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=benh@amazon.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=kafai@fb.com \
    --cc=kuba@kernel.org \
    --cc=kuni1840@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.