linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
@ 2021-04-27  3:46 Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 01/11] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation [1]. When a SYN packet is received, the connection is tied
to a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.
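
For illustration (this snippet is not part of the series), a minimal
SO_REUSEPORT listener looks like the following; if we create two such
sockets on the same port and close one of them, its pending children are
aborted today even though the other socket could accept them:

  /* Minimal user-space sketch: one of several sockets listening on
   * the same port via SO_REUSEPORT. Error handling is omitted.
   */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>

  static int reuseport_listener(unsigned short port)
  {
      struct sockaddr_in addr;
      int one = 1;
      int fd = socket(AF_INET, SOCK_STREAM, 0);

      setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port = htons(port);
      addr.sin_addr.s_addr = htonl(INADDR_ANY);

      bind(fd, (struct sockaddr *)&addr, sizeof(addr));
      listen(fd, 128);

      return fd;
  }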

This situation can happen when various server management tools restart
server processes (such as nginx). For instance, when we change nginx's
configuration and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, so the in-flight
final ACK of the 3WHS is answered with RST.

To avoid such a situation, users have to know in depth how the kernel
handles SYN packets and implement connection draining by eBPF [2],
sketched after the lists below:

  1. Stop routing SYN packets to the listener by eBPF.
  2. Wait for all timers to expire so that in-flight requests complete.
  3. Accept connections until EAGAIN, then close the listener.

  or

  1. Start counting SYN packets and accept() syscalls using an eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.
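
A condensed sketch of the first approach (illustrative only; the map
layout and slot assignment are assumptions, not from this series):

  /* Hypothetical drain program: user space sets the "draining" flag
   * before closing the old listener in slot 0, after which new SYNs
   * are steered to the new listener in slot 1.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
      __uint(type, BPF_MAP_TYPE_ARRAY);
      __uint(max_entries, 1);
      __type(key, __u32);
      __type(value, __u32);
  } draining SEC(".maps");

  struct {
      __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
      __uint(max_entries, 2);
      __type(key, __u32);
      __type(value, __u64);
  } listeners SEC(".maps");

  SEC("sk_reuseport")
  int drain_old_listener(struct sk_reuseport_md *reuse_md)
  {
      __u32 zero = 0, key = 0;
      __u32 *flag = bpf_map_lookup_elem(&draining, &zero);

      if (flag && *flag)
          key = 1;    /* old listener is draining, use the new one */

      if (bpf_sk_select_reuseport(reuse_md, &listeners, &key, 0))
          return SK_DROP;

      return SK_PASS;
  }

  char _license[] SEC("license") = "GPL";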

Either way, we cannot close a listener immediately. However, ideally,
the application need not drain the not-yet-accepted sockets because the
3WHS and tying a connection to a listener are purely kernel behaviour.
The root cause is within the kernel, so the issue should be addressed
in kernel space and should not be visible to user space. This patchset
fixes it so that users need not take care of the kernel implementation
or connection draining. With this patchset, the kernel redistributes
requests and connections from a listener to the others in the same
reuseport group at/after close or shutdown syscalls.

Although some software does connection draining, there are still merits
in migration. For security reasons, such as replacing TLS certificates,
we may want to apply new settings as soon as possible and/or we may not
be able to wait for connection draining. The sockets in the accept queue
have not started application sessions yet, so if we migrate them instead
of draining them, the newer listeners can handle them and they can have
a longer lifetime. It is difficult to drain all connections in every
case, but we can decrease the number of aborted connections by
migration. In that sense, migration is always better than draining.

Moreover, auto-migration simplifies user-space logic and also works well
in cases where we cannot modify and rebuild a server program to implement
the workaround.

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
errors. In such a case, we have to use an eBPF program to select a specific
listener or to cancel migration.

Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
snippets along the way.


Link:
 [1] The SO_REUSEPORT socket option
 https://lwn.net/Articles/542629/

 [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
 https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/


Changelog:
 v4:
  * Make some functions and variables 'static' in selftest
  * Remove 'scalability' from the cover letter because it is not
    the primary reason to use SO_REUSEPORT

 v3:
 https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/
  * Add sysctl back for reuseport_grow()
  * Add helper functions to manage socks[]
  * Separate migration related logic into functions: reuseport_resurrect(),
    reuseport_stop_listen_sock(), reuseport_migrate_sock()
  * Clone request_sock to be migrated
  * Migrate request one by one
  * Pass child socket to eBPF prog

 v2:
 https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
  * Do not save closed sockets in socks[]
  * Revert 607904c357c61adf20b8fd18af765e501d61a385
  * Extract inet_csk_reqsk_queue_migrate() into a single patch
  * Change the spin_lock order to avoid lockdep warning
  * Add static to __reuseport_select_sock
  * Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
  * Set the default attach type in bpf_prog_load_check_attach()
  * Define new proto of BPF_FUNC_get_socket_cookie
  * Fix test to be compiled successfully
  * Update commit messages

 v1:
 https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
  * Remove the sysctl option
  * Enable migration if an eBPF program is not attached
  * Add expected_attach_type to check if eBPF program can migrate sockets
  * Add a field to tell migration type to eBPF program
  * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
  * Allocate an empty skb if skb is NULL
  * Pass req_to_sk(req)->sk_hash because listener's hash is zero
  * Update commit messages and cover letter

 RFC:
 https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/


Kuniyuki Iwashima (11):
  net: Introduce net.ipv4.tcp_migrate_req.
  tcp: Add num_closed_socks to struct sock_reuseport.
  tcp: Keep TCP_CLOSE sockets in the reuseport group.
  tcp: Add reuseport_migrate_sock() to select a new listener.
  tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
  tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
  bpf: Support BPF_FUNC_get_socket_cookie() for
    BPF_PROG_TYPE_SK_REUSEPORT.
  bpf: Support socket migration by eBPF.
  libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
  bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

 Documentation/networking/ip-sysctl.rst        |  20 +
 include/linux/bpf.h                           |   1 +
 include/linux/filter.h                        |   2 +
 include/net/netns/ipv4.h                      |   1 +
 include/net/request_sock.h                    |   2 +
 include/net/sock_reuseport.h                  |   9 +-
 include/uapi/linux/bpf.h                      |  16 +
 kernel/bpf/syscall.c                          |  13 +
 net/core/filter.c                             |  23 +-
 net/core/request_sock.c                       |  38 ++
 net/core/sock_reuseport.c                     | 337 ++++++++++--
 net/ipv4/inet_connection_sock.c               | 147 +++++-
 net/ipv4/inet_hashtables.c                    |   2 +-
 net/ipv4/sysctl_net_ipv4.c                    |   9 +
 net/ipv4/tcp_ipv4.c                           |  20 +-
 net/ipv6/tcp_ipv6.c                           |  14 +-
 tools/include/uapi/linux/bpf.h                |  16 +
 tools/lib/bpf/libbpf.c                        |   5 +-
 tools/testing/selftests/bpf/network_helpers.c |   2 +-
 tools/testing/selftests/bpf/network_helpers.h |   1 +
 .../bpf/prog_tests/migrate_reuseport.c        | 484 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport.c        |  51 ++
 22 files changed, 1151 insertions(+), 62 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c

-- 
2.30.2



* [PATCH v4 bpf-next 01/11] net: Introduce net.ipv4.tcp_migrate_req.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 02/11] tcp: Add num_closed_socks to struct sock_reuseport Kuniyuki Iwashima
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled or an eBPF program is attached, we will be able to
migrate child sockets from one listener to another in the same reuseport
group after close() or shutdown() syscalls.
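
Later patches in this series gate migration on this knob per netns,
e.g. (condensed from the reuseport_stop_listen_sock() change in a later
patch):

  if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) {
      /* migration capable: keep the closed sk in the group
       * as a migration source
       */
  }

The knob itself is toggled as usual, e.g. with
sysctl -w net.ipv4.tcp_migrate_req=1.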

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
---
 Documentation/networking/ip-sysctl.rst | 20 ++++++++++++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 3 files changed, 30 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index c2ecc9894fd0..8e92f9b28aad 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -732,6 +732,26 @@ tcp_syncookies - INTEGER
 	network connections you can set this knob to 2 to enable
 	unconditionally generation of syncookies.
 
+tcp_migrate_req - INTEGER
+	The incoming connection is tied to a specific listening socket when
+	the initial SYN packet is received during the three-way handshake.
+	When a listener is closed, in-flight request sockets during the
+	handshake and established sockets in the accept queue are aborted.
+
+	If the listener has SO_REUSEPORT enabled, other listeners on the
+	same port should have been able to accept such connections. This
+	option makes it possible to migrate such child sockets to another
+	listener after close() or shutdown().
+
+	Default: 0
+
+	Note that the source and destination listeners MUST have the same
+	settings at the socket API level. If different applications listen
+	on the same port, disable this option or attach the
+	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program to select
+	the correct socket by bpf_sk_select_reuseport() or to cancel
+	migration by returning SK_DROP.
+
 tcp_fastopen - INTEGER
 	Enable TCP Fast Open (RFC7413) to send and accept data in the opening
 	SYN packet.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index f6af8d96d3c6..fd63c91addcc 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -126,6 +126,7 @@ struct netns_ipv4 {
 	u8 sysctl_tcp_syn_retries;
 	u8 sysctl_tcp_synack_retries;
 	u8 sysctl_tcp_syncookies;
+	u8 sysctl_tcp_migrate_req;
 	int sysctl_tcp_reordering;
 	u8 sysctl_tcp_retries1;
 	u8 sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a62934b9f15a..7bb013fcbf5f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -940,6 +940,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dou8vec_minmax,
 	},
 #endif
+	{
+		.procname	= "tcp_migrate_req",
+		.data		= &init_net.ipv4.sysctl_tcp_migrate_req,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 	{
 		.procname	= "tcp_reordering",
 		.data		= &init_net.ipv4.sysctl_tcp_reordering,
-- 
2.30.2



* [PATCH v4 bpf-next 02/11] tcp: Add num_closed_socks to struct sock_reuseport.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 01/11] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

As noted in the following commit, a closed listener has to hold the
reference to the reuseport group for socket migration. This patch adds a
field (num_closed_socks) to struct sock_reuseport to manage closed sockets
within the same reuseport group. Moreover, this and the following commits
introduce some helper functions to split socks[] into two sections and keep
TCP_LISTEN and TCP_CLOSE sockets in separate sections. Like a double-ended
queue, we place TCP_LISTEN sockets from the front and TCP_CLOSE sockets
from the end.

  TCP_LISTEN---------->       <-------TCP_CLOSE
  +---+---+  ---  +---+  ---  +---+  ---  +---+
  | 0 | 1 |  ...  | i |  ...  | j |  ...  | k |
  +---+---+  ---  +---+  ---  +---+  ---  +---+

  i = num_socks - 1
  j = max_socks - num_closed_socks
  k = max_socks - 1
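
For example, with max_socks == 8, num_socks == 3 and num_closed_socks == 2,
the listeners occupy slots 0-2 (i == 2), the closed sockets occupy slots
6-7 (j == 6, k == 7), and slots 3-5 are free.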

This patch also extends reuseport_add_sock() and reuseport_grow() to
support num_closed_socks.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/sock_reuseport.h |  5 ++-
 net/core/sock_reuseport.c    | 76 +++++++++++++++++++++++++++---------
 2 files changed, 60 insertions(+), 21 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 505f1e18e9bf..0e558ca7afbf 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
 struct sock_reuseport {
 	struct rcu_head		rcu;
 
-	u16			max_socks;	/* length of socks */
-	u16			num_socks;	/* elements in socks */
+	u16			max_socks;		/* length of socks */
+	u16			num_socks;		/* elements in socks */
+	u16			num_closed_socks;	/* closed elements in socks */
 	/* The last synq overflow event timestamp of this
 	 * reuse->socks[] group.
 	 */
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index b065f0a103ed..079bd1aca0e7 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -18,6 +18,49 @@ DEFINE_SPINLOCK(reuseport_lock);
 
 static DEFINE_IDA(reuseport_ida);
 
+static int reuseport_sock_index(struct sock *sk,
+				struct sock_reuseport *reuse,
+				bool closed)
+{
+	int left, right;
+
+	if (!closed) {
+		left = 0;
+		right = reuse->num_socks;
+	} else {
+		left = reuse->max_socks - reuse->num_closed_socks;
+		right = reuse->max_socks;
+	}
+
+	for (; left < right; left++)
+		if (reuse->socks[left] == sk)
+			return left;
+	return -1;
+}
+
+static void __reuseport_add_sock(struct sock *sk,
+				 struct sock_reuseport *reuse)
+{
+	reuse->socks[reuse->num_socks] = sk;
+	/* paired with smp_rmb() in reuseport_select_sock() */
+	smp_wmb();
+	reuse->num_socks++;
+}
+
+static bool __reuseport_detach_sock(struct sock *sk,
+				    struct sock_reuseport *reuse)
+{
+	int i = reuseport_sock_index(sk, reuse, false);
+
+	if (i == -1)
+		return false;
+
+	reuse->socks[i] = reuse->socks[reuse->num_socks - 1];
+	reuse->num_socks--;
+
+	return true;
+}
+
 static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 {
 	unsigned int size = sizeof(struct sock_reuseport) +
@@ -72,9 +115,8 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 	}
 
 	reuse->reuseport_id = id;
-	reuse->socks[0] = sk;
-	reuse->num_socks = 1;
 	reuse->bind_inany = bind_inany;
+	__reuseport_add_sock(sk, reuse);
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
 
 out:
@@ -98,6 +140,7 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 		return NULL;
 
 	more_reuse->num_socks = reuse->num_socks;
+	more_reuse->num_closed_socks = reuse->num_closed_socks;
 	more_reuse->prog = reuse->prog;
 	more_reuse->reuseport_id = reuse->reuseport_id;
 	more_reuse->bind_inany = reuse->bind_inany;
@@ -105,9 +148,13 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 
 	memcpy(more_reuse->socks, reuse->socks,
 	       reuse->num_socks * sizeof(struct sock *));
+	memcpy(more_reuse->socks +
+	       (more_reuse->max_socks - more_reuse->num_closed_socks),
+	       reuse->socks + reuse->num_socks,
+	       reuse->num_closed_socks * sizeof(struct sock *));
 	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
 
-	for (i = 0; i < reuse->num_socks; ++i)
+	for (i = 0; i < reuse->max_socks; ++i)
 		rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
 				   more_reuse);
 
@@ -158,7 +205,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 		return -EBUSY;
 	}
 
-	if (reuse->num_socks == reuse->max_socks) {
+	if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
 		reuse = reuseport_grow(reuse);
 		if (!reuse) {
 			spin_unlock_bh(&reuseport_lock);
@@ -166,10 +213,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 		}
 	}
 
-	reuse->socks[reuse->num_socks] = sk;
-	/* paired with smp_rmb() in reuseport_select_sock() */
-	smp_wmb();
-	reuse->num_socks++;
+	__reuseport_add_sock(sk, reuse);
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
 
 	spin_unlock_bh(&reuseport_lock);
@@ -183,7 +227,6 @@ EXPORT_SYMBOL(reuseport_add_sock);
 void reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
-	int i;
 
 	spin_lock_bh(&reuseport_lock);
 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
@@ -200,16 +243,11 @@ void reuseport_detach_sock(struct sock *sk)
 	bpf_sk_reuseport_detach(sk);
 
 	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
+	__reuseport_detach_sock(sk, reuse);
+
+	if (reuse->num_socks + reuse->num_closed_socks == 0)
+		call_rcu(&reuse->rcu, reuseport_free_rcu);
 
-	for (i = 0; i < reuse->num_socks; i++) {
-		if (reuse->socks[i] == sk) {
-			reuse->socks[i] = reuse->socks[reuse->num_socks - 1];
-			reuse->num_socks--;
-			if (reuse->num_socks == 0)
-				call_rcu(&reuse->rcu, reuseport_free_rcu);
-			break;
-		}
-	}
 	spin_unlock_bh(&reuseport_lock);
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
@@ -274,7 +312,7 @@ struct sock *reuseport_select_sock(struct sock *sk,
 	prog = rcu_dereference(reuse->prog);
 	socks = READ_ONCE(reuse->num_socks);
 	if (likely(socks)) {
-		/* paired with smp_wmb() in reuseport_add_sock() */
+		/* paired with smp_wmb() in __reuseport_add_sock() */
 		smp_rmb();
 
 		if (!prog || !skb)
-- 
2.30.2



* [PATCH v4 bpf-next 03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 01/11] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 02/11] tcp: Add num_closed_socks to struct sock_reuseport Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 04/11] tcp: Add reuseport_migrate_sock() to select a new listener Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets: one kind the listening socket holds a reference to, and the other
it does not.

The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets, which sit in the
accept queue of their listening socket, so we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. The
latter, the TCP_NEW_SYN_RECV sockets, are still in the middle of the
three-way handshake and not in the accept queue; thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets sk_reuseport_cb to NULL and forbids access to
the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
to be accessed while any child socket references them. The point is that
reuseport_detach_sock() was called twice, from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[], and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set sk_reuseport_cb
to NULL.

By this change, closed or shutdown()ed sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock(), which expect
such sockets not to belong to a reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks greater than 0.

When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (the reuseport_add_sock() path), we
move the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (the reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, so as not to break the current shutdown behaviour:

  - we cannot carry over the eBPF prog of shutdowned sockets
  - we cannot attach an eBPF prog to listening sockets via shutdowned
    sockets

Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().
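
From user space, the sequence this patch keeps working is simply the
following (sketch; error handling trimmed):

  #include <sys/socket.h>

  static int restart_listener(int fd)
  {
      /* With tcp_migrate_req enabled, shutdown() parks the socket
       * in the closed section of socks[] instead of detaching it.
       */
      if (shutdown(fd, SHUT_RDWR))
          return -1;

      /* listen() again triggers reuseport_resurrect(); before this
       * patch, it could fail with EADDRINUSE or EBUSY.
       */
      return listen(fd, 128);
  }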

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/sock_reuseport.h    |   1 +
 net/core/sock_reuseport.c       | 159 +++++++++++++++++++++++++++++++-
 net/ipv4/inet_connection_sock.c |  12 ++-
 net/ipv4/inet_hashtables.c      |   2 +-
 4 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 0e558ca7afbf..1333d0cddfbc 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,6 +32,7 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
 extern void reuseport_detach_sock(struct sock *sk);
+void reuseport_stop_listen_sock(struct sock *sk);
 extern struct sock *reuseport_select_sock(struct sock *sk,
 					  u32 hash,
 					  struct sk_buff *skb,
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 079bd1aca0e7..d5fb0ad12e87 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -17,6 +17,8 @@
 DEFINE_SPINLOCK(reuseport_lock);
 
 static DEFINE_IDA(reuseport_ida);
+static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
+			       struct sock_reuseport *reuse, bool bind_inany);
 
 static int reuseport_sock_index(struct sock *sk,
 				struct sock_reuseport *reuse,
@@ -61,6 +63,29 @@ static bool __reuseport_detach_sock(struct sock *sk,
 	return true;
 }
 
+static void __reuseport_add_closed_sock(struct sock *sk,
+					struct sock_reuseport *reuse)
+{
+	reuse->socks[reuse->max_socks - reuse->num_closed_socks - 1] = sk;
+	/* paired with READ_ONCE() in inet_csk_bind_conflict() */
+	WRITE_ONCE(reuse->num_closed_socks, reuse->num_closed_socks + 1);
+}
+
+static bool __reuseport_detach_closed_sock(struct sock *sk,
+					   struct sock_reuseport *reuse)
+{
+	int i = reuseport_sock_index(sk, reuse, true);
+
+	if (i == -1)
+		return false;
+
+	reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+	/* paired with READ_ONCE() in inet_csk_bind_conflict() */
+	WRITE_ONCE(reuse->num_closed_socks, reuse->num_closed_socks - 1);
+
+	return true;
+}
+
 static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 {
 	unsigned int size = sizeof(struct sock_reuseport) +
@@ -92,6 +117,14 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					  lockdep_is_held(&reuseport_lock));
 	if (reuse) {
+		if (reuse->num_closed_socks) {
+			/* sk was shutdown()ed before */
+			int err = reuseport_resurrect(sk, reuse, NULL, bind_inany);
+
+			spin_unlock_bh(&reuseport_lock);
+			return err;
+		}
+
 		/* Only set reuse->bind_inany if the bind_inany is true.
 		 * Otherwise, it will overwrite the reuse->bind_inany
 		 * which was set by the bind/hash path.
@@ -132,8 +165,23 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 	u32 more_socks_size, i;
 
 	more_socks_size = reuse->max_socks * 2U;
-	if (more_socks_size > U16_MAX)
+	if (more_socks_size > U16_MAX) {
+		if (reuse->num_closed_socks) {
+			/* Make room by removing a closed sk.
+			 * The child has already been migrated.
+			 * Only reqsk left at this point.
+			 */
+			struct sock *sk;
+
+			sk = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+			RCU_INIT_POINTER(sk->sk_reuseport_cb, NULL);
+			__reuseport_detach_closed_sock(sk, reuse);
+
+			return reuse;
+		}
+
 		return NULL;
+	}
 
 	more_reuse = __reuseport_alloc(more_socks_size);
 	if (!more_reuse)
@@ -199,7 +247,15 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 	reuse = rcu_dereference_protected(sk2->sk_reuseport_cb,
 					  lockdep_is_held(&reuseport_lock));
 	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
-					     lockdep_is_held(&reuseport_lock));
+					      lockdep_is_held(&reuseport_lock));
+	if (old_reuse && old_reuse->num_closed_socks) {
+		/* sk was shutdown()ed before */
+		int err = reuseport_resurrect(sk, old_reuse, reuse, reuse->bind_inany);
+
+		spin_unlock_bh(&reuseport_lock);
+		return err;
+	}
+
 	if (old_reuse && old_reuse->num_socks != 1) {
 		spin_unlock_bh(&reuseport_lock);
 		return -EBUSY;
@@ -224,6 +280,65 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 }
 EXPORT_SYMBOL(reuseport_add_sock);
 
+static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
+			       struct sock_reuseport *reuse, bool bind_inany)
+{
+	if (old_reuse == reuse) {
+		/* If sk was in the same reuseport group, just pop sk out of
+		 * the closed section and push sk into the listening section.
+		 */
+		__reuseport_detach_closed_sock(sk, old_reuse);
+		__reuseport_add_sock(sk, old_reuse);
+		return 0;
+	}
+
+	if (!reuse) {
+		/* In bind()/listen() path, we cannot carry over the eBPF prog
+		 * for the shutdown()ed socket. In setsockopt() path, we should
+		 * not change the eBPF prog of listening sockets by attaching a
+		 * prog to the shutdown()ed socket. Thus, we will allocate a new
+		 * reuseport group and detach sk from the old group.
+		 */
+		int id;
+
+		reuse = __reuseport_alloc(INIT_SOCKS);
+		if (!reuse)
+			return -ENOMEM;
+
+		id = ida_alloc(&reuseport_ida, GFP_ATOMIC);
+		if (id < 0) {
+			kfree(reuse);
+			return id;
+		}
+
+		reuse->reuseport_id = id;
+		reuse->bind_inany = bind_inany;
+	} else {
+		/* Move sk from the old group to the new one if
+		 * - all the other listeners in the old group were close()d or
+		 *   shutdown()ed, and then sk2 has listen()ed on the same port
+		 * OR
+		 * - sk listen()ed without bind() (or with autobind), was
+		 *   shutdown()ed, and then listen()s on another port which
+		 *   sk2 listen()s on.
+		 */
+		if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
+			reuse = reuseport_grow(reuse);
+			if (!reuse)
+				return -ENOMEM;
+		}
+	}
+
+	__reuseport_detach_closed_sock(sk, old_reuse);
+	__reuseport_add_sock(sk, reuse);
+	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
+
+	if (old_reuse->num_socks + old_reuse->num_closed_socks == 0)
+		call_rcu(&old_reuse->rcu, reuseport_free_rcu);
+
+	return 0;
+}
+
 void reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
@@ -232,6 +347,10 @@ void reuseport_detach_sock(struct sock *sk)
 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					  lockdep_is_held(&reuseport_lock));
 
+	/* reuseport_grow() has detached a closed sk */
+	if (!reuse)
+		goto out;
+
 	/* Notify the bpf side. The sk may be added to a sockarray
 	 * map. If so, sockarray logic will remove it from the map.
 	 *
@@ -243,15 +362,49 @@ void reuseport_detach_sock(struct sock *sk)
 	bpf_sk_reuseport_detach(sk);
 
 	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
-	__reuseport_detach_sock(sk, reuse);
+
+	if (!__reuseport_detach_closed_sock(sk, reuse))
+		__reuseport_detach_sock(sk, reuse);
 
 	if (reuse->num_socks + reuse->num_closed_socks == 0)
 		call_rcu(&reuse->rcu, reuseport_free_rcu);
 
+out:
 	spin_unlock_bh(&reuseport_lock);
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
 
+void reuseport_stop_listen_sock(struct sock *sk)
+{
+	if (sk->sk_protocol == IPPROTO_TCP) {
+		struct sock_reuseport *reuse;
+
+		spin_lock_bh(&reuseport_lock);
+
+		reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
+						  lockdep_is_held(&reuseport_lock));
+
+		if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) {
+			/* Migration capable, move sk from the listening section
+			 * to the closed section.
+			 */
+			bpf_sk_reuseport_detach(sk);
+
+			__reuseport_detach_sock(sk, reuse);
+			__reuseport_add_closed_sock(sk, reuse);
+
+			spin_unlock_bh(&reuseport_lock);
+			return;
+		}
+
+		spin_unlock_bh(&reuseport_lock);
+	}
+
+	/* Not capable to do migration, detach immediately */
+	reuseport_detach_sock(sk);
+}
+EXPORT_SYMBOL(reuseport_stop_listen_sock);
+
 static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
 				   struct bpf_prog *prog, struct sk_buff *skb,
 				   int hdr_len)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fd472eae4f5c..fa806e9167ec 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -135,10 +135,18 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 				  bool relax, bool reuseport_ok)
 {
 	struct sock *sk2;
+	bool reuseport_cb_ok;
 	bool reuse = sk->sk_reuse;
 	bool reuseport = !!sk->sk_reuseport;
+	struct sock_reuseport *reuseport_cb;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
 
+	rcu_read_lock();
+	reuseport_cb = rcu_dereference(sk->sk_reuseport_cb);
+	/* paired with WRITE_ONCE() in __reuseport_(add|detach)_closed_sock */
+	reuseport_cb_ok = !reuseport_cb || READ_ONCE(reuseport_cb->num_closed_socks);
+	rcu_read_unlock();
+
 	/*
 	 * Unlike other sk lookup places we do not check
 	 * for sk_net here, since _all_ the socks listed
@@ -156,14 +164,14 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 				if ((!relax ||
 				     (!reuseport_ok &&
 				      reuseport && sk2->sk_reuseport &&
-				      !rcu_access_pointer(sk->sk_reuseport_cb) &&
+				      reuseport_cb_ok &&
 				      (sk2->sk_state == TCP_TIME_WAIT ||
 				       uid_eq(uid, sock_i_uid(sk2))))) &&
 				    inet_rcv_saddr_equal(sk, sk2, true))
 					break;
 			} else if (!reuseport_ok ||
 				   !reuseport || !sk2->sk_reuseport ||
-				   rcu_access_pointer(sk->sk_reuseport_cb) ||
+				   !reuseport_cb_ok ||
 				   (sk2->sk_state != TCP_TIME_WAIT &&
 				    !uid_eq(uid, sock_i_uid(sk2)))) {
 				if (inet_rcv_saddr_equal(sk, sk2, true))
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index c96866a53a66..80aeaf9e6e16 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -697,7 +697,7 @@ void inet_unhash(struct sock *sk)
 		goto unlock;
 
 	if (rcu_access_pointer(sk->sk_reuseport_cb))
-		reuseport_detach_sock(sk);
+		reuseport_stop_listen_sock(sk);
 	if (ilb) {
 		inet_unhash2(hashinfo, sk);
 		ilb->count--;
-- 
2.30.2



* [PATCH v4 bpf-next 04/11] tcp: Add reuseport_migrate_sock() to select a new listener.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

reuseport_migrate_sock() does the same check done in
reuseport_stop_listen_sock(). If the reuseport group is capable of
migration, reuseport_migrate_sock() selects a new listener by the child
socket hash and increments the listener's sk_refcnt beforehand. Thus, if we
fail in the migration, we have to decrement it later.

We will support migration by eBPF in the later commits.
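
The resulting caller contract, condensed from the later patches in this
series:

  nsk = reuseport_migrate_sock(sk, req_to_sk(req), NULL);
  if (nsk) {
      /* nsk is returned with sk_refcnt already incremented, so
       * every failure path from here must sock_put() it;
       * reqsk_clone() does so itself when allocation fails.
       */
      nreq = reqsk_clone(req, nsk);
      ...
  }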

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/sock_reuseport.h |  3 ++
 net/core/sock_reuseport.c    | 78 +++++++++++++++++++++++++++++-------
 2 files changed, 67 insertions(+), 14 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 1333d0cddfbc..473b0b0fa4ab 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -37,6 +37,9 @@ extern struct sock *reuseport_select_sock(struct sock *sk,
 					  u32 hash,
 					  struct sk_buff *skb,
 					  int hdr_len);
+struct sock *reuseport_migrate_sock(struct sock *sk,
+				    struct sock *migrating_sk,
+				    struct sk_buff *skb);
 extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
 extern int reuseport_detach_prog(struct sock *sk);
 
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index d5fb0ad12e87..a2bca39ec0e3 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -44,7 +44,7 @@ static void __reuseport_add_sock(struct sock *sk,
 				 struct sock_reuseport *reuse)
 {
 	reuse->socks[reuse->num_socks] = sk;
-	/* paired with smp_rmb() in reuseport_select_sock() */
+	/* paired with smp_rmb() in reuseport_(select|migrate)_sock() */
 	smp_wmb();
 	reuse->num_socks++;
 }
@@ -435,6 +435,23 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
 	return reuse->socks[index];
 }
 
+static struct sock *reuseport_select_sock_by_hash(struct sock_reuseport *reuse,
+						  u32 hash, u16 num_socks)
+{
+	int i, j;
+
+	i = j = reciprocal_scale(hash, num_socks);
+	while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
+		i++;
+		if (i >= num_socks)
+			i = 0;
+		if (i == j)
+			return NULL;
+	}
+
+	return reuse->socks[i];
+}
+
 /**
  *  reuseport_select_sock - Select a socket from an SO_REUSEPORT group.
  *  @sk: First socket in the group.
@@ -478,19 +495,8 @@ struct sock *reuseport_select_sock(struct sock *sk,
 
 select_by_hash:
 		/* no bpf or invalid bpf result: fall back to hash usage */
-		if (!sk2) {
-			int i, j;
-
-			i = j = reciprocal_scale(hash, socks);
-			while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
-				i++;
-				if (i >= socks)
-					i = 0;
-				if (i == j)
-					goto out;
-			}
-			sk2 = reuse->socks[i];
-		}
+		if (!sk2)
+			sk2 = reuseport_select_sock_by_hash(reuse, hash, socks);
 	}
 
 out:
@@ -499,6 +505,50 @@ struct sock *reuseport_select_sock(struct sock *sk,
 }
 EXPORT_SYMBOL(reuseport_select_sock);
 
+/**
+ *  reuseport_migrate_sock - Select a socket from an SO_REUSEPORT group.
+ *  @sk: close()ed or shutdown()ed socket in the group.
+ *  @migrating_sk: ESTABLISHED/SYN_RECV full socket in the accept queue or
+ *    NEW_SYN_RECV request socket during 3WHS.
+ *  @skb: skb to run through BPF filter.
+ *  Returns a socket (with sk_refcnt +1) that should accept the child socket
+ *  (or NULL on error).
+ */
+struct sock *reuseport_migrate_sock(struct sock *sk,
+				    struct sock *migrating_sk,
+				    struct sk_buff *skb)
+{
+	struct sock_reuseport *reuse;
+	struct sock *nsk = NULL;
+	u16 socks;
+	u32 hash;
+
+	rcu_read_lock();
+
+	reuse = rcu_dereference(sk->sk_reuseport_cb);
+	if (!reuse)
+		goto out;
+
+	socks = READ_ONCE(reuse->num_socks);
+	if (unlikely(!socks))
+		goto out;
+
+	/* paired with smp_wmb() in __reuseport_add_sock() */
+	smp_rmb();
+
+	hash = migrating_sk->sk_hash;
+	if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req)
+		nsk = reuseport_select_sock_by_hash(reuse, hash, socks);
+
+	if (nsk && unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt)))
+		nsk = NULL;
+
+out:
+	rcu_read_unlock();
+	return nsk;
+}
+EXPORT_SYMBOL(reuseport_migrate_sock);
+
 int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
 {
 	struct sock_reuseport *reuse;
-- 
2.30.2



* [PATCH v4 bpf-next 05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 04/11] tcp: Add reuseport_migrate_sock() to select a new listener Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

When we call close() or shutdown() for a listening socket, each child
socket in the accept queue is freed at inet_csk_listen_stop(). If we can
get a new listener by reuseport_migrate_sock() and clone the request by
reqsk_clone(), we try to add it into the new listener's accept queue by
inet_csk_reqsk_queue_add(). If it fails, we have to call __reqsk_free() to
sock_put() its listener and free the cloned request.

After putting the full socket into ehash, tcp_v[46]_syn_recv_sock() sets
ireq_opt/pktopts in struct inet_request_sock to NULL, but ipv6_opt can be
non-NULL. So, we have to set ipv6_opt of the old request to NULL to avoid a
double free.

Note that we do not update req->rsk_listener and instead clone the req to
migrate because another path may reference the original request. If we
protected it by RCU, we would need to add rcu_read_lock() in many places.

Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/request_sock.h      |  2 ++
 net/core/request_sock.c         | 37 +++++++++++++++++++++++++++++++++
 net/ipv4/inet_connection_sock.c | 31 ++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..c6d6cfd3c93b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -190,6 +190,8 @@ void reqsk_queue_alloc(struct request_sock_queue *queue);
 void reqsk_fastopen_remove(struct sock *sk, struct request_sock *req,
 			   bool reset);
 
+struct request_sock *reqsk_clone(struct request_sock *req, struct sock *sk);
+
 static inline bool reqsk_queue_empty(const struct request_sock_queue *queue)
 {
 	return READ_ONCE(queue->rskq_accept_head) == NULL;
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index f35c2e998406..82cf9fbe2668 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -130,3 +130,40 @@ void reqsk_fastopen_remove(struct sock *sk, struct request_sock *req,
 out:
 	spin_unlock_bh(&fastopenq->lock);
 }
+
+struct request_sock *reqsk_clone(struct request_sock *req, struct sock *sk)
+{
+	struct sock *req_sk, *nreq_sk;
+	struct request_sock *nreq;
+
+	nreq = kmem_cache_alloc(req->rsk_ops->slab, GFP_ATOMIC | __GFP_NOWARN);
+	if (!nreq) {
+		/* paired with refcount_inc_not_zero() in reuseport_migrate_sock() */
+		sock_put(sk);
+		return NULL;
+	}
+
+	req_sk = req_to_sk(req);
+	nreq_sk = req_to_sk(nreq);
+
+	memcpy(nreq_sk, req_sk,
+	       offsetof(struct sock, sk_dontcopy_begin));
+	memcpy(&nreq_sk->sk_dontcopy_end, &req_sk->sk_dontcopy_end,
+	       req->rsk_ops->obj_size - offsetof(struct sock, sk_dontcopy_end));
+
+	nreq_sk->sk_tx_queue_mapping = req_sk->sk_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	nreq_sk->sk_rx_queue_mapping = req_sk->sk_rx_queue_mapping;
+#endif
+	nreq_sk->sk_incoming_cpu = req_sk->sk_incoming_cpu;
+
+	nreq->rsk_listener = sk;
+
+	/* We need not acquire fastopenq->lock
+	 * because the child socket is locked in inet_csk_listen_stop().
+	 */
+	if (tcp_rsk(nreq)->tfo_listener)
+		rcu_assign_pointer(tcp_sk(nreq->sk)->fastopen_rsk, nreq);
+
+	return nreq;
+}
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fa806e9167ec..851992405826 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -695,6 +695,13 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req)
 }
 EXPORT_SYMBOL(inet_rtx_syn_ack);
 
+static void reqsk_migrate_reset(struct request_sock *req)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	inet_rsk(req)->ipv6_opt = NULL;
+#endif
+}
+
 /* return true if req was found in the ehash table */
 static bool reqsk_queue_unlink(struct request_sock *req)
 {
@@ -1036,14 +1043,36 @@ void inet_csk_listen_stop(struct sock *sk)
 	 * of the variants now.			--ANK
 	 */
 	while ((req = reqsk_queue_remove(queue, sk)) != NULL) {
-		struct sock *child = req->sk;
+		struct sock *child = req->sk, *nsk;
+		struct request_sock *nreq;
 
 		local_bh_disable();
 		bh_lock_sock(child);
 		WARN_ON(sock_owned_by_user(child));
 		sock_hold(child);
 
+		nsk = reuseport_migrate_sock(sk, child, NULL);
+		if (nsk) {
+			nreq = reqsk_clone(req, nsk);
+			if (nreq) {
+				refcount_set(&nreq->rsk_refcnt, 1);
+
+				if (inet_csk_reqsk_queue_add(nsk, nreq, child)) {
+					reqsk_migrate_reset(req);
+				} else {
+					reqsk_migrate_reset(nreq);
+					__reqsk_free(nreq);
+				}
+
+				/* inet_csk_reqsk_queue_add() has already
+				 * called inet_child_forget() on failure case.
+				 */
+				goto skip_child_forget;
+			}
+		}
+
 		inet_child_forget(sk, req, child);
+skip_child_forget:
 		reqsk_put(req);
 		bh_unlock_sock(child);
 		local_bh_enable();
-- 
2.30.2



* [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-05-05  4:56   ` Martin KaFai Lau
  2021-04-27  3:46 ` [PATCH v4 bpf-next 07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

As with the preceding patch, this patch changes reqsk_timer_handler() to
call reuseport_migrate_sock() and reqsk_clone() to migrate in-flight
requests at retransmitting SYN+ACKs. If we can select a new listener and
clone the request, we resume setting the SYN+ACK timer for the new req. If
we can set the timer, we call inet_ehash_insert() to unhash the old req and
put the new req into ehash.

The noteworthy point here is that by unhashing the old req, another CPU
processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and
drop the final ACK packet. However, the new timer will recover the
situation by retransmitting the SYN+ACK.
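
A condensed timeline of that recovery:

  CPU A: reqsk_timer_handler() clones req into nreq, then
         inet_ehash_insert() unhashes the old req
  CPU B: receives the final ACK, loses the "own_req" race,
         and drops the ACK
  later: nreq's SYN+ACK timer expires and retransmits, so the
         client's next ACK completes the handshake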

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/core/request_sock.c         |  1 +
 net/ipv4/inet_connection_sock.c | 76 +++++++++++++++++++++++++++++----
 2 files changed, 68 insertions(+), 9 deletions(-)

diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 82cf9fbe2668..08c37ecd923b 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -151,6 +151,7 @@ struct request_sock *reqsk_clone(struct request_sock *req, struct sock *sk)
 	memcpy(&nreq_sk->sk_dontcopy_end, &req_sk->sk_dontcopy_end,
 	       req->rsk_ops->obj_size - offsetof(struct sock, sk_dontcopy_end));
 
+	sk_node_init(&nreq_sk->sk_node);
 	nreq_sk->sk_tx_queue_mapping = req_sk->sk_tx_queue_mapping;
 #ifdef CONFIG_XPS
 	nreq_sk->sk_rx_queue_mapping = req_sk->sk_rx_queue_mapping;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 851992405826..dc984d1f352e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -695,10 +695,20 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req)
 }
 EXPORT_SYMBOL(inet_rtx_syn_ack);
 
+static void reqsk_queue_migrated(struct request_sock_queue *queue,
+				 const struct request_sock *req)
+{
+	if (req->num_timeout == 0)
+		atomic_inc(&queue->young);
+	atomic_inc(&queue->qlen);
+}
+
 static void reqsk_migrate_reset(struct request_sock *req)
 {
+	req->saved_syn = NULL;
+	inet_rsk(req)->ireq_opt = NULL;
 #if IS_ENABLED(CONFIG_IPV6)
-	inet_rsk(req)->ipv6_opt = NULL;
+	inet_rsk(req)->pktopts = NULL;
 #endif
 }
 
@@ -741,16 +751,37 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_drop_and_put);
 
 static void reqsk_timer_handler(struct timer_list *t)
 {
-	struct request_sock *req = from_timer(req, t, rsk_timer);
-	struct sock *sk_listener = req->rsk_listener;
-	struct net *net = sock_net(sk_listener);
-	struct inet_connection_sock *icsk = inet_csk(sk_listener);
-	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
+	struct request_sock *req = from_timer(req, t, rsk_timer), *nreq = NULL, *oreq = req;
+	struct sock *sk_listener = req->rsk_listener, *nsk = NULL;
+	struct inet_connection_sock *icsk;
+	struct request_sock_queue *queue;
+	struct net *net;
 	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
 
-	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
-		goto drop;
+	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
+		nsk = reuseport_migrate_sock(sk_listener, req_to_sk(req), NULL);
+		if (!nsk)
+			goto drop;
+
+		nreq = reqsk_clone(req, nsk);
+		if (!nreq)
+			goto drop;
+
+		/* The new timer for the cloned req can decrease the 2
+		 * by calling inet_csk_reqsk_queue_drop_and_put(), so
+		 * hold another count to prevent use-after-free and
+		 * call reqsk_put() just before return.
+		 */
+		refcount_set(&nreq->rsk_refcnt, 2 + 1);
+		timer_setup(&nreq->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
+		reqsk_queue_migrated(&inet_csk(nsk)->icsk_accept_queue, req);
+
+		req = nreq;
+		sk_listener = nsk;
+	}
 
+	icsk = inet_csk(sk_listener);
+	net = sock_net(sk_listener);
 	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
 	/* Normally all the openreqs are young and become mature
 	 * (i.e. converted to established socket) for first timeout.
@@ -769,6 +800,7 @@ static void reqsk_timer_handler(struct timer_list *t)
 	 * embrions; and abort old ones without pity, if old
 	 * ones are about to clog our table.
 	 */
+	queue = &icsk->icsk_accept_queue;
 	qlen = reqsk_queue_len(queue);
 	if ((qlen << 1) > max(8U, READ_ONCE(sk_listener->sk_max_ack_backlog))) {
 		int young = reqsk_queue_len_young(queue) << 1;
@@ -793,10 +825,36 @@ static void reqsk_timer_handler(struct timer_list *t)
 			atomic_dec(&queue->young);
 		timeo = min(TCP_TIMEOUT_INIT << req->num_timeout, TCP_RTO_MAX);
 		mod_timer(&req->rsk_timer, jiffies + timeo);
+
+		if (!nreq)
+			return;
+
+		if (!inet_ehash_insert(req_to_sk(nreq), req_to_sk(oreq), NULL)) {
+			/* delete timer */
+			inet_csk_reqsk_queue_drop(sk_listener, nreq);
+			goto drop;
+		}
+
+		reqsk_migrate_reset(oreq);
+		reqsk_queue_removed(&inet_csk(oreq->rsk_listener)->icsk_accept_queue, oreq);
+		reqsk_put(oreq);
+
+		reqsk_put(nreq);
 		return;
 	}
+
 drop:
-	inet_csk_reqsk_queue_drop_and_put(sk_listener, req);
+	/* Even if we can clone the req, we may need not retransmit any more
+	 * SYN+ACKs (nreq->num_timeout > max_syn_ack_retries, etc), or another
+	 * CPU may win the "own_req" race so that inet_ehash_insert() fails.
+	 */
+	if (nreq) {
+		reqsk_migrate_reset(nreq);
+		reqsk_queue_removed(queue, nreq);
+		__reqsk_free(nreq);
+	}
+
+	inet_csk_reqsk_queue_drop_and_put(oreq->rsk_listener, oreq);
 }
 
 static void reqsk_queue_hash_req(struct request_sock *req,
-- 
2.30.2



* [PATCH v4 bpf-next 07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch also changes the code to call reuseport_migrate_sock() and
reqsk_clone(), but unlike the other cases, we do not call reqsk_clone()
right after reuseport_migrate_sock().

Currently, in the receive path for TCP_NEW_SYN_RECV sockets, the listener
has three kinds of refcnt:

  (A) for listener itself
  (B) carried by request_sock
  (C) sock_hold() in tcp_v[46]_rcv()

While processing the req, (A) may disappear by close(listener). Also, (B)
can disappear by accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.

For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.

In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating the req after
winning the "own_req" race, we can avoid the following worst case:

  CPU 1 looks up req1
  CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
  CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
  ...

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/ipv4/inet_connection_sock.c | 30 +++++++++++++++++++++++++++++-
 net/ipv4/tcp_ipv4.c             | 20 ++++++++++++++------
 net/ipv6/tcp_ipv6.c             | 14 +++++++++++---
 3 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index dc984d1f352e..2f1e5897137b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1072,10 +1072,38 @@ struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 	if (own_req) {
 		inet_csk_reqsk_queue_drop(sk, req);
 		reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
-		if (inet_csk_reqsk_queue_add(sk, req, child))
+
+		if (sk != req->rsk_listener) {
+			/* another listening sk has been selected,
+			 * migrate the req to it.
+			 */
+			struct request_sock *nreq;
+
+			/* hold a refcnt for the nreq->rsk_listener
+			 * which is assigned in reqsk_clone()
+			 */
+			sock_hold(sk);
+			nreq = reqsk_clone(req, sk);
+			if (!nreq) {
+				inet_child_forget(sk, req, child);
+				goto child_put;
+			}
+
+			refcount_set(&nreq->rsk_refcnt, 1);
+			if (inet_csk_reqsk_queue_add(sk, nreq, child)) {
+				reqsk_migrate_reset(req);
+				reqsk_put(req);
+				return child;
+			}
+
+			reqsk_migrate_reset(nreq);
+			__reqsk_free(nreq);
+		} else if (inet_csk_reqsk_queue_add(sk, req, child)) {
 			return child;
+		}
 	}
 	/* Too bad, another child took ownership of the request, undo. */
+child_put:
 	bh_unlock_sock(child);
 	sock_put(child);
 	return NULL;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 312184cead57..214495d02143 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2000,13 +2000,21 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_migrate_sock(sk, req_to_sk(req), skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			sk = nsk;
+			/* reuseport_migrate_sock() has already held one sk_refcnt
+			 * before returning.
+			 */
+		} else {
+			/* We own a reference on the listener, increase it again
+			 * as we might lose it too soon.
+			 */
+			sock_hold(sk);
 		}
-		/* We own a reference on the listener, increase it again
-		 * as we might lose it too soon.
-		 */
-		sock_hold(sk);
 		refcounted = true;
 		nsk = NULL;
 		if (!tcp_filter(sk, skb)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5f47c0b6e3de..aea8e75d3fed 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1663,10 +1663,18 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_migrate_sock(sk, req_to_sk(req), skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			sk = nsk;
+			/* reuseport_migrate_sock() has already held one sk_refcnt
+			 * before returning.
+			 */
+		} else {
+			sock_hold(sk);
 		}
-		sock_hold(sk);
 		refcounted = true;
 		nsk = NULL;
 		if (!tcp_filter(sk, skb)) {
-- 
2.30.2



* [PATCH v4 bpf-next 08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 09/11] bpf: Support socket migration by eBPF Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing to select a new
listener.

We can currently get a unique ID for each listener in user space by
calling bpf_map_lookup_elem() on a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.

This patch makes the sk pointer available in sk_reuseport_md so that we
can get the same ID via BPF_FUNC_get_socket_cookie() in the eBPF program.
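
For illustration, a BPF_PROG_TYPE_SK_REUSEPORT program could read the ID
as in the sketch below. The function name and the bpf_printk() debug
output are only for this example; the real usage is in the selftest of
patch 11.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sk_reuseport")
int dump_listener_cookie(struct sk_reuseport_md *reuse_md)
{
	/* The same ID that user space sees when it looks up this
	 * listener's slot in a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
	 */
	__u64 cookie = bpf_get_socket_cookie(reuse_md->sk);

	bpf_printk("listener cookie: %llu", cookie);

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";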

Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/uapi/linux/bpf.h       |  1 +
 net/core/filter.c              | 10 ++++++++++
 tools/include/uapi/linux/bpf.h |  1 +
 3 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ec6d85a81744..d2e6bf2d464b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5364,6 +5364,7 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 #define BPF_TAG_SIZE	8
diff --git a/net/core/filter.c b/net/core/filter.c
index cae56d08a670..3d0f989f5d38 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10135,6 +10135,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id,
 		return &sk_reuseport_load_bytes_proto;
 	case BPF_FUNC_skb_load_bytes_relative:
 		return &sk_reuseport_load_bytes_relative_proto;
+	case BPF_FUNC_get_socket_cookie:
+		return &bpf_get_socket_ptr_cookie_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -10164,6 +10166,10 @@ sk_reuseport_is_valid_access(int off, int size,
 	case offsetof(struct sk_reuseport_md, hash):
 		return size == size_default;
 
+	case offsetof(struct sk_reuseport_md, sk):
+		info->reg_type = ARG_PTR_TO_SOCKET;
+		return size == sizeof(__u64);
+
 	/* Fields that allow narrowing */
 	case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
 		if (size < sizeof_field(struct sk_buff, protocol))
@@ -10236,6 +10242,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct sk_reuseport_md, bind_inany):
 		SK_REUSEPORT_LOAD_FIELD(bind_inany);
 		break;
+
+	case offsetof(struct sk_reuseport_md, sk):
+		SK_REUSEPORT_LOAD_FIELD(sk);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ec6d85a81744..d2e6bf2d464b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5364,6 +5364,7 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 #define BPF_TAG_SIZE	8
-- 
2.30.2



* [PATCH v4 bpf-next 09/11] bpf: Support socket migration by eBPF.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
to check if the attached eBPF program is capable of migrating sockets. When
the eBPF program is attached, we run it for socket migration if the
expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
net.ipv4.tcp_migrate_req is enabled.

Currently, the expected_attach_type is not enforced for the
BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to
fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().

Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md so
that we can select a new listener based on the child socket. What
migrating_sk points to varies depending on whether we are migrating a
request in the accept queue or one still in the middle of the 3WHS:

  - accept_queue : sock (ESTABLISHED/SYN_RECV)
  - 3WHS         : request_sock (NEW_SYN_RECV)

In the eBPF program, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.

  - SK_PASS with selected_sk set: select it as the new listener
  - SK_PASS with selected_sk NULL: fall back to the random selection
  - SK_DROP: cancel the migration (see the sketch below)
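
A minimal sketch of a migration-aware program follows. The map name, its
size, and the fixed key are illustrative only; the section name is wired
up by the libbpf patch later in this series, and the complete program
lives in the selftest of patch 11.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 16);
	__type(key, int);
	__type(value, __u64);
} listeners SEC(".maps");

SEC("sk_reuseport/migrate")
int migrate_or_select(struct sk_reuseport_md *reuse_md)
{
	int key = 0;

	/* NULL migrating_sk: normal SYN dispatch, let the kernel
	 * pick a listener by hash.
	 */
	if (!reuse_md->migrating_sk)
		return SK_PASS;

	/* Migration: move the child/request to listeners[0], or
	 * return SK_DROP to cancel the migration.
	 */
	if (bpf_sk_select_reuseport(reuse_md, &listeners, &key, 0))
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";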

One point is worth noting: we select a listening socket in three places,
but we do not have a struct sk_buff when closing a listener or
retransmitting a SYN+ACK. On the other hand, some helper functions do not
expect skb to be NULL (e.g. skb_header_pointer() in
BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in
BPF_FUNC_skb_load_bytes_relative()). So we temporarily allocate an empty
skb before running the eBPF program.

Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/linux/bpf.h            |  1 +
 include/linux/filter.h         |  2 ++
 include/uapi/linux/bpf.h       | 15 +++++++++++++++
 kernel/bpf/syscall.c           | 13 +++++++++++++
 net/core/filter.c              | 13 ++++++++++++-
 net/core/sock_reuseport.c      | 34 ++++++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h | 15 +++++++++++++++
 7 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ad4bcf1cadbb..e3b6d4532866 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2015,6 +2015,7 @@ struct sk_reuseport_kern {
 	struct sk_buff *skb;
 	struct sock *sk;
 	struct sock *selected_sk;
+	struct sock *migrating_sk;
 	void *data_end;
 	u32 hash;
 	u32 reuseport_id;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9a09547bc7ba..226f76c0b974 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -995,11 +995,13 @@ void bpf_warn_invalid_xdp_action(u32 act);
 #ifdef CONFIG_INET
 struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 				  struct bpf_prog *prog, struct sk_buff *skb,
+				  struct sock *migrating_sk,
 				  u32 hash);
 #else
 static inline struct sock *
 bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 		     struct bpf_prog *prog, struct sk_buff *skb,
+		     struct sock *migrating_sk,
 		     u32 hash)
 {
 	return NULL;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d2e6bf2d464b..ee8b5435ab1f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,8 @@ enum bpf_attach_type {
 	BPF_SK_LOOKUP,
 	BPF_XDP,
 	BPF_SK_SKB_VERDICT,
+	BPF_SK_REUSEPORT_SELECT,
+	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -5364,7 +5366,20 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	/* When reuse->migrating_sk is NULL, it is selecting a sk for the
+	 * new incoming connection request (e.g. selecting a listen sk for
+	 * the received SYN in the TCP case).  reuse->sk is one of the sk
+	 * in the reuseport group. The bpf prog can use reuse->sk to learn
+	 * the local listening ip/port without looking into the skb.
+	 *
+	 * When reuse->migrating_sk is not NULL, reuse->sk is closed and
+	 * reuse->migrating_sk is the socket that needs to be migrated
+	 * to another listening socket.  migrating_sk could be a fullsock
+	 * sk that is fully established or a reqsk that is in-the-middle
+	 * of 3-way handshake.
+	 */
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(struct bpf_sock *, migrating_sk);
 };
 
 #define BPF_TAG_SIZE	8
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fd495190115e..217b5a7c01bd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1931,6 +1931,11 @@ static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr)
 			attr->expected_attach_type =
 				BPF_CGROUP_INET_SOCK_CREATE;
 		break;
+	case BPF_PROG_TYPE_SK_REUSEPORT:
+		if (!attr->expected_attach_type)
+			attr->expected_attach_type =
+				BPF_SK_REUSEPORT_SELECT;
+		break;
 	}
 }
 
@@ -2014,6 +2019,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		if (expected_attach_type == BPF_SK_LOOKUP)
 			return 0;
 		return -EINVAL;
+	case BPF_PROG_TYPE_SK_REUSEPORT:
+		switch (expected_attach_type) {
+		case BPF_SK_REUSEPORT_SELECT:
+		case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
 			return -EINVAL;
diff --git a/net/core/filter.c b/net/core/filter.c
index 3d0f989f5d38..1dfe2f820950 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10007,11 +10007,13 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 				    struct sock_reuseport *reuse,
 				    struct sock *sk, struct sk_buff *skb,
+				    struct sock *migrating_sk,
 				    u32 hash)
 {
 	reuse_kern->skb = skb;
 	reuse_kern->sk = sk;
 	reuse_kern->selected_sk = NULL;
+	reuse_kern->migrating_sk = migrating_sk;
 	reuse_kern->data_end = skb->data + skb_headlen(skb);
 	reuse_kern->hash = hash;
 	reuse_kern->reuseport_id = reuse->reuseport_id;
@@ -10020,12 +10022,13 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 
 struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 				  struct bpf_prog *prog, struct sk_buff *skb,
+				  struct sock *migrating_sk,
 				  u32 hash)
 {
 	struct sk_reuseport_kern reuse_kern;
 	enum sk_action action;
 
-	bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash);
+	bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, migrating_sk, hash);
 	action = BPF_PROG_RUN(prog, &reuse_kern);
 
 	if (action == SK_PASS)
@@ -10170,6 +10173,10 @@ sk_reuseport_is_valid_access(int off, int size,
 		info->reg_type = ARG_PTR_TO_SOCKET;
 		return size == sizeof(__u64);
 
+	case offsetof(struct sk_reuseport_md, migrating_sk):
+		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
+		return size == sizeof(__u64);
+
 	/* Fields that allow narrowing */
 	case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
 		if (size < sizeof_field(struct sk_buff, protocol))
@@ -10246,6 +10253,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct sk_reuseport_md, sk):
 		SK_REUSEPORT_LOAD_FIELD(sk);
 		break;
+
+	case offsetof(struct sk_reuseport_md, migrating_sk):
+		SK_REUSEPORT_LOAD_FIELD(migrating_sk);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index a2bca39ec0e3..d35c8f51ab72 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -378,13 +378,17 @@ void reuseport_stop_listen_sock(struct sock *sk)
 {
 	if (sk->sk_protocol == IPPROTO_TCP) {
 		struct sock_reuseport *reuse;
+		struct bpf_prog *prog;
 
 		spin_lock_bh(&reuseport_lock);
 
 		reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 						  lockdep_is_held(&reuseport_lock));
+		prog = rcu_dereference_protected(reuse->prog,
+						 lockdep_is_held(&reuseport_lock));
 
-		if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) {
+		if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req ||
+		    (prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
 			/* Migration capable, move sk from the listening section
 			 * to the closed section.
 			 */
@@ -489,7 +493,7 @@ struct sock *reuseport_select_sock(struct sock *sk,
 			goto select_by_hash;
 
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
-			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
+			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, NULL, hash);
 		else
 			sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len);
 
@@ -520,6 +524,8 @@ struct sock *reuseport_migrate_sock(struct sock *sk,
 {
 	struct sock_reuseport *reuse;
 	struct sock *nsk = NULL;
+	bool allocated = false;
+	struct bpf_prog *prog;
 	u16 socks;
 	u32 hash;
 
@@ -537,10 +543,30 @@ struct sock *reuseport_migrate_sock(struct sock *sk,
 	smp_rmb();
 
 	hash = migrating_sk->sk_hash;
-	if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req)
+	prog = rcu_dereference(reuse->prog);
+	if (!prog || prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE) {
+		if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req)
+			goto select_by_hash;
+		goto out;
+	}
+
+	if (!skb) {
+		skb = alloc_skb(0, GFP_ATOMIC);
+		if (!skb)
+			goto out;
+		allocated = true;
+	}
+
+	nsk = bpf_run_sk_reuseport(reuse, sk, prog, skb, migrating_sk, hash);
+
+	if (allocated)
+		kfree_skb(skb);
+
+select_by_hash:
+	if (!nsk)
 		nsk = reuseport_select_sock_by_hash(reuse, hash, socks);
 
-	if (nsk && unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt)))
+	if (IS_ERR_OR_NULL(nsk) || unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt)))
 		nsk = NULL;
 
 out:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d2e6bf2d464b..ee8b5435ab1f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -979,6 +979,8 @@ enum bpf_attach_type {
 	BPF_SK_LOOKUP,
 	BPF_XDP,
 	BPF_SK_SKB_VERDICT,
+	BPF_SK_REUSEPORT_SELECT,
+	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -5364,7 +5366,20 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	/* When reuse->migrating_sk is NULL, it is selecting a sk for the
+	 * new incoming connection request (e.g. selecting a listen sk for
+	 * the received SYN in the TCP case).  reuse->sk is one of the sk
+	 * in the reuseport group. The bpf prog can use reuse->sk to learn
+	 * the local listening ip/port without looking into the skb.
+	 *
+	 * When reuse->migrating_sk is not NULL, reuse->sk is closed and
+	 * reuse->migrating_sk is the socket that needs to be migrated
+	 * to another listening socket.  migrating_sk could be a fullsock
+	 * sk that is fully established or a reqsk that is in-the-middle
+	 * of 3-way handshake.
+	 */
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(struct bpf_sock *, migrating_sk);
 };
 
 #define BPF_TAG_SIZE	8
-- 
2.30.2



* [PATCH v4 bpf-next 10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 09/11] bpf: Support socket migration by eBPF Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-04-27  3:46 ` [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This commit introduces a new section (sk_reuseport/migrate) and sets the
expected_attach_type for each of the two sections of a
BPF_PROG_TYPE_SK_REUSEPORT program: BPF_SK_REUSEPORT_SELECT_OR_MIGRATE
for "sk_reuseport/migrate" and BPF_SK_REUSEPORT_SELECT for "sk_reuseport".
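
For example, with illustrative program names (both bodies are stubs):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* libbpf fixes this one up to BPF_SK_REUSEPORT_SELECT: */
SEC("sk_reuseport")
int select_only(struct sk_reuseport_md *reuse_md)
{
	return SK_PASS;
}

/* ...and this one to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, so the
 * kernel may also run it for socket migration:
 */
SEC("sk_reuseport/migrate")
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";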

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 tools/lib/bpf/libbpf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a1cddd17af7d..538849fb6bb7 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8818,7 +8818,10 @@ static struct bpf_link *attach_iter(const struct bpf_sec_def *sec,
 
 static const struct bpf_sec_def section_defs[] = {
 	BPF_PROG_SEC("socket",			BPF_PROG_TYPE_SOCKET_FILTER),
-	BPF_PROG_SEC("sk_reuseport",		BPF_PROG_TYPE_SK_REUSEPORT),
+	BPF_EAPROG_SEC("sk_reuseport/migrate",	BPF_PROG_TYPE_SK_REUSEPORT,
+						BPF_SK_REUSEPORT_SELECT_OR_MIGRATE),
+	BPF_EAPROG_SEC("sk_reuseport",		BPF_PROG_TYPE_SK_REUSEPORT,
+						BPF_SK_REUSEPORT_SELECT),
 	SEC_DEF("kprobe/", KPROBE,
 		.attach_fn = attach_kprobe),
 	BPF_PROG_SEC("uprobe/",			BPF_PROG_TYPE_KPROBE),
-- 
2.30.2



* [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
@ 2021-04-27  3:46 ` Kuniyuki Iwashima
  2021-05-05  5:14   ` Martin KaFai Lau
  2021-04-27 16:38 ` [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Jason Baron
  2021-04-27 21:55 ` Maciej Żenczykowski
  12 siblings, 1 reply; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-27  3:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and
removes 'static' from settimeo() in network_helpers.c.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 tools/testing/selftests/bpf/network_helpers.c |   2 +-
 tools/testing/selftests/bpf/network_helpers.h |   1 +
 .../bpf/prog_tests/migrate_reuseport.c        | 484 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport.c        |  51 ++
 4 files changed, 537 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 12ee40284da0..2060bc122c53 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -40,7 +40,7 @@ struct ipv6_packet pkt_v6 = {
 	.tcp.doff = 5,
 };
 
-static int settimeo(int fd, int timeout_ms)
+int settimeo(int fd, int timeout_ms)
 {
 	struct timeval timeout = { .tv_sec = 3 };
 
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 7205f8afdba1..5e0d51c07b63 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -33,6 +33,7 @@ struct ipv6_packet {
 } __packed;
 extern struct ipv6_packet pkt_v6;
 
+int settimeo(int fd, int timeout_ms);
 int start_server(int family, int type, const char *addr, __u16 port,
 		 int timeout_ms);
 int connect_to_fd(int server_fd, int timeout_ms);
diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
new file mode 100644
index 000000000000..1b33df1902fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
@@ -0,0 +1,484 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ *   1. call listen() for 5 server sockets.
+ *   2. update a map to migrate all child sockets
+ *        to the last server socket (migrate_map[cookie] = 4)
+ *   3. call connect() for 25 client sockets.
+ *   4. call shutdown() for first 4 server sockets
+ *        and migrate the requests in the accept queue
+ *        to the last server socket.
+ *   5. call listen() for the second server socket.
+ *   6. call shutdown() for the last server
+ *        and migrate the requests in the accept queue
+ *        to the second server socket.
+ *   7. call listen() for the last server.
+ *   8. call shutdown() for the second server
+ *        and migrate the requests in the accept queue
+ *        to the last server socket.
+ *   9. call accept() for the last server socket.
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "test_progs.h"
+#include "test_migrate_reuseport.skel.h"
+#include "network_helpers.h"
+
+#define NR_SERVERS 5
+#define NR_CLIENTS (NR_SERVERS * 5)
+#define MIGRATED_TO (NR_SERVERS - 1)
+
+/* fastopenq->max_qlen and sk->sk_max_ack_backlog */
+#define QLEN (NR_CLIENTS * 5)
+
+#define MSG "Hello World\0"
+#define MSGLEN 12
+
+static struct migrate_reuseport_test_case {
+	const char *name;
+	__s64 servers[NR_SERVERS];
+	__s64 clients[NR_CLIENTS];
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	int family;
+	bool drop_ack;
+	bool expire_synack_timer;
+	bool fastopen;
+} test_cases[] = {
+	{
+		.name = "IPv4 - TCP_ESTABLISHED - inet_csk_listen_stop",
+		.family = AF_INET,
+		.drop_ack = false,
+		.expire_synack_timer = false,
+		.fastopen = false,
+	},
+	{
+		.name = "IPv4 - TCP_SYN_RECV - inet_csk_listen_stop",
+		.family = AF_INET,
+		.drop_ack = true,
+		.expire_synack_timer = false,
+		.fastopen = true,
+	},
+	{
+		.name = "IPv4 - TCP_NEW_SYN_RECV - inet_csk_complete_hashdance",
+		.family = AF_INET,
+		.drop_ack = true,
+		.expire_synack_timer = false,
+		.fastopen = false,
+	},
+	{
+		.name = "IPv4 - TCP_NEW_SYN_RECV - reqsk_timer_handler",
+		.family = AF_INET,
+		.drop_ack = true,
+		.expire_synack_timer = true,
+		.fastopen = false,
+	},
+	{
+		.name = "IPv6 - TCP_ESTABLISHED - inet_csk_listen_stop",
+		.family = AF_INET6,
+		.drop_ack = false,
+		.expire_synack_timer = false,
+		.fastopen = false,
+	},
+	{
+		.name = "IPv6 - TCP_SYN_RECV - inet_csk_listen_stop",
+		.family = AF_INET6,
+		.drop_ack = true,
+		.expire_synack_timer = false,
+		.fastopen = true,
+	},
+	{
+		.name = "IPv6 - TCP_NEW_SYN_RECV - inet_csk_complete_hashdance",
+		.family = AF_INET6,
+		.drop_ack = true,
+		.expire_synack_timer = false,
+		.fastopen = false,
+	},
+	{
+		.name = "IPv6 - TCP_NEW_SYN_RECV - reqsk_timer_handler",
+		.family = AF_INET6,
+		.drop_ack = true,
+		.expire_synack_timer = true,
+		.fastopen = false,
+	}
+};
+
+static void init_fds(__s64 fds[], int len)
+{
+	int i;
+
+	for (i = 0; i < len; i++)
+		fds[i] = -1;
+}
+
+static void close_fds(__s64 fds[], int len)
+{
+	int i;
+
+	for (i = 0; i < len; i++) {
+		if (fds[i] != -1) {
+			close(fds[i]);
+			fds[i] = -1;
+		}
+	}
+}
+
+static int setup_fastopen(char *buf, int size, int *saved_len, bool restore)
+{
+	int err = 0, fd, len;
+
+	fd = open("/proc/sys/net/ipv4/tcp_fastopen", O_RDWR);
+	if (!ASSERT_NEQ(fd, -1, "open"))
+		return -1;
+
+	if (restore) {
+		len = write(fd, buf, *saved_len);
+		if (!ASSERT_EQ(len, *saved_len, "write - restore"))
+			err = -1;
+	} else {
+		*saved_len = read(fd, buf, size);
+		if (!ASSERT_LT(1, *saved_len, "read")) {
+			err = -1;
+			goto close;
+		}
+
+		err = lseek(fd, 0, SEEK_SET);
+		if (!ASSERT_OK(err, "lseek"))
+			goto close;
+
+		/* (TFO_CLIENT_ENABLE | TFO_SERVER_ENABLE) */
+		len = write(fd, "3", 1);
+		if (!ASSERT_EQ(len, 1, "write - setup"))
+			err = -1;
+	}
+
+close:
+	close(fd);
+
+	return err;
+}
+
+static int run_iptables(struct migrate_reuseport_test_case *test_case,
+			bool add_rule)
+{
+	char buf[128];
+	int err;
+
+	sprintf(buf, "%s -%c OUTPUT -o lo -p tcp --dport %d --tcp-flags SYN,ACK ACK -j DROP",
+		test_case->family == AF_INET ? "iptables" : "ip6tables",
+		add_rule ? 'A' : 'D',
+		ntohs(test_case->family == AF_INET ?
+		      ((struct sockaddr_in *)&test_case->addr)->sin_port :
+		      ((struct sockaddr_in6 *)&test_case->addr)->sin6_port));
+
+	err = system(buf);
+
+	return err == -1 ? err : WEXITSTATUS(err);
+}
+
+static int start_servers(struct migrate_reuseport_test_case *test_case,
+			 struct test_migrate_reuseport *skel)
+{
+	int reuseport = 1, qlen = QLEN, migrated_to = MIGRATED_TO;
+	int i, err, prog_fd, reuseport_map_fd, migrate_map_fd;
+	__u64 value;
+
+	prog_fd = bpf_program__fd(skel->progs.prog_migrate_reuseport);
+	reuseport_map_fd = bpf_map__fd(skel->maps.reuseport_map);
+	migrate_map_fd = bpf_map__fd(skel->maps.migrate_map);
+
+	make_sockaddr(test_case->family,
+		      test_case->family == AF_INET ? "127.0.0.1" : "::1", 0,
+		      &test_case->addr, &test_case->addrlen);
+
+	for (i = 0; i < NR_SERVERS; i++) {
+		test_case->servers[i] = socket(test_case->family, SOCK_STREAM,
+					       IPPROTO_TCP);
+		if (!ASSERT_NEQ(test_case->servers[i], -1, "socket"))
+			return -1;
+
+		err = setsockopt(test_case->servers[i], SOL_SOCKET,
+				 SO_REUSEPORT, &reuseport, sizeof(reuseport));
+		if (!ASSERT_OK(err, "setsockopt - SO_REUSEPORT"))
+			return -1;
+
+		err = bind(test_case->servers[i],
+			   (struct sockaddr *)&test_case->addr,
+			   test_case->addrlen);
+		if (!ASSERT_OK(err, "bind"))
+			return -1;
+
+		if (i == 0) {
+			err = setsockopt(test_case->servers[i], SOL_SOCKET,
+					 SO_ATTACH_REUSEPORT_EBPF,
+					 &prog_fd, sizeof(prog_fd));
+			if (!ASSERT_OK(err,
+				       "setsockopt - SO_ATTACH_REUSEPORT_EBPF"))
+				return -1;
+
+			err = getsockname(test_case->servers[i],
+					  (struct sockaddr *)&test_case->addr,
+					  &test_case->addrlen);
+			if (!ASSERT_OK(err, "getsockname"))
+				return -1;
+		}
+
+		if (test_case->fastopen) {
+			err = setsockopt(test_case->servers[i],
+					 SOL_TCP, TCP_FASTOPEN,
+					 &qlen, sizeof(qlen));
+			if (!ASSERT_OK(err, "setsockopt - TCP_FASTOPEN"))
+				return -1;
+		}
+
+		err = listen(test_case->servers[i], qlen);
+		if (!ASSERT_OK(err, "listen"))
+			return -1;
+
+		value = (__u64)test_case->servers[i];
+		err = bpf_map_update_elem(reuseport_map_fd, &i, &value,
+					  BPF_NOEXIST);
+		if (!ASSERT_OK(err, "bpf_map_update_elem - reuseport_map"))
+			return -1;
+
+		err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value);
+		if (!ASSERT_OK(err, "bpf_map_lookup_elem - reuseport_map"))
+			return -1;
+
+		err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to,
+					  BPF_NOEXIST);
+		if (!ASSERT_OK(err, "bpf_map_update_elem - migrate_map"))
+			return -1;
+	}
+
+	return 0;
+}
+
+static int start_clients(struct migrate_reuseport_test_case *test_case)
+{
+	char buf[MSGLEN] = MSG;
+	int i, err;
+
+	for (i = 0; i < NR_CLIENTS; i++) {
+		test_case->clients[i] = socket(test_case->family, SOCK_STREAM,
+					       IPPROTO_TCP);
+		if (!ASSERT_NEQ(test_case->clients[i], -1, "socket"))
+			return -1;
+
+		/* iptables only drops the final ACK, so clients will
+		 * transition to TCP_ESTABLISHED immediately.
+		 */
+		err = settimeo(test_case->clients[i], 100);
+		if (!ASSERT_OK(err, "settimeo"))
+			return -1;
+
+		if (test_case->fastopen) {
+			int fastopen = 1;
+
+			err = setsockopt(test_case->clients[i], IPPROTO_TCP,
+					 TCP_FASTOPEN_CONNECT, &fastopen,
+					 sizeof(fastopen));
+			if (!ASSERT_OK(err,
+				       "setsockopt - TCP_FASTOPEN_CONNECT"))
+				return -1;
+		}
+
+		err = connect(test_case->clients[i],
+			      (struct sockaddr *)&test_case->addr,
+			      test_case->addrlen);
+		if (!ASSERT_OK(err, "connect"))
+			return -1;
+
+		err = write(test_case->clients[i], buf, MSGLEN);
+		if (!ASSERT_EQ(err, MSGLEN, "write"))
+			return -1;
+	}
+
+	return 0;
+}
+
+static int migrate_dance(struct migrate_reuseport_test_case *test_case)
+{
+	int i, err;
+
+	/* Migrate TCP_ESTABLISHED and TCP_SYN_RECV requests
+	 * to the last listener based on eBPF.
+	 */
+	for (i = 0; i < MIGRATED_TO; i++) {
+		err = shutdown(test_case->servers[i], SHUT_RDWR);
+		if (!ASSERT_OK(err, "shutdown"))
+			return -1;
+	}
+
+	/* No dance for TCP_NEW_SYN_RECV to migrate based on eBPF */
+	if (!test_case->fastopen && test_case->drop_ack)
+		return 0;
+
+	/* Note that we use the second listener instead of the
+	 * first one here.
+	 *
+	 * The first listener is bind()ed with port 0, and
+	 * SOCK_BINDPORT_LOCK is not set in sk_userlocks, so
+	 * calling listen() again will bind() the first listener
+	 * on a new ephemeral port and detach it from the existing
+	 * reuseport group.  (See: __inet_bind(), tcp_set_state())
+	 *
+	 * OTOH, the second one is bind()ed with a specific port,
+	 * and SOCK_BINDPORT_LOCK is set. Thus, re-listen() will
+	 * resurrect the listener on the existing reuseport group.
+	 */
+	err = listen(test_case->servers[1], QLEN);
+	if (!ASSERT_OK(err, "listen"))
+		return -1;
+
+	/* Migrate from the last listener to the second one.
+	 *
+	 * All listeners were detached out of the reuseport_map,
+	 * so migration will be done by kernel random pick from here.
+	 */
+	err = shutdown(test_case->servers[MIGRATED_TO], SHUT_RDWR);
+	if (!ASSERT_OK(err, "shutdown"))
+		return -1;
+
+	/* Back to the existing reuseport group */
+	err = listen(test_case->servers[MIGRATED_TO], QLEN);
+	if (!ASSERT_OK(err, "listen"))
+		return -1;
+
+	/* Migrate back to the last one from the second one */
+	err = shutdown(test_case->servers[1], SHUT_RDWR);
+	if (!ASSERT_OK(err, "shutdown"))
+		return -1;
+
+	return 0;
+}
+
+static int count_requests(struct migrate_reuseport_test_case *test_case)
+{
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
+	char buf[MSGLEN];
+	int cnt, client;
+
+	settimeo(test_case->servers[MIGRATED_TO], 2000);
+
+	for (cnt = 0; cnt < NR_CLIENTS; cnt++) {
+		client = accept(test_case->servers[MIGRATED_TO],
+				(struct sockaddr *)&addr, &len);
+		if (!ASSERT_NEQ(client, -1, "accept"))
+			goto out;
+
+		memset(buf, 0, MSGLEN);
+
+		read(client, &buf, MSGLEN);
+		if (!ASSERT_STREQ(buf, MSG, "read")) {
+			close(client);
+			goto out;
+		}
+
+		close(client);
+	}
+
+out:
+	return cnt;
+}
+
+static void run_test(struct migrate_reuseport_test_case *test_case,
+		     struct test_migrate_reuseport *skel)
+{
+	bool added_rule = false;
+	int err, saved_len;
+	char buf[16];
+
+	init_fds(test_case->servers, NR_SERVERS);
+	init_fds(test_case->clients, NR_CLIENTS);
+
+	if (test_case->fastopen) {
+		memset(buf, 0, sizeof(buf));
+
+		err = setup_fastopen(buf, sizeof(buf), &saved_len, false);
+		if (!ASSERT_OK(err, "setup_fastopen - setup"))
+			return;
+	}
+
+	err = start_servers(test_case, skel);
+	if (!ASSERT_OK(err, "start_servers"))
+		goto close_servers;
+
+	if (test_case->drop_ack) {
+		/* Drop the final ACK of the 3-way handshake and stick the
+		 * in-flight requests on TCP_SYN_RECV or TCP_NEW_SYN_RECV.
+		 */
+		err = run_iptables(test_case, true);
+		if (!ASSERT_OK(err, "run_iptables - add rule"))
+			goto close_servers;
+
+		added_rule = true;
+	}
+
+	err = start_clients(test_case);
+	if (!ASSERT_OK(err, "start_clients"))
+		goto close_clients;
+
+	/* Migrate the requests in the accept queue only.
+	 * TCP_NEW_SYN_RECV requests are not migrated at this point.
+	 */
+	err = migrate_dance(test_case);
+	if (!ASSERT_OK(err, "migrate_dance"))
+		goto close_clients;
+
+	if (test_case->expire_synack_timer) {
+		/* Wait for SYN+ACK timer to expire so that
+		 * reqsk_timer_handler() migrates TCP_NEW_SYN_RECV requests.
+		 */
+		sleep(1);
+	}
+
+	if (test_case->drop_ack) {
+		/* Resume 3WHS and migrate TCP_NEW_SYN_RECV requests */
+		err = run_iptables(test_case, false);
+		if (!ASSERT_OK(err, "run_iptables - delete rule"))
+			goto close_clients;
+
+		added_rule = false;
+	}
+
+	err = count_requests(test_case);
+	ASSERT_EQ(err, NR_CLIENTS, test_case->name);
+
+close_clients:
+	close_fds(test_case->clients, NR_CLIENTS);
+
+	if (added_rule) {
+		err = run_iptables(test_case, false);
+		ASSERT_OK(err, "run_iptables - clean up rule");
+	}
+
+close_servers:
+	close_fds(test_case->servers, NR_SERVERS);
+
+	if (test_case->fastopen) {
+		err = setup_fastopen(buf, sizeof(buf), &saved_len, true);
+		ASSERT_OK(err, "setup_fastopen - restore");
+	}
+}
+
+void test_migrate_reuseport(void)
+{
+	struct test_migrate_reuseport *skel;
+	int i;
+
+	skel = test_migrate_reuseport__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(test_cases); i++)
+		run_test(&test_cases[i], skel);
+
+	test_migrate_reuseport__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
new file mode 100644
index 000000000000..d7136dc29fa2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ *   1. If reuse_md->migrating_sk is NULL (SYN packet),
+ *        return SK_PASS without selecting a listener.
+ *   2. If reuse_md->migrating_sk is not NULL (socket migration),
+ *        select a listener (reuseport_map[migrate_map[cookie]])
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+
+#include <stddef.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
+	__uint(max_entries, 256);
+	__type(key, int);
+	__type(value, __u64);
+} reuseport_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 256);
+	__type(key, __u64);
+	__type(value, int);
+} migrate_map SEC(".maps");
+
+SEC("sk_reuseport/migrate")
+int prog_migrate_reuseport(struct sk_reuseport_md *reuse_md)
+{
+	int *key, flags = 0;
+	__u64 cookie;
+
+	if (!reuse_md->migrating_sk)
+		return SK_PASS;
+
+	cookie = bpf_get_socket_cookie(reuse_md->sk);
+
+	key = bpf_map_lookup_elem(&migrate_map, &cookie);
+	if (!key)
+		return SK_DROP;
+
+	bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags);
+
+	return SK_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.30.2



* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2021-04-27  3:46 ` [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
@ 2021-04-27 16:38 ` Jason Baron
  2021-04-28  1:27   ` Martin KaFai Lau
  2021-04-28  8:13   ` Kuniyuki Iwashima
  2021-04-27 21:55 ` Maciej Żenczykowski
  12 siblings, 2 replies; 29+ messages in thread
From: Jason Baron @ 2021-04-27 16:38 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 10156 bytes --]



On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.

Hi Kuniyuki,

I had implemented a different approach to this that I wanted to get your
thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
listen fd (or any other fd) around. Currently, if you have an 'old' webserver
that you want to replace with a 'new' webserver, you would need a separate
process to receive the listen fd and then have that process send the fd to
the new webserver, if they are not running con-currently. So instead what
I'm proposing is a 'delayed close' for a unix socket. That is, one could do:

1) bind unix socket with path '/sockets'
2) sendmsg() the listen fd via the unix socket
3) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
4) exit/close the old webserver and the listen socket
5) start the new webserver
6) create new unix socket and bind to '/sockets' (if it has MAY_WRITE file permissions)
7) recvmsg() the listen fd

So the idea is that we set a timeout on the unix socket. If the new process
does not start and bind to the unix socket, it simply closes, thus releasing
the listen socket. However, if it does bind it can now call recvmsg() and
use the listen fd as normal. It can then simply continue to use the old listen
fds and/or create new ones and drain the old ones.

Thus, the old and new webservers do not have to run concurrently. This doesn't
involve any changes to the tcp layer and can be used to pass any type of fd.
I'm not sure if it's actually useful for anything else, though.
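
To make the flow above concrete, here is a minimal, untested sketch of
the old-server side. The SOL_UNIX/UNIX_DELAY_CLOSE values come from the
rough patch attached below (they are not in mainline); error handling
and unlink() of a stale path are omitted:

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define SOL_UNIX	 5
#define UNIX_DELAY_CLOSE 1

/* Queue the listen fd on our own bound unix socket, arm the
 * delayed close, then the process can exit.
 */
static int park_listen_fd(int listen_fd)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = {
		.msg_name = &addr, .msg_namelen = sizeof(addr),
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int unix_fd, timeout_ms = 10000;

	strcpy(addr.sun_path, "/sockets");
	unix_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	bind(unix_fd, (struct sockaddr *)&addr, sizeof(addr));

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));
	sendmsg(unix_fd, &msg, 0);	/* the fd now sits in our queue */

	/* keep the queued fd alive for ~10s after close */
	return setsockopt(unix_fd, SOL_UNIX, UNIX_DELAY_CLOSE,
			  &timeout_ms, sizeof(timeout_ms));
}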

I'm not sure if this solves your use-case or not but I thought I'd share it.
One can also inherit the fds like in systemd's socket activation model, but
that again requires another process to hold open the listen fd.

I have a very rough patch (emphasis on rough), that implements this idea that
I'm attaching below to explain it better. It would need a bunch of fixups and
it's against an older kernel, but hopefully gives this direction a better
explanation.

Thanks,

-Jason




> 
> To avoid such a situation, users have to know deeply how the kernel handles
> SYN packets and implement connection draining by eBPF [2]:
> 
>   1. Stop routing SYN packets to the listener by eBPF.
>   2. Wait for all timers to expire to complete requests
>   3. Accept connections until EAGAIN, then close the listener.
> 
>   or
> 
>   1. Start counting SYN packets and accept syscalls using the eBPF map.
>   2. Stop routing SYN packets.
>   3. Accept connections up to the count, then close the listener.
> 
> In either way, we cannot close a listener immediately. However, ideally,
> the application need not drain the not yet accepted sockets because 3WHS
> and tying a connection to a listener are just the kernel behaviour. The
> root cause is within the kernel, so the issue should be addressed in kernel
> space and should not be visible to user space. This patchset fixes it so
> that users need not take care of kernel implementation and connection
> draining. With this patchset, the kernel redistributes requests and
> connections from a listener to the others in the same reuseport group
> at/after close or shutdown syscalls.
> 
> Although some software does connection draining, there are still merits in
> migration. For some security reasons, such as replacing TLS certificates,
> we may want to apply new settings as soon as possible and/or we may not be
> able to wait for connection draining. The sockets in the accept queue have
> not started application sessions yet. So, if we do not drain such sockets,
> they can be handled by the newer listeners and could have a longer
> lifetime. It is difficult to drain all connections in every case, but we
> can decrease such aborted connections by migration. In that sense,
> migration is always better than draining. 
> 
> Moreover, auto-migration simplifies user space logic and also works well in
> a case where we cannot modify and build a server program to implement the
> workaround.
> 
> Note that the source and destination listeners MUST have the same settings
> at the socket API level; otherwise, applications may face inconsistency and
> cause errors. In such a case, we have to use the eBPF program to select a
> specific listener or to cancel migration.
> 
> Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
> snippets along the way.
> 
> 
> Link:
>  [1] The SO_REUSEPORT socket option
>  https://lwn.net/Articles/542629/
> 
>  [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
>  https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/
> 
> 
> Changelog:
>  v4:
>   * Make some functions and variables 'static' in selftest
>   * Remove 'scalability' from the cover letter because it is not
>     primarily reason to use SO_REUSEPORT
> 
>  v3:
>  https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/
>   * Add sysctl back for reuseport_grow()
>   * Add helper functions to manage socks[]
>   * Separate migration related logic into functions: reuseport_resurrect(),
>     reuseport_stop_listen_sock(), reuseport_migrate_sock()
>   * Clone request_sock to be migrated
>   * Migrate request one by one
>   * Pass child socket to eBPF prog
> 
>  v2:
>  https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
>   * Do not save closed sockets in socks[]
>   * Revert 607904c357c61adf20b8fd18af765e501d61a385
>   * Extract inet_csk_reqsk_queue_migrate() into a single patch
>   * Change the spin_lock order to avoid lockdep warning
>   * Add static to __reuseport_select_sock
>   * Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
>   * Set the default attach type in bpf_prog_load_check_attach()
>   * Define new proto of BPF_FUNC_get_socket_cookie
>   * Fix test to be compiled successfully
>   * Update commit messages
> 
>  v1:
>  https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
>   * Remove the sysctl option
>   * Enable migration if eBPF progam is not attached
>   * Add expected_attach_type to check if eBPF program can migrate sockets
>   * Add a field to tell migration type to eBPF program
>   * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
>   * Allocate an empty skb if skb is NULL
>   * Pass req_to_sk(req)->sk_hash because listener's hash is zero
>   * Update commit messages and coverletter
> 
>  RFC:
>  https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/
> 
> 
> Kuniyuki Iwashima (11):
>   net: Introduce net.ipv4.tcp_migrate_req.
>   tcp: Add num_closed_socks to struct sock_reuseport.
>   tcp: Keep TCP_CLOSE sockets in the reuseport group.
>   tcp: Add reuseport_migrate_sock() to select a new listener.
>   tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
>   tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
>   tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
>   bpf: Support BPF_FUNC_get_socket_cookie() for
>     BPF_PROG_TYPE_SK_REUSEPORT.
>   bpf: Support socket migration by eBPF.
>   libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
>   bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> 
>  Documentation/networking/ip-sysctl.rst        |  20 +
>  include/linux/bpf.h                           |   1 +
>  include/linux/filter.h                        |   2 +
>  include/net/netns/ipv4.h                      |   1 +
>  include/net/request_sock.h                    |   2 +
>  include/net/sock_reuseport.h                  |   9 +-
>  include/uapi/linux/bpf.h                      |  16 +
>  kernel/bpf/syscall.c                          |  13 +
>  net/core/filter.c                             |  23 +-
>  net/core/request_sock.c                       |  38 ++
>  net/core/sock_reuseport.c                     | 337 ++++++++++--
>  net/ipv4/inet_connection_sock.c               | 147 +++++-
>  net/ipv4/inet_hashtables.c                    |   2 +-
>  net/ipv4/sysctl_net_ipv4.c                    |   9 +
>  net/ipv4/tcp_ipv4.c                           |  20 +-
>  net/ipv6/tcp_ipv6.c                           |  14 +-
>  tools/include/uapi/linux/bpf.h                |  16 +
>  tools/lib/bpf/libbpf.c                        |   5 +-
>  tools/testing/selftests/bpf/network_helpers.c |   2 +-
>  tools/testing/selftests/bpf/network_helpers.h |   1 +
>  .../bpf/prog_tests/migrate_reuseport.c        | 484 ++++++++++++++++++
>  .../bpf/progs/test_migrate_reuseport.c        |  51 ++
>  22 files changed, 1151 insertions(+), 62 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
> 

[-- Attachment #2: unix-persist.patch --]
[-- Type: text/x-patch, Size: 9910 bytes --]

commit 834f8a8dca1f508a67dbb36422549901a6df62fc
Author: Jason Baron <jbaron@akamai.com>
Date:   Wed May 24 01:32:33 2017 +0000

    delay close

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index fd60ecc..739cddf 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -27,6 +27,15 @@ struct unix_address {
 	struct sockaddr_un name[0];
 };
 
+struct unix_persist {
+        struct path             path;
+        struct list_head        link;
+        /* queue all packets on here */
+        struct sk_buff_head     receive_queue;
+        struct delayed_work     dw;
+	int			delay;
+};
+
 struct unix_skb_parms {
 	struct pid		*pid;		/* Skb credentials	*/
 	kuid_t			uid;
@@ -63,6 +72,7 @@ struct unix_sock {
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
 	wait_queue_t		peer_wake;
+	struct unix_persist	*persist;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 6a7fe76..18a7924 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -124,7 +124,8 @@ EXPORT_SYMBOL_GPL(unix_socket_table);
 DEFINE_SPINLOCK(unix_table_lock);
 EXPORT_SYMBOL_GPL(unix_table_lock);
 static atomic_long_t unix_nr_socks;
-
+LIST_HEAD(unix_persist_head);
+DEFINE_SPINLOCK(unix_persist_lock);
 
 static struct hlist_head *unix_sockets_unbound(void *addr)
 {
@@ -508,6 +509,27 @@ static void unix_sock_destructor(struct sock *sk)
 #endif
 }
 
+static void unix_persist_delayed_work(struct work_struct *work)
+{
+	struct delayed_work *delay = to_delayed_work(work);
+	struct unix_persist *persist = container_of(delay, struct unix_persist, dw);
+	bool del = false;
+
+	spin_lock(&unix_persist_lock);
+	if (!list_empty(&persist->link)) {
+		del = true;
+		list_del_init(&persist->link);
+	}
+	spin_unlock(&unix_persist_lock);
+
+	if (!del)
+		return;
+	
+	skb_queue_purge(&persist->receive_queue);		
+	path_put(&persist->path);
+	kfree(persist);
+}
+
 static void unix_release_sock(struct sock *sk, int embrion)
 {
 	struct unix_sock *u = unix_sk(sk);
@@ -515,6 +537,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	struct sock *skpair;
 	struct sk_buff *skb;
 	int state;
+	struct unix_persist *persist;
+	bool do_persist = false;
 
 	unix_remove_socket(sk);
 
@@ -550,12 +574,29 @@ static void unix_release_sock(struct sock *sk, int embrion)
 		unix_peer(sk) = NULL;
 	}
 
-	/* Try to flush out this socket. Throw out buffers at least */
+	persist = u->persist;
+	if (persist) {
+		if (persist->delay && path.dentry) {
+			do_persist = true;
+			path_get(&path);
+			persist->path = path;
+			skb_queue_head_init(&persist->receive_queue);
+			INIT_DELAYED_WORK(&persist->dw, unix_persist_delayed_work);
+			schedule_delayed_work(&persist->dw, msecs_to_jiffies(persist->delay));
+		} else
+			kfree(persist);
+	} 
 
+	/* Try to flush out this socket. Throw out buffers at least */
 	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		/* FIXME: persists anything special for listen ??? */
 		if (state == TCP_LISTEN)
 			unix_release_sock(skb->sk, 1);
 		/* passed fds are erased in the kfree_skb hook	      */
+		if (do_persist) {
+			skb_queue_tail(&persist->receive_queue, skb);
+			continue;
+		}
 		UNIXCB(skb).consumed = skb->len;
 		kfree_skb(skb);
 	}
@@ -563,6 +604,12 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	if (path.dentry)
 		path_put(&path);
 
+	if (do_persist) {
+		spin_lock(&unix_persist_lock);
+		list_add(&u->persist->link, &unix_persist_head);
+		spin_unlock(&unix_persist_lock);
+	}
+
 	sock_put(sk);
 
 	/* ---- Socket is dead now and most probably destroyed ---- */
@@ -671,6 +718,93 @@ static int unix_set_peek_off(struct sock *sk, int val)
 	return 0;
 }
 
+#define UNIX_DELAY_CLOSE 1
+#define UNIX_REBIND 2
+#define SOL_UNIX 5
+
+int unix_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+	struct unix_sock *u = unix_sk(sk);
+	int val;
+	int err = 0;
+
+	/* FIXME lock sock for allocations */
+
+	if (level != SOL_UNIX) {
+		/* FIXME: check return */
+		return -ENOPROTOOPT;
+	}
+	
+	if (optlen < sizeof(int))
+		return -EINVAL;
+
+	if (get_user(val, (int __user *)optval))
+		return -EFAULT;
+
+	switch(optname) {
+	case UNIX_DELAY_CLOSE: {
+		struct unix_persist *tmp;
+		/* limit to 1 minute? */
+		if (val <= 0 || val > 60000)
+			err = -EINVAL;
+		printk("set delay: %d\n", val);
+		if (!u->persist) {
+			tmp = kmalloc(sizeof(struct unix_persist), GFP_KERNEL);
+			if (!tmp) {
+				err = -ENOMEM;
+				break;
+			}
+			if (cmpxchg(&u->persist, NULL, tmp))
+				kfree(tmp);
+		}
+		u->persist->delay = val;
+		break;
+	}
+	default:
+		err = -ENOPROTOOPT;
+		break;
+	}
+	return err;
+}
+
+int unix_getsockopt(struct socket *sock, int level, int optname,
+		    	  char __user *optval, int __user *optlen)
+
+{
+	struct sock *sk = sock->sk;
+	struct unix_sock *u = unix_sk(sk);
+	int val, len;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	
+	if (len < 0)
+		return -EINVAL;
+
+	switch(optname) {
+	case UNIX_DELAY_CLOSE:
+		if (u->persist)
+			val = u->persist->delay;
+		else
+			val = 0;
+		break;
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	if (len > sizeof(int))
+		len = sizeof(int);
+
+	if (copy_to_user(optval, &val, len))
+		return -EFAULT;
+
+	if (put_user(len, optlen))
+		return -EFAULT;
+
+	return 0;
+}
 
 static const struct proto_ops unix_stream_ops = {
 	.family =	PF_UNIX,
@@ -685,8 +819,8 @@ static const struct proto_ops unix_stream_ops = {
 	.ioctl =	unix_ioctl,
 	.listen =	unix_listen,
 	.shutdown =	unix_shutdown,
-	.setsockopt =	sock_no_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.setsockopt =	unix_setsockopt,
+	.getsockopt =	unix_getsockopt,
 	.sendmsg =	unix_stream_sendmsg,
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
@@ -708,8 +842,8 @@ static const struct proto_ops unix_dgram_ops = {
 	.ioctl =	unix_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	unix_shutdown,
-	.setsockopt =	sock_no_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.setsockopt =	unix_setsockopt,
+	.getsockopt =	unix_getsockopt,
 	.sendmsg =	unix_dgram_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
@@ -730,8 +864,8 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.ioctl =	unix_ioctl,
 	.listen =	unix_listen,
 	.shutdown =	unix_shutdown,
-	.setsockopt =	sock_no_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.setsockopt =	unix_setsockopt,
+	.getsockopt =	unix_getsockopt,
 	.sendmsg =	unix_seqpacket_sendmsg,
 	.recvmsg =	unix_seqpacket_recvmsg,
 	.mmap =		sock_no_mmap,
@@ -985,6 +1119,64 @@ static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
 	return err;
 }
 
+static struct unix_persist *unix_get_persist(struct net *net,
+					struct sockaddr_un *sunname, int len,
+					int type, unsigned int hash)
+{
+	struct inode *inode;
+	struct unix_persist *entry, *res = NULL;
+	struct path path;
+	int err = 0;
+		
+	printk("unix_get_persist: enter\n");
+
+	if (!sunname->sun_path[0]) {
+		printk("unix_get_persist 1\n");
+		return NULL;
+	}
+
+	err = kern_path(sunname->sun_path, LOOKUP_FOLLOW, &path);
+	if (err) {
+		printk("unix_get_persist 2: %s %d\n", sunname->sun_path, err);
+		return NULL;
+	}
+	inode = d_backing_inode(path.dentry);
+	err = inode_permission(inode, MAY_WRITE);
+	if (err) {
+		printk("unix_get_persist 3\n");
+		goto out;
+	}
+	
+	if (!S_ISSOCK(inode->i_mode)) {
+		printk("unix_get_persist 4\n");
+		spin_lock(&unix_persist_lock);
+		list_for_each_entry(entry, &unix_persist_head, link) {
+			struct dentry *dentry = entry->path.dentry;	
+
+			printk("unix_get_persist: %p %p\n", d_backing_inode(dentry), inode);
+		}
+		spin_unlock(&unix_persist_lock);
+		goto out;
+	}
+
+	spin_lock(&unix_persist_lock);
+	list_for_each_entry(entry, &unix_persist_head, link) {
+		struct dentry *dentry = entry->path.dentry;	
+
+		printk("unix_get_delayed_close: %p %p\n", d_backing_inode(dentry), inode);
+		if (dentry && d_backing_inode(dentry) == inode) {
+			list_del_init(&entry->link);
+			res = entry;
+			break;
+		}
+	}
+	spin_unlock(&unix_persist_lock);
+
+out:
+	path_put(&path);
+	return res;
+}
+
 static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
 	struct sock *sk = sock->sk;
@@ -997,6 +1189,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	struct unix_address *addr;
 	struct hlist_head *list;
 	struct path path = { };
+	
 
 	err = -EINVAL;
 	if (sunaddr->sun_family != AF_UNIX)
@@ -1013,13 +1206,38 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	addr_len = err;
 
 	if (sun_path[0]) {
-		umode_t mode = S_IFSOCK |
-		       (SOCK_INODE(sock)->i_mode & ~current_umask());
-		err = unix_mknod(sun_path, mode, &path);
-		if (err) {
-			if (err == -EEXIST)
-				err = -EADDRINUSE;
-			goto out;
+		/* first check if we can bind to existing socket */
+		struct unix_persist *persist = unix_get_persist(net,
+					sunaddr, addr_len,
+					sock->type, hash); 
+		if (persist) {
+			struct sk_buff *skb = NULL;
+
+			cancel_delayed_work(&persist->dw);
+			unix_state_lock(sk);
+			if (skb_queue_len(&sk->sk_receive_queue) + skb_queue_len(&persist->receive_queue) > sk->sk_max_ack_backlog) {
+				unix_state_unlock(sk);
+				skb_queue_purge(&persist->receive_queue);
+				kfree(persist);
+				goto out;
+			}
+			while ((skb = skb_dequeue(&persist->receive_queue)) != NULL) {
+				skb_queue_tail(&sk->sk_receive_queue, skb);
+			}
+			unix_state_unlock(sk);
+			path = persist->path;
+			kfree(persist);
+			/* FIXME: test if queue is not empty */
+			sk->sk_data_ready(sk);
+		} else {
+			umode_t mode = S_IFSOCK |
+		       		(SOCK_INODE(sock)->i_mode & ~current_umask());
+			err = unix_mknod(sun_path, mode, &path);
+			if (err) {
+				if (err == -EEXIST)
+					err = -EADDRINUSE;
+				goto out;
+			}
 		}
 	}
 


* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2021-04-27 16:38 ` [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Jason Baron
@ 2021-04-27 21:55 ` Maciej Żenczykowski
  2021-04-27 22:00   ` Maciej Żenczykowski
  12 siblings, 1 reply; 29+ messages in thread
From: Maciej Żenczykowski @ 2021-04-27 21:55 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Benjamin Herrenschmidt, Kuniyuki Iwashima,
	BPF Mailing List, Linux NetDev, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
>
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.

This is IMHO a userspace bug.

You should never be closing or creating new SO_REUSEPORT sockets on a
running server (listening port).

There are at least 3 ways to accomplish this.

One involves a shim parent process that takes care of creating the
sockets (without close-on-exec),
then fork-exec's the actual server process[es] (which will use the
already opened listening fds),
and can thus re-fork-exec a new child while using the same set of sockets.
Here the old server can terminate before the new one starts.

(one could even envision systemd being modified to support this...)
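
For illustration, the shim's re-spawn step might look roughly like this
(untested sketch; passing the fd number via argv is just one possible
convention):

#include <stdio.h>
#include <unistd.h>

/* listen_fd was created without close-on-exec, so it survives exec */
static void spawn_server(int listen_fd, const char *server_path)
{
	if (fork() == 0) {
		char fdarg[16];

		snprintf(fdarg, sizeof(fdarg), "%d", listen_fd);
		execl(server_path, server_path, fdarg, (char *)NULL);
		_exit(1);	/* only reached if exec failed */
	}
}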

The second involves the old running server fork-execing the new server
and handing off the non-CLOEXEC sockets that way.

The third approach involves unix fd passing of sockets to hand off the
listening sockets from the old process/thread(s) to the new
process/thread(s).  Once handed off the old server can stop accept'ing
on the listening sockets and close them (the real copies are in the
child), finish processing any still active connections (or time them
out) and terminate.
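
For illustration, the sending side of that handoff might look roughly
like this (untested sketch; assumes the unix socket is already
connected to the new process, error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* pass one listening fd over a connected unix domain socket */
static int send_listen_fd(int unix_sock, int listen_fd)
{
	char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
	char dummy = '\0';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

	return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}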

Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
count), nor closing the final copy of a given socket.

This is basically the same thing that was needed not to lose incoming
connections in a pre-SO_REUSEPORT world.
(no, SO_REUSEADDR by itself doesn't prevent an incoming SYN from
triggering a RST during the server restart; it just makes the window
when RSTs happen shorter)

From day one (I reported to Tom and worked with him on the
very initial distribution function) this was envisioned to work like
this, and we (Google) have always used it with unix fd handoff to
support transparent restart.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-27 21:55 ` Maciej Żenczykowski
@ 2021-04-27 22:00   ` Maciej Żenczykowski
  2021-04-28  8:18     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 29+ messages in thread
From: Maciej Żenczykowski @ 2021-04-27 22:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Benjamin Herrenschmidt, Kuniyuki Iwashima,
	BPF Mailing List, Linux NetDev, Linux Kernel Mailing List

On Tue, Apr 27, 2021 at 2:55 PM Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> >
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
>
> This is IMHO a userspace bug.
>
> You should never be closing or creating new SO_REUSEPORT sockets on a
> running server (listening port).
>
> There's at least 3 ways to accomplish this.
>
> One involves a shim parent process that takes care of creating the
> sockets (without close-on-exec),
> then fork-exec's the actual server process[es] (which will use the
> already opened listening fds),
> and can thus re-fork-exec a new child while using the same set of sockets.
> Here the old server can terminate before the new one starts.
>
> (one could even envision systemd being modified to support this...)
>
> The second involves the old running server fork-execing the new server
> and handing off the non-CLOEXEC sockets that way.

(this doesn't even need to be fork-exec -- it can just be exec -- and is
potentially easier)

> The third approach involves unix fd passing of sockets to hand off the
> listening sockets from the old process/thread(s) to the new
> process/thread(s).  Once handed off the old server can stop accept'ing
> on the listening sockets and close them (the real copies are in the
> child), finish processing any still active connections (or time them

(this doesn't actually need to be a child, it can be an entirely new
parallel instance of the server,
potentially running in an entirely new container/cgroup setup, though
in the same network namespace)

> out) and terminate.
>
> Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
> count), nor closing the final copy of a given socket.
>
> This is basically the same thing that was needed not to lose incoming
> connections in a pre-SO_REUSEPORT world.
> (no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
> triggering a RST during the server restart, it just makes the window
> when RSTs happen shorter)
>
> This was from day one (I reported to Tom and worked with him on the
> very initial distribution function) envisioned to work like this,
> and we (Google) have always used it with unix fd handoff to support
> transparent restart.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-27 16:38 ` [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Jason Baron
@ 2021-04-28  1:27   ` Martin KaFai Lau
  2021-04-28 14:18     ` Eric Dumazet
  2021-04-28  8:13   ` Kuniyuki Iwashima
  1 sibling, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2021-04-28  1:27 UTC (permalink / raw)
  To: Jason Baron
  Cc: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf,
	netdev, linux-kernel

On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
> 
> 
> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
> 
> Hi Kuniyuki,
> 
> I had implemented a different approach to this that I wanted to get your
> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> that you want to replace with a 'new' webserver, you would need a separate
> process to receive the listen fd and then have that process send the fd to
> the new webserver, if they are not running con-currently. So instead what
> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> 
> 1) bind unix socket with path '/sockets'
> 2) sendmsg() the listen fd via the unix socket
> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> 3) exit/close the old webserver and the listen socket
> 4) start the new webserver
> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> 6) recvmsg() the listen fd
> 
> So the idea is that we set a timeout on the unix socket. If the new process
> does not start and bind to the unix socket, it simply closes, thus releasing
> the listen socket. However, if it does bind it can now call recvmsg() and
> use the listen fd as normal. It can then simply continue to use the old listen
> fds and/or create new ones and drain the old ones.
> 
> Thus, the old and new webservers do not have to run concurrently. This doesn't
> involve any changes to the tcp layer and can be used to pass any type of fd.
> not sure if it's actually useful for anything else though.
We also used to do tcp-listen(/udp) fd transfer because the new process could
not bind to the same IP:PORT on old kernels without SO_REUSEPORT.  Some of the
services listen to many different IP:PORT(s).  Transferring all of them
was ok-ish, but the old and new process do not necessarily listen to the same
set of IP:PORT(s) (e.g. the config may have changed during restart), and it
further complicates the fd transfer logic in the userspace.

It was then moved to SO_REUSEPORT.  The new process can create its listen fds
without depending on the old process.  It pretty much starts as if there is
no old process.  There is no need to transfer the fds, which simplified the
userspace logic.  The old and new process can work independently.  The old
and new process still run concurrently for a brief time period to avoid
service disruption.
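
For reference, a minimal sketch of what the new process does under
SO_REUSEPORT (IPv4, INADDR_ANY and the backlog value are placeholders;
error handling mostly omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* create an independent listener on the same port; the old process
 * keeps its own listeners until it drains and exits
 */
static int reuseport_listen(unsigned short port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 128) < 0)
		return -1;

	return fd;
}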

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-27 16:38 ` [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Jason Baron
  2021-04-28  1:27   ` Martin KaFai Lau
@ 2021-04-28  8:13   ` Kuniyuki Iwashima
  2021-04-28 14:44     ` Jason Baron
  1 sibling, 1 reply; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-28  8:13 UTC (permalink / raw)
  To: jbaron
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, kafai, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Jason Baron <jbaron@akamai.com>
Date:   Tue, 27 Apr 2021 12:38:58 -0400
> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
> 
> Hi Kuniyuki,
> 
> I had implemented a different approach to this that I wanted to get your
> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> that you want to replace with a 'new' webserver, you would need a separate
> process to receive the listen fd and then have that process send the fd to
> the new webserver, if they are not running con-currently. So instead what
> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> 
> 1) bind unix socket with path '/sockets'
> 2) sendmsg() the listen fd via the unix socket
> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> 3) exit/close the old webserver and the listen socket
> 4) start the new webserver
> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> 6) recvmsg() the listen fd
> 
> So the idea is that we set a timeout on the unix socket. If the new process
> does not start and bind to the unix socket, it simply closes, thus releasing
> the listen socket. However, if it does bind it can now call recvmsg() and
> use the listen fd as normal. It can then simply continue to use the old listen
> fds and/or create new ones and drain the old ones.
> 
> Thus, the old and new webservers do not have to run concurrently. This doesn't
> involve any changes to the tcp layer and can be used to pass any type of fd.
> not sure if it's actually useful for anything else though.
> 
> I'm not sure if this solves your use-case or not but I thought I'd share it.
> One can also inherit the fds like in systemd's socket activation model, but
> that again requires another process to hold open the listen fd.

Thank you for sharing the code.

It seems a bit more crash-tolerant than normal fd passing, but it can still
suffer if the process dies before passing the fds. With this patch set, we
can migrate child sockets even if the process dies.

Also, as Martin said, fd passing tends to make the application complicated.

If we do not mind these points, your approach could be an option.


> 
> I have a very rough patch (emphasis on rough), that implements this idea that
> I'm attaching below to explain it better. It would need a bunch of fixups and
> it's against an older kernel, but hopefully gives this direction a better
> explanation.
> 
> Thanks,
> 
> -Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-27 22:00   ` Maciej Żenczykowski
@ 2021-04-28  8:18     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-28  8:18 UTC (permalink / raw)
  To: zenczykowski
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, kafai, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Maciej Żenczykowski <zenczykowski@gmail.com>
Date:   Tue, 27 Apr 2021 15:00:12 -0700
> On Tue, Apr 27, 2021 at 2:55 PM Maciej Żenczykowski
> <zenczykowski@gmail.com> wrote:
> > On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation [1]. When a SYN packet is received, the connection is tied
> > > to a listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners on the same port could accept
> > > such connections.
> > >
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the new
> > > configuration and closes all listeners on the old workers, resulting in the
> > > in-flight ACK of 3WHS is responded by RST.
> >
> > This is IMHO a userspace bug.

I do not think so.

If the kernel selected another listener for incoming connections, they
could be accept()ed. There is no room for userspace to change the behaviour
without an in-kernel tool such as eBPF. A feature that can cause failure
stochastically due to kernel behaviour cannot be a userspace bug.


> >
> > You should never be closing or creating new SO_REUSEPORT sockets on a
> > running server (listening port).
> >
> > There's at least 3 ways to accomplish this.
> >
> > One involves a shim parent process that takes care of creating the
> > sockets (without close-on-exec),
> > then fork-exec's the actual server process[es] (which will use the
> > already opened listening fds),
> > and can thus re-fork-exec a new child while using the same set of sockets.
> > Here the old server can terminate before the new one starts.
> >
> > (one could even envision systemd being modified to support this...)
> >
> > The second involves the old running server fork-execing the new server
> > and handing off the non-CLOEXEC sockets that way.
> 
> (this doesn't even need to be fork-exec -- can just be exec -- and is
> potentially easier)
> 
> > The third approach involves unix fd passing of sockets to hand off the
> > listening sockets from the old process/thread(s) to the new
> > process/thread(s).  Once handed off the old server can stop accept'ing
> > on the listening sockets and close them (the real copies are in the
> > child), finish processing any still active connections (or time them
> 
> (this doesn't actually need to be a child, in can be an entirely new
> parallel instance of the server,
> potentially running in an entirely new container/cgroup setup, though
> in the same network namespace)
> 
> > out) and terminate.
> >
> > Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
> > count), nor closing the final copy of a given socket.

Indeed, each approach can be an option, but each makes the application more
complicated. Also, if the process holding the listener fd dies, there
could be downtime.

I do not think every approach works well everywhere for everyone.


> >
> > This is basically the same thing that was needed not to lose incoming
> > connections in a pre-SO_REUSEPORT world.
> > (no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
> > triggering a RST during the server restart, it just makes the window
> > when RSTs happen shorter)

SO_REUSEPORT makes each process/listener independent, and we need not pass
fds, so it makes the application much simpler. Even with SO_REUSEPORT, one
listener might crash, but that is more tolerable than losing all connections
at once.

To enjoy such merits, isn't it natural to improve the existing feature in
this post-SO_REUSEPORT world?


> >
> > This was from day one (I reported to Tom and worked with him on the
> > very initial distribution function) envisioned to work like this,
> > and we (Google) have always used it with unix fd handoff to support
> > transparent restart.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-28  1:27   ` Martin KaFai Lau
@ 2021-04-28 14:18     ` Eric Dumazet
  2021-04-28 15:49       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2021-04-28 14:18 UTC (permalink / raw)
  To: Martin KaFai Lau, Jason Baron
  Cc: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf,
	netdev, linux-kernel



On 4/28/21 3:27 AM, Martin KaFai Lau wrote:
> On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
>>
>>
>> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
>>> The SO_REUSEPORT option allows sockets to listen on the same port and to
>>> accept connections evenly. However, there is a defect in the current
>>> implementation [1]. When a SYN packet is received, the connection is tied
>>> to a listening socket. Accordingly, when the listener is closed, in-flight
>>> requests during the three-way handshake and child sockets in the accept
>>> queue are dropped even if other listeners on the same port could accept
>>> such connections.
>>>
>>> This situation can happen when various server management tools restart
>>> server (such as nginx) processes. For instance, when we change nginx
>>> configurations and restart it, it spins up new workers that respect the new
>>> configuration and closes all listeners on the old workers, resulting in the
>>> in-flight ACK of 3WHS is responded by RST.
>>
>> Hi Kuniyuki,
>>
>> I had implemented a different approach to this that I wanted to get your
>> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
>> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
>> that you want to replace with a 'new' webserver, you would need a separate
>> process to receive the listen fd and then have that process send the fd to
>> the new webserver, if they are not running con-currently. So instead what
>> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
>>
>> 1) bind unix socket with path '/sockets'
>> 2) sendmsg() the listen fd via the unix socket
>> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
>> 3) exit/close the old webserver and the listen socket
>> 4) start the new webserver
>> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
>> 6) recvmsg() the listen fd
>>
>> So the idea is that we set a timeout on the unix socket. If the new process
>> does not start and bind to the unix socket, it simply closes, thus releasing
>> the listen socket. However, if it does bind it can now call recvmsg() and
>> use the listen fd as normal. It can then simply continue to use the old listen
>> fds and/or create new ones and drain the old ones.
>>
>> Thus, the old and new webservers do not have to run concurrently. This doesn't
>> involve any changes to the tcp layer and can be used to pass any type of fd.
>> not sure if it's actually useful for anything else though.
> We also used to do tcp-listen(/udp) fd transfer because the new process can not
> bind to the same IP:PORT in the old kernel without SO_REUSEPORT.  Some of the
> services listen to many different IP:PORT(s).  Transferring all of them
> was ok-ish but the old and new process do not necessary listen to the same set
> of IP:PORT(s) (e.g. the config may have changed during restart) and it further
> complicates the fd transfer logic in the userspace.
> 
> It was then moved to SO_REUSEPORT.  The new process can create its listen fds
> without depending on the old process.  It pretty much starts as if there is
> no old process.  There is no need to transfer the fds, simplified the userspace
> logic.  The old and new process can work independently.  The old and new process
> still run concurrently for a brief time period to avoid service disruption.
> 


Note that another technique is to force syncookies during the switch of old/new servers.

echo 2 >/proc/sys/net/ipv4/tcp_syncookies

If there is interest, we could add a socket option to override the sysctl on a per-socket basis.
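
For illustration, a minimal sketch of flipping that knob from a
management process around the restart window (same sysctl as the echo
above; error handling kept minimal):

#include <fcntl.h>
#include <unistd.h>

/* val is '0', '1' or '2', with the same meaning as the sysctl */
static int set_tcp_syncookies(char val)
{
	int fd = open("/proc/sys/net/ipv4/tcp_syncookies", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, &val, 1);
	close(fd);
	return n == 1 ? 0 : -1;
}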


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-28  8:13   ` Kuniyuki Iwashima
@ 2021-04-28 14:44     ` Jason Baron
  2021-04-28 15:52       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Baron @ 2021-04-28 14:44 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, kafai, kuba,
	kuni1840, linux-kernel, netdev



On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> From:   Jason Baron <jbaron@akamai.com>
> Date:   Tue, 27 Apr 2021 12:38:58 -0400
>> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
>>> The SO_REUSEPORT option allows sockets to listen on the same port and to
>>> accept connections evenly. However, there is a defect in the current
>>> implementation [1]. When a SYN packet is received, the connection is tied
>>> to a listening socket. Accordingly, when the listener is closed, in-flight
>>> requests during the three-way handshake and child sockets in the accept
>>> queue are dropped even if other listeners on the same port could accept
>>> such connections.
>>>
>>> This situation can happen when various server management tools restart
>>> server (such as nginx) processes. For instance, when we change nginx
>>> configurations and restart it, it spins up new workers that respect the new
>>> configuration and closes all listeners on the old workers, resulting in the
>>> in-flight ACK of 3WHS is responded by RST.
>>
>> Hi Kuniyuki,
>>
>> I had implemented a different approach to this that I wanted to get your
>> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
>> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
>> that you want to replace with a 'new' webserver, you would need a separate
>> process to receive the listen fd and then have that process send the fd to
>> the new webserver, if they are not running con-currently. So instead what
>> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
>>
>> 1) bind unix socket with path '/sockets'
>> 2) sendmsg() the listen fd via the unix socket
>> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
>> 3) exit/close the old webserver and the listen socket
>> 4) start the new webserver
>> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
>> 6) recvmsg() the listen fd
>>
>> So the idea is that we set a timeout on the unix socket. If the new process
>> does not start and bind to the unix socket, it simply closes, thus releasing
>> the listen socket. However, if it does bind it can now call recvmsg() and
>> use the listen fd as normal. It can then simply continue to use the old listen
>> fds and/or create new ones and drain the old ones.
>>
>> Thus, the old and new webservers do not have to run concurrently. This doesn't
>> involve any changes to the tcp layer and can be used to pass any type of fd.
>> not sure if it's actually useful for anything else though.
>>
>> I'm not sure if this solves your use-case or not but I thought I'd share it.
>> One can also inherit the fds like in systemd's socket activation model, but
>> that again requires another process to hold open the listen fd.
> 
> Thank you for sharing code.
> 
> It seems bit more crash-tolerant than normal fd passing, but it can still
> suffer if the process dies before passing fds. With this patch set, we can
> migrate children sockets even if the process dies.
> 

I don't think crashing should be much of an issue. The old server can set up
the unix socket path '/sockets' when it starts up and queue the listen
sockets there from the start. When it dies it will close all its fds, and the
new server can pick up any fds that are in the '/sockets' queue.


> Also, as Martin said, fd passing tends to make application complicated.
> 

It may be, but perhaps it's more flexible? It gives the new server the
chance to re-use the existing listen fds, close, drain and/or start new
ones. It also addresses the non-REUSEPORT case where you can't bind right
away.

Thanks,

-Jason

> If we do not mind these points, your approach could be an option.
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-28 14:18     ` Eric Dumazet
@ 2021-04-28 15:49       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-28 15:49 UTC (permalink / raw)
  To: eric.dumazet
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, jbaron, kafai,
	kuba, kuni1840, kuniyu, linux-kernel, netdev

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed, 28 Apr 2021 16:18:30 +0200
> On 4/28/21 3:27 AM, Martin KaFai Lau wrote:
> > On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
> >>
> >>
> >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> >>> accept connections evenly. However, there is a defect in the current
> >>> implementation [1]. When a SYN packet is received, the connection is tied
> >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> >>> requests during the three-way handshake and child sockets in the accept
> >>> queue are dropped even if other listeners on the same port could accept
> >>> such connections.
> >>>
> >>> This situation can happen when various server management tools restart
> >>> server (such as nginx) processes. For instance, when we change nginx
> >>> configurations and restart it, it spins up new workers that respect the new
> >>> configuration and closes all listeners on the old workers, resulting in the
> >>> in-flight ACK of 3WHS is responded by RST.
> >>
> >> Hi Kuniyuki,
> >>
> >> I had implemented a different approach to this that I wanted to get your
> >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> >> that you want to replace with a 'new' webserver, you would need a separate
> >> process to receive the listen fd and then have that process send the fd to
> >> the new webserver, if they are not running con-currently. So instead what
> >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> >>
> >> 1) bind unix socket with path '/sockets'
> >> 2) sendmsg() the listen fd via the unix socket
> >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> >> 3) exit/close the old webserver and the listen socket
> >> 4) start the new webserver
> >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> >> 6) recvmsg() the listen fd
> >>
> >> So the idea is that we set a timeout on the unix socket. If the new process
> >> does not start and bind to the unix socket, it simply closes, thus releasing
> >> the listen socket. However, if it does bind it can now call recvmsg() and
> >> use the listen fd as normal. It can then simply continue to use the old listen
> >> fds and/or create new ones and drain the old ones.
> >>
> >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> >> involve any changes to the tcp layer and can be used to pass any type of fd.
> >> not sure if it's actually useful for anything else though.
> > We also used to do tcp-listen(/udp) fd transfer because the new process can not
> > bind to the same IP:PORT in the old kernel without SO_REUSEPORT.  Some of the
> > services listen to many different IP:PORT(s).  Transferring all of them
> > was ok-ish but the old and new process do not necessary listen to the same set
> > of IP:PORT(s) (e.g. the config may have changed during restart) and it further
> > complicates the fd transfer logic in the userspace.
> > 
> > It was then moved to SO_REUSEPORT.  The new process can create its listen fds
> > without depending on the old process.  It pretty much starts as if there is
> > no old process.  There is no need to transfer the fds, simplified the userspace
> > logic.  The old and new process can work independently.  The old and new process
> > still run concurrently for a brief time period to avoid service disruption.
> > 
> 
> 
> Note that another technique is to force syncookies during the switch of old/new servers.
> 
> echo 2 >/proc/sys/net/ipv4/tcp_syncookies
> 
> If there is interest, we could add a socket option to override the sysctl on a per-socket basis.

It can be a workaround, but syncookies have their own downsides. Forcing
them may lose some valuable TCP options. If there is an approach that
avoids syncookies, it is better.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-28 14:44     ` Jason Baron
@ 2021-04-28 15:52       ` Kuniyuki Iwashima
  2021-04-28 16:33         ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-28 15:52 UTC (permalink / raw)
  To: jbaron
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, kafai, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Jason Baron <jbaron@akamai.com>
Date:   Wed, 28 Apr 2021 10:44:12 -0400
> On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > From:   Jason Baron <jbaron@akamai.com>
> > Date:   Tue, 27 Apr 2021 12:38:58 -0400
> >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> >>> accept connections evenly. However, there is a defect in the current
> >>> implementation [1]. When a SYN packet is received, the connection is tied
> >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> >>> requests during the three-way handshake and child sockets in the accept
> >>> queue are dropped even if other listeners on the same port could accept
> >>> such connections.
> >>>
> >>> This situation can happen when various server management tools restart
> >>> server (such as nginx) processes. For instance, when we change nginx
> >>> configurations and restart it, it spins up new workers that respect the new
> >>> configuration and closes all listeners on the old workers, resulting in the
> >>> in-flight ACK of 3WHS is responded by RST.
> >>
> >> Hi Kuniyuki,
> >>
> >> I had implemented a different approach to this that I wanted to get your
> >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> >> that you want to replace with a 'new' webserver, you would need a separate
> >> process to receive the listen fd and then have that process send the fd to
> >> the new webserver, if they are not running con-currently. So instead what
> >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> >>
> >> 1) bind unix socket with path '/sockets'
> >> 2) sendmsg() the listen fd via the unix socket
> >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> >> 3) exit/close the old webserver and the listen socket
> >> 4) start the new webserver
> >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> >> 6) recvmsg() the listen fd
> >>
> >> So the idea is that we set a timeout on the unix socket. If the new process
> >> does not start and bind to the unix socket, it simply closes, thus releasing
> >> the listen socket. However, if it does bind it can now call recvmsg() and
> >> use the listen fd as normal. It can then simply continue to use the old listen
> >> fds and/or create new ones and drain the old ones.
> >>
> >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> >> involve any changes to the tcp layer and can be used to pass any type of fd.
> >> not sure if it's actually useful for anything else though.
> >>
> >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> >> One can also inherit the fds like in systemd's socket activation model, but
> >> that again requires another process to hold open the listen fd.
> > 
> > Thank you for sharing code.
> > 
> > It seems bit more crash-tolerant than normal fd passing, but it can still
> > suffer if the process dies before passing fds. With this patch set, we can
> > migrate children sockets even if the process dies.
> > 
> 
> I don't think crashing should be much of an issue. The old server can setup the
> unix socket patch '/sockets' when it starts up and queue the listen sockets
> there from the start. When it dies it will close all its fds, and the new
> server can pick anything up any fds that are in the '/sockets' queue.
> 
> 
> > Also, as Martin said, fd passing tends to make application complicated.
> > 
> 
> It may be but perhaps its more flexible? It gives the new server the
> chance to re-use the existing listen fds, close, drain and/or start new
> ones. It also addresses the non-REUSEPORT case where you can't bind right
> away.

If the flexibility were really worth the complexity, we would not mind it.
But SO_REUSEPORT can give us enough flexibility.

With socket migration, there is no need to reuse a listener (fd passing) or
drain children (incoming connections are automatically migrated if there is
already another listener bind()ed), and of course another listener can
close both itself and its migrated children.

If two different approaches resolve the same issue and one does not need
complexity in userspace, we select the simpler one.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-28 15:52       ` Kuniyuki Iwashima
@ 2021-04-28 16:33         ` Eric Dumazet
  2021-04-29  3:16           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2021-04-28 16:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Jason Baron, andrii, Alexei Starovoitov, Benjamin Herrenschmidt,
	bpf, Daniel Borkmann, David Miller, Martin KaFai Lau,
	Jakub Kicinski, Kuniyuki Iwashima, LKML, netdev

On Wed, Apr 28, 2021 at 5:52 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>
> From:   Jason Baron <jbaron@akamai.com>
> Date:   Wed, 28 Apr 2021 10:44:12 -0400
> > On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > > From:   Jason Baron <jbaron@akamai.com>
> > > Date:   Tue, 27 Apr 2021 12:38:58 -0400
> > >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> > >>> accept connections evenly. However, there is a defect in the current
> > >>> implementation [1]. When a SYN packet is received, the connection is tied
> > >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> > >>> requests during the three-way handshake and child sockets in the accept
> > >>> queue are dropped even if other listeners on the same port could accept
> > >>> such connections.
> > >>>
> > >>> This situation can happen when various server management tools restart
> > >>> server (such as nginx) processes. For instance, when we change nginx
> > >>> configurations and restart it, it spins up new workers that respect the new
> > >>> configuration and closes all listeners on the old workers, resulting in the
> > >>> in-flight ACK of 3WHS is responded by RST.
> > >>
> > >> Hi Kuniyuki,
> > >>
> > >> I had implemented a different approach to this that I wanted to get your
> > >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> > >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> > >> that you want to replace with a 'new' webserver, you would need a separate
> > >> process to receive the listen fd and then have that process send the fd to
> > >> the new webserver, if they are not running con-currently. So instead what
> > >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> > >>
> > >> 1) bind unix socket with path '/sockets'
> > >> 2) sendmsg() the listen fd via the unix socket
> > >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> > >> 3) exit/close the old webserver and the listen socket
> > >> 4) start the new webserver
> > >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> > >> 6) recvmsg() the listen fd
> > >>
> > >> So the idea is that we set a timeout on the unix socket. If the new process
> > >> does not start and bind to the unix socket, it simply closes, thus releasing
> > >> the listen socket. However, if it does bind it can now call recvmsg() and
> > >> use the listen fd as normal. It can then simply continue to use the old listen
> > >> fds and/or create new ones and drain the old ones.
> > >>
> > >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> > >> involve any changes to the tcp layer and can be used to pass any type of fd.
> > >> not sure if it's actually useful for anything else though.
> > >>
> > >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> > >> One can also inherit the fds like in systemd's socket activation model, but
> > >> that again requires another process to hold open the listen fd.
> > >
> > > Thank you for sharing code.
> > >
> > > It seems bit more crash-tolerant than normal fd passing, but it can still
> > > suffer if the process dies before passing fds. With this patch set, we can
> > > migrate children sockets even if the process dies.
> > >
> >
> > I don't think crashing should be much of an issue. The old server can setup the
> > unix socket patch '/sockets' when it starts up and queue the listen sockets
> > there from the start. When it dies it will close all its fds, and the new
> > server can pick anything up any fds that are in the '/sockets' queue.
> >
> >
> > > Also, as Martin said, fd passing tends to make application complicated.
> > >
> >
> > It may be but perhaps its more flexible? It gives the new server the
> > chance to re-use the existing listen fds, close, drain and/or start new
> > ones. It also addresses the non-REUSEPORT case where you can't bind right
> > away.
>
> If the flexibility is really worth the complexity, we do not care about it.
> But, SO_REUSEPORT can give enough flexibility we want.
>
> With socket migration, there is no need to reuse listener (fd passing),
> drain children (incoming connections are automatically migrated if there is
> already another listener bind()ed), and of course another listener can
> close itself and migrated children.
>
> If two different approaches resolves the same issue and one does not need
> complexity in userspace, we select the simpler one.

Kernel bloat and complexity is _not_ the simplest choice.

Touching a complex part of TCP stack is quite risky.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-28 16:33         ` Eric Dumazet
@ 2021-04-29  3:16           ` Kuniyuki Iwashima
  2021-05-05  6:54             ` Martin KaFai Lau
  0 siblings, 1 reply; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-04-29  3:16 UTC (permalink / raw)
  To: edumazet
  Cc: andrii, ast, benh, bpf, daniel, davem, jbaron, kafai, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Eric Dumazet <edumazet@google.com>
Date:   Wed, 28 Apr 2021 18:33:32 +0200
> On Wed, Apr 28, 2021 at 5:52 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > From:   Jason Baron <jbaron@akamai.com>
> > Date:   Wed, 28 Apr 2021 10:44:12 -0400
> > > On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > > > From:   Jason Baron <jbaron@akamai.com>
> > > > Date:   Tue, 27 Apr 2021 12:38:58 -0400
> > > >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > > >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > >>> accept connections evenly. However, there is a defect in the current
> > > >>> implementation [1]. When a SYN packet is received, the connection is tied
> > > >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> > > >>> requests during the three-way handshake and child sockets in the accept
> > > >>> queue are dropped even if other listeners on the same port could accept
> > > >>> such connections.
> > > >>>
> > > >>> This situation can happen when various server management tools restart
> > > >>> server (such as nginx) processes. For instance, when we change nginx
> > > >>> configurations and restart it, it spins up new workers that respect the new
> > > >>> configuration and closes all listeners on the old workers, resulting in the
> > > >>> in-flight ACK of 3WHS is responded by RST.
> > > >>
> > > >> Hi Kuniyuki,
> > > >>
> > > >> I had implemented a different approach to this that I wanted to get your
> > > >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> > > >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> > > >> that you want to replace with a 'new' webserver, you would need a separate
> > > >> process to receive the listen fd and then have that process send the fd to
> > > >> the new webserver, if they are not running con-currently. So instead what
> > > >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> > > >>
> > > >> 1) bind unix socket with path '/sockets'
> > > >> 2) sendmsg() the listen fd via the unix socket
> > > >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> > > >> 3) exit/close the old webserver and the listen socket
> > > >> 4) start the new webserver
> > > >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> > > >> 6) recvmsg() the listen fd
> > > >>
> > > >> So the idea is that we set a timeout on the unix socket. If the new process
> > > >> does not start and bind to the unix socket, it simply closes, thus releasing
> > > >> the listen socket. However, if it does bind it can now call recvmsg() and
> > > >> use the listen fd as normal. It can then simply continue to use the old listen
> > > >> fds and/or create new ones and drain the old ones.
> > > >>
> > > >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> > > >> involve any changes to the tcp layer and can be used to pass any type of fd.
> > > >> not sure if it's actually useful for anything else though.
> > > >>
> > > >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> > > >> One can also inherit the fds like in systemd's socket activation model, but
> > > >> that again requires another process to hold open the listen fd.
> > > >
> > > > Thank you for sharing code.
> > > >
> > > > It seems bit more crash-tolerant than normal fd passing, but it can still
> > > > suffer if the process dies before passing fds. With this patch set, we can
> > > > migrate children sockets even if the process dies.
> > > >
> > >
> > > I don't think crashing should be much of an issue. The old server can setup the
> > > unix socket patch '/sockets' when it starts up and queue the listen sockets
> > > there from the start. When it dies it will close all its fds, and the new
> > > server can pick anything up any fds that are in the '/sockets' queue.
> > >
> > >
> > > > Also, as Martin said, fd passing tends to make application complicated.
> > > >
> > >
> > > It may be but perhaps its more flexible? It gives the new server the
> > > chance to re-use the existing listen fds, close, drain and/or start new
> > > ones. It also addresses the non-REUSEPORT case where you can't bind right
> > > away.
> >
> > If the flexibility is really worth the complexity, we do not care about it.
> > But, SO_REUSEPORT can give enough flexibility we want.
> >
> > With socket migration, there is no need to reuse listener (fd passing),
> > drain children (incoming connections are automatically migrated if there is
> > already another listener bind()ed), and of course another listener can
> > close itself and migrated children.
> >
> > If two different approaches resolves the same issue and one does not need
> > complexity in userspace, we select the simpler one.
> 
> Kernel bloat and complexity is _not_ the simplest choice.
> 
> Touching a complex part of TCP stack is quite risky.

Yes, we understand that this is not a simple decision, and we understand
your concern. Many reviews are needed to see whether our approach is really
risky or not.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
  2021-04-27  3:46 ` [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs Kuniyuki Iwashima
@ 2021-05-05  4:56   ` Martin KaFai Lau
  2021-05-05 23:16     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2021-05-05  4:56 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf, netdev,
	linux-kernel

On Tue, Apr 27, 2021 at 12:46:18PM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> diff --git a/net/core/request_sock.c b/net/core/request_sock.c
> index 82cf9fbe2668..08c37ecd923b 100644
> --- a/net/core/request_sock.c
> +++ b/net/core/request_sock.c
> @@ -151,6 +151,7 @@ struct request_sock *reqsk_clone(struct request_sock *req, struct sock *sk)
>  	memcpy(&nreq_sk->sk_dontcopy_end, &req_sk->sk_dontcopy_end,
>  	       req->rsk_ops->obj_size - offsetof(struct sock, sk_dontcopy_end));
>  
> +	sk_node_init(&nreq_sk->sk_node);
This belongs to patch 5.
"rsk_refcnt" also needs to be 0 instead of staying uninitialized
after reqsk_clone() returns.

>  	nreq_sk->sk_tx_queue_mapping = req_sk->sk_tx_queue_mapping;
>  #ifdef CONFIG_XPS
>  	nreq_sk->sk_rx_queue_mapping = req_sk->sk_rx_queue_mapping;
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 851992405826..dc984d1f352e 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -695,10 +695,20 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req)
>  }
>  EXPORT_SYMBOL(inet_rtx_syn_ack);
>  
> +static void reqsk_queue_migrated(struct request_sock_queue *queue,
> +				 const struct request_sock *req)
> +{
> +	if (req->num_timeout == 0)
> +		atomic_inc(&queue->young);
> +	atomic_inc(&queue->qlen);
> +}
> +
>  static void reqsk_migrate_reset(struct request_sock *req)
>  {
> +	req->saved_syn = NULL;
> +	inet_rsk(req)->ireq_opt = NULL;
>  #if IS_ENABLED(CONFIG_IPV6)
> -	inet_rsk(req)->ipv6_opt = NULL;
> +	inet_rsk(req)->pktopts = NULL;
>  #endif
>  }
>  
> @@ -741,16 +751,37 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_drop_and_put);
>  
>  static void reqsk_timer_handler(struct timer_list *t)
>  {
> -	struct request_sock *req = from_timer(req, t, rsk_timer);
> -	struct sock *sk_listener = req->rsk_listener;
> -	struct net *net = sock_net(sk_listener);
> -	struct inet_connection_sock *icsk = inet_csk(sk_listener);
> -	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> +	struct request_sock *req = from_timer(req, t, rsk_timer), *nreq = NULL, *oreq = req;
nit. This line is too long.
Let's move the new "*nreq" and "*oreq" to a new line and keep the current
"*req" line as is:
	struct request_sock *req = from_timer(req, t, rsk_timer);
	struct request_sock *oreq = req, *nreq = NULL;

> +	struct sock *sk_listener = req->rsk_listener, *nsk = NULL;
"*nsk" can be moved into the following "!= TCP_LISTEN" case below.
Keep the current "*sk_listener" line as is.

> +	struct inet_connection_sock *icsk;
> +	struct request_sock_queue *queue;
> +	struct net *net;
>  	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
>  
> -	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> -		goto drop;
> +	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {

		struct sock *nsk;

> +		nsk = reuseport_migrate_sock(sk_listener, req_to_sk(req), NULL);
> +		if (!nsk)
> +			goto drop;
> +
> +		nreq = reqsk_clone(req, nsk);
> +		if (!nreq)
> +			goto drop;
> +
> +		/* The new timer for the cloned req can decrease the 2
> +		 * by calling inet_csk_reqsk_queue_drop_and_put(), so
> +		 * hold another count to prevent use-after-free and
> +		 * call reqsk_put() just before return.
> +		 */
> +		refcount_set(&nreq->rsk_refcnt, 2 + 1);
> +		timer_setup(&nreq->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
> +		reqsk_queue_migrated(&inet_csk(nsk)->icsk_accept_queue, req);
> +
> +		req = nreq;
> +		sk_listener = nsk;
> +	}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
  2021-04-27  3:46 ` [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
@ 2021-05-05  5:14   ` Martin KaFai Lau
  2021-05-05 23:19     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2021-05-05  5:14 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf, netdev,
	linux-kernel

On Tue, Apr 27, 2021 at 12:46:23PM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
> new file mode 100644
> index 000000000000..d7136dc29fa2
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Check if we can migrate child sockets.
> + *
> + *   1. If reuse_md->migrating_sk is NULL (SYN packet),
> + *        return SK_PASS without selecting a listener.
> + *   2. If reuse_md->migrating_sk is not NULL (socket migration),
> + *        select a listener (reuseport_map[migrate_map[cookie]])
> + *
> + * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> + */
> +
> +#include <stddef.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
> +	__uint(max_entries, 256);
> +	__type(key, int);
> +	__type(value, __u64);
> +} reuseport_map SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, 256);
> +	__type(key, __u64);
> +	__type(value, int);
> +} migrate_map SEC(".maps");
> +
> +SEC("sk_reuseport/migrate")
> +int prog_migrate_reuseport(struct sk_reuseport_md *reuse_md)
> +{
> +	int *key, flags = 0;
> +	__u64 cookie;
> +
> +	if (!reuse_md->migrating_sk)
> +		return SK_PASS;
> +

It will be useful to check whether it is migrating a child sk or
a reqsk by testing migrating_sk->state for TCP_ESTABLISHED
and TCP_NEW_SYN_RECV.  The skb can be further tested to check if it is
selecting for the final ACK.  Global variables can then be
incremented, and the user prog can check that, for example,
it is indeed testing the TCP_NEW_SYN_RECV code path...etc.

It will also become a good example for others on how migrating_sk
can be used.
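
An untested sketch of that idea, reusing the includes and maps of the
prog being reviewed (the counter names are made up, and it assumes
migrating_sk->state is readable as this set intends):

__u64 migrated_reqsk;	/* migrating_sk was TCP_NEW_SYN_RECV */
__u64 migrated_child;	/* migrating_sk was TCP_ESTABLISHED  */

SEC("sk_reuseport/migrate")
int prog_migrate_reuseport(struct sk_reuseport_md *reuse_md)
{
	if (!reuse_md->migrating_sk)
		return SK_PASS;		/* SYN packet: nothing is migrating */

	switch (reuse_md->migrating_sk->state) {
	case BPF_TCP_NEW_SYN_RECV:	/* request socket being migrated */
		__sync_fetch_and_add(&migrated_reqsk, 1);
		break;
	case BPF_TCP_ESTABLISHED:	/* accepted child being migrated */
		__sync_fetch_and_add(&migrated_child, 1);
		break;
	}

	/* then look up migrate_map and call bpf_sk_select_reuseport()
	 * exactly as in the quoted code that follows
	 */
	return SK_PASS;
}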


> +	cookie = bpf_get_socket_cookie(reuse_md->sk);
> +
> +	key = bpf_map_lookup_elem(&migrate_map, &cookie);
> +	if (!key)
> +		return SK_DROP;
> +
> +	bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags);
> +
> +	return SK_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
  2021-04-29  3:16           ` Kuniyuki Iwashima
@ 2021-05-05  6:54             ` Martin KaFai Lau
  0 siblings, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2021-05-05  6:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima, edumazet, jbaron
  Cc: andrii, ast, benh, bpf, daniel, davem, kuba, kuni1840,
	linux-kernel, netdev

On Thu, Apr 29, 2021 at 12:16:09PM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> > > > It may be but perhaps its more flexible? It gives the new server the
> > > > chance to re-use the existing listen fds, close, drain and/or start new
> > > > ones. It also addresses the non-REUSEPORT case where you can't bind right
> > > > away.

> > > If the flexibility is really worth the complexity, we do not care about it.
> > > But, SO_REUSEPORT can give enough flexibility we want.
> > >
> > > With socket migration, there is no need to reuse listener (fd passing),
> > > drain children (incoming connections are automatically migrated if there is
> > > already another listener bind()ed), and of course another listener can
> > > close itself and migrated children.
> > >
> > > If two different approaches resolves the same issue and one does not need
> > > complexity in userspace, we select the simpler one.

> > 
> > Kernel bloat and complexity is _not_ the simplest choice.
> > 
> > Touching a complex part of TCP stack is quite risky.

> 
> Yes, we understand that is not a simple decision and your concern. So many
> reviews are needed to see if our approach is really risky or not.

If fd passing is sufficient for a set of use cases, it is great.

However, it does not work well for everyone.  We are not saying
that SO_REUSEPORT (+ optional bpf) is better in all cases either.

After SO_REUSEPORT was added, some people moved from fd-passing
to SO_REUSEPORT instead and have one bpf policy to select the sk for both
TCP and UDP.

Since SO_REUSEPORT was first added, there have been multiple contributions
from different people and companies.  For example, first adding bpf
support to UDP, then to TCP, then a much more flexible way to select the sk
from reuseport_array, and then sock_map/sock_hash support.  That is another
perspective showing that people find it useful.  Each of the contributions
also changed the kernel code for practical use cases.

This set is an extension/improvement to address a gap in SO_REUSEPORT
when some of the sockets are closed.  Patches 2 to 4 are the prep work
in sock_reuseport.c, and they have the most changes in this set.
Patches 5 to 7 are the changes in tcp.  The code has been structured
to be as isolated as possible.  It will be most useful to at least
review and give feedback on this part.  The remainder is bpf
related.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
  2021-05-05  4:56   ` Martin KaFai Lau
@ 2021-05-05 23:16     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-05-05 23:16 UTC (permalink / raw)
  To: kafai
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Tue, 4 May 2021 21:56:18 -0700
> On Tue, Apr 27, 2021 at 12:46:18PM +0900, Kuniyuki Iwashima wrote:
> [ ... ]
> 
> > diff --git a/net/core/request_sock.c b/net/core/request_sock.c
> > index 82cf9fbe2668..08c37ecd923b 100644
> > --- a/net/core/request_sock.c
> > +++ b/net/core/request_sock.c
> > @@ -151,6 +151,7 @@ struct request_sock *reqsk_clone(struct request_sock *req, struct sock *sk)
> >  	memcpy(&nreq_sk->sk_dontcopy_end, &req_sk->sk_dontcopy_end,
> >  	       req->rsk_ops->obj_size - offsetof(struct sock, sk_dontcopy_end));
> >  
> > +	sk_node_init(&nreq_sk->sk_node);
> This belongs to patch 5.
> "rsk_refcnt" also needs to be 0 instead of staying uninitialized
> after reqsk_clone() returned.

I'll move this part to patch 5 and initialize the refcnt to 0 in
reqsk_clone(), like reqsk_alloc() does.
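
A sketch of what that could look like inside reqsk_clone(), assuming
the field layout quoted above (this is not the final v5 hunk):

	/* Re-initialize the fields that must not be carried over from
	 * the old req by the memcpy() above.
	 */
	sk_node_init(&nreq_sk->sk_node);

	/* Like reqsk_alloc(): start from 0 and let the caller take the
	 * references it needs once the clone is fully set up.
	 */
	refcount_set(&nreq->rsk_refcnt, 0);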


> 
> >  	nreq_sk->sk_tx_queue_mapping = req_sk->sk_tx_queue_mapping;
> >  #ifdef CONFIG_XPS
> >  	nreq_sk->sk_rx_queue_mapping = req_sk->sk_rx_queue_mapping;
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 851992405826..dc984d1f352e 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -695,10 +695,20 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req)
> >  }
> >  EXPORT_SYMBOL(inet_rtx_syn_ack);
> >  
> > +static void reqsk_queue_migrated(struct request_sock_queue *queue,
> > +				 const struct request_sock *req)
> > +{
> > +	if (req->num_timeout == 0)
> > +		atomic_inc(&queue->young);
> > +	atomic_inc(&queue->qlen);
> > +}
> > +
> >  static void reqsk_migrate_reset(struct request_sock *req)
> >  {
> > +	req->saved_syn = NULL;
> > +	inet_rsk(req)->ireq_opt = NULL;
> >  #if IS_ENABLED(CONFIG_IPV6)
> > -	inet_rsk(req)->ipv6_opt = NULL;
> > +	inet_rsk(req)->pktopts = NULL;
> >  #endif
> >  }
> >  
> > @@ -741,16 +751,37 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_drop_and_put);
> >  
> >  static void reqsk_timer_handler(struct timer_list *t)
> >  {
> > -	struct request_sock *req = from_timer(req, t, rsk_timer);
> > -	struct sock *sk_listener = req->rsk_listener;
> > -	struct net *net = sock_net(sk_listener);
> > -	struct inet_connection_sock *icsk = inet_csk(sk_listener);
> > -	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> > +	struct request_sock *req = from_timer(req, t, rsk_timer), *nreq = NULL, *oreq = req;
> nit. This line is too long.
> Let's move the new "*nreq" and "*oreq" to a new line and keep the current
> "*req" line as is:
> 	struct request_sock *req = from_timer(req, t, rsk_timer);
> 	struct request_sock *oreq = req, *nreq = NULL;

I'll fix that.


> 
> > +	struct sock *sk_listener = req->rsk_listener, *nsk = NULL;
> "*nsk" can be moved into the following "!= TCP_LISTEN" case below.
> Keep the current "*sk_listener" line as is.

I'll move the definition of nsk into that block.

Thank you.


> 
> > +	struct inet_connection_sock *icsk;
> > +	struct request_sock_queue *queue;
> > +	struct net *net;
> >  	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
> >  
> > -	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> > -		goto drop;
> > +	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
> 
> 		struct sock *nsk;
> 
> > +		nsk = reuseport_migrate_sock(sk_listener, req_to_sk(req), NULL);
> > +		if (!nsk)
> > +			goto drop;
> > +
> > +		nreq = reqsk_clone(req, nsk);
> > +		if (!nreq)
> > +			goto drop;
> > +
> > +		/* The new timer for the cloned req can decrease the 2
> > +		 * by calling inet_csk_reqsk_queue_drop_and_put(), so
> > +		 * hold another count to prevent use-after-free and
> > +		 * call reqsk_put() just before return.
> > +		 */
> > +		refcount_set(&nreq->rsk_refcnt, 2 + 1);
> > +		timer_setup(&nreq->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
> > +		reqsk_queue_migrated(&inet_csk(nsk)->icsk_accept_queue, req);
> > +
> > +		req = nreq;
> > +		sk_listener = nsk;
> > +	}
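
To spell out the "2 + 1" accounting in the quoted comment (a sketch;
the tail of the handler is not part of the quote above, so its exact
shape is an assumption here):

	/* Sketch of nreq's refcount lifecycle:
	 *
	 *   refcount_set(&nreq->rsk_refcnt, 2 + 1);
	 *     1 - owned by the request sock hash table entry
	 *     1 - owned by the pinned rsk_timer
	 *     1 - held locally, so a concurrent
	 *         inet_csk_reqsk_queue_drop_and_put() on the new timer
	 *         cannot free nreq while this handler still uses it
	 */

	/* ... the usual expire/resend logic runs against nreq ... */

	reqsk_put(nreq);	/* drop the local reference before returning */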

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
  2021-05-05  5:14   ` Martin KaFai Lau
@ 2021-05-05 23:19     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 29+ messages in thread
From: Kuniyuki Iwashima @ 2021-05-05 23:19 UTC (permalink / raw)
  To: kafai
  Cc: andrii, ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Tue, 4 May 2021 22:14:08 -0700
> On Tue, Apr 27, 2021 at 12:46:23PM +0900, Kuniyuki Iwashima wrote:
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
> > new file mode 100644
> > index 000000000000..d7136dc29fa2
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
> > @@ -0,0 +1,51 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Check if we can migrate child sockets.
> > + *
> > + *   1. If reuse_md->migrating_sk is NULL (SYN packet),
> > + *        return SK_PASS without selecting a listener.
> > + *   2. If reuse_md->migrating_sk is not NULL (socket migration),
> > + *        select a listener (reuseport_map[migrate_map[cookie]])
> > + *
> > + * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > + */
> > +
> > +#include <stddef.h>
> > +#include <linux/bpf.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +struct {
> > +	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
> > +	__uint(max_entries, 256);
> > +	__type(key, int);
> > +	__type(value, __u64);
> > +} reuseport_map SEC(".maps");
> > +
> > +struct {
> > +	__uint(type, BPF_MAP_TYPE_HASH);
> > +	__uint(max_entries, 256);
> > +	__type(key, __u64);
> > +	__type(value, int);
> > +} migrate_map SEC(".maps");
> > +
> > +SEC("sk_reuseport/migrate")
> > +int prog_migrate_reuseport(struct sk_reuseport_md *reuse_md)
> > +{
> > +	int *key, flags = 0;
> > +	__u64 cookie;
> > +
> > +	if (!reuse_md->migrating_sk)
> > +		return SK_PASS;
> > +
> 
> It will be useful to check if it is migrating a child sk or
> a reqsk by testing the migrating_sk->state for TCP_ESTABLISHED
> and TCP_NEW_SYN_RECV.  skb can be further tested to check if it is
> selecting for the final ACK.  Global variables can then be
> incremented and the user prog can check that, for example,
> it is indeed testing the TCP_NEW_SYN_RECV code path...etc.
> 
> It will also become a good example for others on how migrating_sk
> can be used.

Exactly. I'll count in which state/path each migration happens and
validate the counts in userspace.
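
Something along these lines (a sketch only: the counter names are made
up, len == 0 is assumed to identify the timer path, which has no skb,
and a real policy would also call bpf_sk_select_reuseport() as in the
quoted prog):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Global counters the userspace prog can read back after the test;
 * the names are illustrative only.
 */
__u64 migrated_at_close = 0;	/* TCP_ESTABLISHED: child in accept queue */
__u64 migrated_at_synack = 0;	/* TCP_NEW_SYN_RECV, no skb: SYN+ACK timer */
__u64 migrated_at_ack = 0;	/* TCP_NEW_SYN_RECV with skb: final ACK */

SEC("sk_reuseport/migrate")
int count_migrations(struct sk_reuseport_md *reuse_md)
{
	if (!reuse_md->migrating_sk)
		return SK_PASS;	/* SYN packet: no migration in flight */

	switch (reuse_md->migrating_sk->state) {
	case BPF_TCP_ESTABLISHED:
		migrated_at_close++;
		break;
	case BPF_TCP_NEW_SYN_RECV:
		/* The SYN+ACK retransmit timer has no skb, so len == 0
		 * is assumed to distinguish it from the final ACK.
		 */
		if (!reuse_md->len)
			migrated_at_synack++;
		else
			migrated_at_ack++;
		break;
	}

	/* A real policy would select the target sk here with
	 * bpf_sk_select_reuseport(), as in the quoted prog.
	 */
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";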

Thank you.


> 
> 
> > +	cookie = bpf_get_socket_cookie(reuse_md->sk);
> > +
> > +	key = bpf_map_lookup_elem(&migrate_map, &cookie);
> > +	if (!key)
> > +		return SK_DROP;
> > +
> > +	bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags);
> > +
> > +	return SK_PASS;
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
> > -- 
> > 2.30.2
> > 

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2021-05-05 23:19 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-27  3:46 [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 01/11] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 02/11] tcp: Add num_closed_socks to struct sock_reuseport Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 04/11] tcp: Add reuseport_migrate_sock() to select a new listener Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs Kuniyuki Iwashima
2021-05-05  4:56   ` Martin KaFai Lau
2021-05-05 23:16     ` Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 09/11] bpf: Support socket migration by eBPF Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2021-04-27  3:46 ` [PATCH v4 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
2021-05-05  5:14   ` Martin KaFai Lau
2021-05-05 23:19     ` Kuniyuki Iwashima
2021-04-27 16:38 ` [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT Jason Baron
2021-04-28  1:27   ` Martin KaFai Lau
2021-04-28 14:18     ` Eric Dumazet
2021-04-28 15:49       ` Kuniyuki Iwashima
2021-04-28  8:13   ` Kuniyuki Iwashima
2021-04-28 14:44     ` Jason Baron
2021-04-28 15:52       ` Kuniyuki Iwashima
2021-04-28 16:33         ` Eric Dumazet
2021-04-29  3:16           ` Kuniyuki Iwashima
2021-05-05  6:54             ` Martin KaFai Lau
2021-04-27 21:55 ` Maciej Żenczykowski
2021-04-27 22:00   ` Maciej Żenczykowski
2021-04-28  8:18     ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).