netdev.vger.kernel.org archive mirror
* [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
@ 2020-11-17  9:40 Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation. When a SYN packet is received, the connection is tied to a
listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners could accept such connections.

This situation can happen when various server management tools restart
server processes (such as nginx). For instance, when we change the nginx
configuration and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, so the in-flight
final ACKs of the 3WHS are answered with RSTs.

As a workaround for this issue, we can drain connections with eBPF:

  1. Before closing a listener, stop routing SYN packets to it.
  2. Wait enough time for requests to complete 3WHS.
  3. Accept connections until EAGAIN, then close the listener.
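
A rough userspace sketch of step 3 (hand_over_to_worker() is a
hypothetical helper; error handling omitted):

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void drain_and_close(int listen_fd)
    {
            /* Make accept() return EAGAIN once the queue is empty. */
            fcntl(listen_fd, F_SETFL, O_NONBLOCK);

            for (;;) {
                    int fd = accept(listen_fd, NULL, NULL);

                    if (fd < 0) {
                            if (errno == EAGAIN)
                                    break;       /* queue drained */
                            continue;            /* e.g. ECONNABORTED */
                    }
                    hand_over_to_worker(fd);     /* hypothetical helper */
            }
            close(listen_fd);
    }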

Although this approach seems to work well, EAGAIN says nothing about how
many requests are still in the middle of the 3WHS. Thus, to complete
connection draining, we have to learn the number of such requests by
counting SYN packets with eBPF:

  1. Start counting SYN packets and accept syscalls using eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.
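
The userspace half of step 3 could then look roughly like the following,
assuming the eBPF program (not shown) keeps a per-listener "SYNs minus
accepts" counter in a map; the map fd, key, and helper names are
illustrative:

    #include <bpf/bpf.h>             /* libbpf: bpf_map_lookup_elem() */
    #include <linux/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void drain_counted_and_close(int count_map_fd, __u32 key,
                                        int listen_fd)
    {
            __u64 pending = 0;

            /* Counter maintained by the eBPF program:
             * incremented per SYN, decremented per accept().
             */
            bpf_map_lookup_elem(count_map_fd, &key, &pending);

            while (pending--) {
                    int fd = accept(listen_fd, NULL, NULL);

                    if (fd >= 0)
                            hand_over_to_worker(fd); /* hypothetical helper */
            }
            close(listen_fd);
    }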

Using eBPF only for connection draining seems a bit expensive. Moreover, in
some situations we cannot modify and rebuild a server program to implement
the workaround. This patchset introduces a new sysctl option to free
userland programs from this kernel issue. If we enable
net.ipv4.tcp_migrate_req before creating a reuseport group, we can
redistribute requests and connections from a listener to the other
listeners in the same reuseport group at close() or shutdown().

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
hit errors. In such a case, we have to use an eBPF program to select a
specific listener or to cancel the migration.

Kuniyuki Iwashima (8):
  net: Introduce net.ipv4.tcp_migrate_req.
  tcp: Keep TCP_CLOSE sockets in the reuseport group.
  tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  tcp: Migrate TCP_NEW_SYN_RECV requests.
  bpf: Add cookie in sk_reuseport_md.
  bpf: Call bpf_run_sk_reuseport() for socket migration.
  bpf: Test BPF_PROG_TYPE_SK_REUSEPORT for socket migration.

 Documentation/networking/ip-sysctl.rst        |  15 ++
 include/linux/bpf.h                           |   1 +
 include/net/inet_connection_sock.h            |  13 ++
 include/net/netns/ipv4.h                      |   1 +
 include/net/request_sock.h                    |  13 ++
 include/net/sock_reuseport.h                  |   8 +-
 include/uapi/linux/bpf.h                      |   1 +
 net/core/filter.c                             |  34 +++-
 net/core/sock_reuseport.c                     | 110 +++++++++--
 net/ipv4/inet_connection_sock.c               |  84 ++++++++-
 net/ipv4/inet_hashtables.c                    |   9 +-
 net/ipv4/sysctl_net_ipv4.c                    |   9 +
 net/ipv4/tcp_ipv4.c                           |   9 +-
 net/ipv6/tcp_ipv6.c                           |   9 +-
 tools/include/uapi/linux/bpf.h                |   1 +
 .../bpf/prog_tests/migrate_reuseport.c        | 175 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport_kern.c   |  53 ++++++
 17 files changed, 511 insertions(+), 34 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c

-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled and we then call listen() on SO_REUSEPORT-enabled sockets
and close one of them, its child sockets can be migrated to another
listener.
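
A rough usage sketch (error handling omitted; requires <sys/socket.h> and
<netinet/in.h>; the sysctl must be enabled before the reuseport group is
created, i.e. before the first listen()):

    struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(80),
    };
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    /* Later, close() or shutdown() on this listener migrates its child
     * sockets to another listener in the same reuseport group.
     */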

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 Documentation/networking/ip-sysctl.rst | 15 +++++++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 3 files changed, 25 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index dd2b12a32b73..4116771bf5ef 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -712,6 +712,21 @@ tcp_syncookies - INTEGER
 	network connections you can set this knob to 2 to enable
 	unconditionally generation of syncookies.
 
+tcp_migrate_req - INTEGER
+	By default, when a listening socket is closed, child sockets are also
+	closed. If it has SO_REUSEPORT enabled, the dropped connections should
+	have been accepted by other listeners on the same port. This option
+	makes it possible to migrate child sockets to another listener when
+	calling close() or shutdown().
+
+	Default: 0
+
+	Note that the source and destination listeners _must_ have the same
+	settings at the socket API level. If there are different kinds of
+	sockets on the port, disable this option or use
+	BPF_PROG_TYPE_SK_REUSEPORT program to select the correct socket by
+	bpf_sk_select_reuseport() or to cancel migration by returning SK_DROP.
+
 tcp_fastopen - INTEGER
 	Enable TCP Fast Open (RFC7413) to send and accept data in the opening
 	SYN packet.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e4fcac4df72..a3edc30d6a63 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -132,6 +132,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_syn_retries;
 	int sysctl_tcp_synack_retries;
 	int sysctl_tcp_syncookies;
+	int sysctl_tcp_migrate_req;
 	int sysctl_tcp_reordering;
 	int sysctl_tcp_retries1;
 	int sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3e5f4f2e705e..6b76298fa271 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -933,6 +933,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 #endif
+	{
+		.procname	= "tcp_migrate_req",
+		.data		= &init_net.ipv4.sysctl_tcp_migrate_req,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 	{
 		.procname	= "tcp_reordering",
 		.data		= &init_net.ipv4.sysctl_tcp_reordering,
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch prepares for migrating incoming connections in the later commits
and adds two fields (migrate_req and num_closed_socks) to struct
sock_reuseport so that TCP_CLOSE sockets can stay in the reuseport group.

If migrate_req is 1 and we close a listening socket, we can migrate its
connections to another listener in the same reuseport group. We then have
to handle two kinds of child sockets: ones the listening socket holds a
reference to, and ones it does not.

The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets, which sit in the
accept queue of their listening socket, so we can pop them out and push
them into another listener's queue at close() or shutdown(). The latter,
the TCP_NEW_SYN_RECV sockets, are still in the middle of the three-way
handshake and not in the accept queue, so we cannot reach them at close()
or shutdown(). Accordingly, we have to migrate such immature sockets after
their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when the final ACK is received or when the SYN+ACK is
retransmitted. If we could select a new listener from the same reuseport
group at that point, no connection would be aborted. However, this is
impossible because reuseport_detach_sock() sets sk_reuseport_cb to NULL and
thus forbids closed sockets from accessing the reuseport group.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and to
keep access to it while any child socket still references them. The point
is that reuseport_detach_sock() is called twice, from inet_unhash() and
from sk_destruct(). The first call moves the socket to the back of socks[]
and increments num_closed_socks. Later, once all migrated connections have
been accepted, the second call removes the socket from socks[], decrements
num_closed_socks, and sets sk_reuseport_cb to NULL.
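
For instance, with max_socks == 4, two listening sockets (A, B), and one
closed socket (C) that still has children to migrate, socks[] looks like
this (illustrative):

  socks[0] : A (TCP_LISTEN)   <- num_socks == 2
  socks[1] : B (TCP_LISTEN)
  socks[2] : (unused)
  socks[3] : C (TCP_CLOSE)    <- num_closed_socks == 1

Listening sockets fill socks[] from index 0 upwards, closed sockets fill it
from index max_socks - 1 downwards, and reuseport_add_sock() treats the
group as full (and grows it) when num_socks + num_closed_socks ==
max_socks.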

With this change, closed sockets can keep sk_reuseport_cb until all of
their child requests have been freed or accepted. Consequently, calling
listen() after shutdown() can now hit EADDRINUSE or EBUSY in
reuseport_add_sock() or inet_csk_bind_conflict(), which expect such sockets
not to belong to a reuseport group. Therefore, this patch loosens those
validations so that a socket can listen again if it belongs to the same
reuseport group as the other listening sockets.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/sock_reuseport.h    |  6 ++-
 net/core/sock_reuseport.c       | 83 +++++++++++++++++++++++++++------
 net/ipv4/inet_connection_sock.c |  7 ++-
 3 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 505f1e18e9bf..ade3af55c91f 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
 struct sock_reuseport {
 	struct rcu_head		rcu;
 
-	u16			max_socks;	/* length of socks */
-	u16			num_socks;	/* elements in socks */
+	u16			max_socks;		/* length of socks */
+	u16			num_socks;		/* elements in socks */
+	u16			num_closed_socks;	/* closed elements in socks */
 	/* The last synq overflow event timestamp of this
 	 * reuse->socks[] group.
 	 */
@@ -23,6 +24,7 @@ struct sock_reuseport {
 	unsigned int		reuseport_id;
 	unsigned int		bind_inany:1;
 	unsigned int		has_conns:1;
+	unsigned int		migrate_req:1;
 	struct bpf_prog __rcu	*prog;		/* optional BPF sock selector */
 	struct sock		*socks[];	/* array of sock pointers */
 };
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index bbdd3c7b6cb5..01a8b4ba39d7 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -36,6 +36,7 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 int reuseport_alloc(struct sock *sk, bool bind_inany)
 {
 	struct sock_reuseport *reuse;
+	struct net *net = sock_net(sk);
 	int id, ret = 0;
 
 	/* bh lock used since this function call may precede hlist lock in
@@ -75,6 +76,8 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 	reuse->socks[0] = sk;
 	reuse->num_socks = 1;
 	reuse->bind_inany = bind_inany;
+	reuse->migrate_req = sk->sk_protocol == IPPROTO_TCP ?
+		net->ipv4.sysctl_tcp_migrate_req : 0;
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
 
 out:
@@ -98,16 +101,22 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 		return NULL;
 
 	more_reuse->num_socks = reuse->num_socks;
+	more_reuse->num_closed_socks = reuse->num_closed_socks;
 	more_reuse->prog = reuse->prog;
 	more_reuse->reuseport_id = reuse->reuseport_id;
 	more_reuse->bind_inany = reuse->bind_inany;
 	more_reuse->has_conns = reuse->has_conns;
+	more_reuse->migrate_req = reuse->migrate_req;
+	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
 
 	memcpy(more_reuse->socks, reuse->socks,
 	       reuse->num_socks * sizeof(struct sock *));
-	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
+	memcpy(more_reuse->socks +
+	       (more_reuse->max_socks - more_reuse->num_closed_socks),
+	       reuse->socks + reuse->num_socks,
+	       reuse->num_closed_socks * sizeof(struct sock *));
 
-	for (i = 0; i < reuse->num_socks; ++i)
+	for (i = 0; i < reuse->max_socks; ++i)
 		rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
 				   more_reuse);
 
@@ -129,6 +138,25 @@ static void reuseport_free_rcu(struct rcu_head *head)
 	kfree(reuse);
 }
 
+static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk,
+				bool closed)
+{
+	int left, right;
+
+	if (!closed) {
+		left = 0;
+		right = reuse->num_socks;
+	} else {
+		left = reuse->max_socks - reuse->num_closed_socks;
+		right = reuse->max_socks;
+	}
+
+	for (; left < right; left++)
+		if (reuse->socks[left] == sk)
+			return left;
+	return -1;
+}
+
 /**
  *  reuseport_add_sock - Add a socket to the reuseport group of another.
  *  @sk:  New socket to add to the group.
@@ -153,12 +181,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 					  lockdep_is_held(&reuseport_lock));
 	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					     lockdep_is_held(&reuseport_lock));
-	if (old_reuse && old_reuse->num_socks != 1) {
+
+	if (old_reuse == reuse) {
+		int i = reuseport_sock_index(reuse, sk, true);
+
+		if (i == -1) {
+			spin_unlock_bh(&reuseport_lock);
+			return -EBUSY;
+		}
+
+		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+		reuse->num_closed_socks--;
+	} else if (old_reuse && old_reuse->num_socks != 1) {
 		spin_unlock_bh(&reuseport_lock);
 		return -EBUSY;
 	}
 
-	if (reuse->num_socks == reuse->max_socks) {
+	if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
 		reuse = reuseport_grow(reuse);
 		if (!reuse) {
 			spin_unlock_bh(&reuseport_lock);
@@ -174,8 +213,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 
 	spin_unlock_bh(&reuseport_lock);
 
-	if (old_reuse)
+	if (old_reuse && old_reuse != reuse)
 		call_rcu(&old_reuse->rcu, reuseport_free_rcu);
+
 	return 0;
 }
 EXPORT_SYMBOL(reuseport_add_sock);
@@ -199,17 +239,34 @@ void reuseport_detach_sock(struct sock *sk)
 	 */
 	bpf_sk_reuseport_detach(sk);
 
-	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
+	if (!reuse->migrate_req || sk->sk_state == TCP_LISTEN) {
+		i = reuseport_sock_index(reuse, sk, false);
+		if (i == -1)
+			goto out;
+
+		reuse->num_socks--;
+		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
-	for (i = 0; i < reuse->num_socks; i++) {
-		if (reuse->socks[i] == sk) {
-			reuse->socks[i] = reuse->socks[reuse->num_socks - 1];
-			reuse->num_socks--;
-			if (reuse->num_socks == 0)
-				call_rcu(&reuse->rcu, reuseport_free_rcu);
-			break;
+		if (reuse->migrate_req) {
+			reuse->num_closed_socks++;
+			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
+		} else {
+			rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 		}
+	} else {
+		i = reuseport_sock_index(reuse, sk, true);
+		if (i == -1)
+			goto out;
+
+		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+		reuse->num_closed_socks--;
+
+		rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 	}
+
+	if (reuse->num_socks + reuse->num_closed_socks == 0)
+		call_rcu(&reuse->rcu, reuseport_free_rcu);
+out:
 	spin_unlock_bh(&reuseport_lock);
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4148f5f78f31..be8cda5b664f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 	bool reuse = sk->sk_reuse;
 	bool reuseport = !!sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
+	struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
 
 	/*
 	 * Unlike other sk lookup places we do not check
@@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 				if ((!relax ||
 				     (!reuseport_ok &&
 				      reuseport && sk2->sk_reuseport &&
-				      !rcu_access_pointer(sk->sk_reuseport_cb) &&
+				      (!reuseport_cb ||
+				       reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) &&
 				      (sk2->sk_state == TCP_TIME_WAIT ||
 				       uid_eq(uid, sock_i_uid(sk2))))) &&
 				    inet_rcv_saddr_equal(sk, sk2, true))
 					break;
 			} else if (!reuseport_ok ||
 				   !reuseport || !sk2->sk_reuseport ||
-				   rcu_access_pointer(sk->sk_reuseport_cb) ||
+				   (reuseport_cb &&
+				    reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) ||
 				   (sk2->sk_state != TCP_TIME_WAIT &&
 				    !uid_eq(uid, sock_i_uid(sk2)))) {
 				if (inet_rcv_saddr_equal(sk, sk2, true))
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-18 23:50   ` Martin KaFai Lau
  2020-11-17  9:40 ` [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch lets reuseport_detach_sock() return a struct sock pointer, which
is used only by inet_unhash(). If it is not NULL,
inet_csk_reqsk_queue_migrate() migrates the TCP_ESTABLISHED/TCP_SYN_RECV
sockets from the closing listener to the selected one.

Listening sockets hold incoming connections as a linked list of struct
request_sock in the accept queue, and each request has a reference to a
full socket and to its listener. In inet_csk_reqsk_queue_migrate(), we
unlink the requests from the closing listener's queue and relink them to
the head of the new listener's queue. We do not process each request
individually, so the migration completes in O(1) time. However,
TCP_SYN_RECV sockets need special handling, which the next commit adds.

By default, we select the last element of socks[] as the new listener.
This behaviour is based on how the kernel moves sockets in socks[].

For example, suppose we call listen() on four sockets (A, B, C, D) and then
close the first two in turn. The sockets move in socks[] as shown below.
(See also [1])

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a
request (a, b, c, d) in its accept queue, we can redistribute the old
requests evenly to the new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they
MUST have the same settings at the socket API level; otherwise, unexpected
errors may happen. For instance, if only the new listeners have
TCP_SAVE_SYN enabled, the old requests carry no saved SYN data, so the
application will face the inconsistency and hit an error.

Therefore, if there are different kinds of sockets, we must disable this
feature or use an eBPF program described in later commits.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
---
 include/net/inet_connection_sock.h |  1 +
 include/net/sock_reuseport.h       |  2 +-
 net/core/sock_reuseport.c          |  8 +++++++-
 net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c         |  9 +++++++--
 5 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..2ea2d743f8fc 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 				      struct request_sock *req,
 				      struct sock *child);
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
 				   unsigned long timeout);
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index ade3af55c91f..ece1c70ca907 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,7 +32,7 @@ struct sock_reuseport {
 extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
-extern void reuseport_detach_sock(struct sock *sk);
+extern struct sock *reuseport_detach_sock(struct sock *sk);
 extern struct sock *reuseport_select_sock(struct sock *sk,
 					  u32 hash,
 					  struct sk_buff *skb,
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 01a8b4ba39d7..74a46197854b 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -220,9 +220,10 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 }
 EXPORT_SYMBOL(reuseport_add_sock);
 
-void reuseport_detach_sock(struct sock *sk)
+struct sock *reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
+	struct sock *nsk = NULL;
 	int i;
 
 	spin_lock_bh(&reuseport_lock);
@@ -248,6 +249,9 @@ void reuseport_detach_sock(struct sock *sk)
 		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
 		if (reuse->migrate_req) {
+			if (reuse->num_socks)
+				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+
 			reuse->num_closed_socks++;
 			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
 		} else {
@@ -268,6 +272,8 @@ void reuseport_detach_sock(struct sock *sk)
 		call_rcu(&reuse->rcu, reuseport_free_rcu);
 out:
 	spin_unlock_bh(&reuseport_lock);
+
+	return nsk;
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index be8cda5b664f..583db7e2b1da 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
+{
+	struct request_sock_queue *old_accept_queue, *new_accept_queue;
+
+	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
+	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
+
+	spin_lock(&old_accept_queue->rskq_lock);
+	spin_lock(&new_accept_queue->rskq_lock);
+
+	if (old_accept_queue->rskq_accept_head) {
+		if (new_accept_queue->rskq_accept_head)
+			old_accept_queue->rskq_accept_tail->dl_next =
+				new_accept_queue->rskq_accept_head;
+		else
+			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
+
+		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
+		old_accept_queue->rskq_accept_head = NULL;
+		old_accept_queue->rskq_accept_tail = NULL;
+
+		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
+		WRITE_ONCE(sk->sk_ack_backlog, 0);
+	}
+
+	spin_unlock(&new_accept_queue->rskq_lock);
+	spin_unlock(&old_accept_queue->rskq_lock);
+}
+EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
+
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 					 struct request_sock *req, bool own_req)
 {
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8cbe74313f38..f35c76cf3365 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -629,6 +629,7 @@ void inet_unhash(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 	struct inet_listen_hashbucket *ilb = NULL;
+	struct sock *nsk;
 	spinlock_t *lock;
 
 	if (sk_unhashed(sk))
@@ -644,8 +645,12 @@ void inet_unhash(struct sock *sk)
 	if (sk_unhashed(sk))
 		goto unlock;
 
-	if (rcu_access_pointer(sk->sk_reuseport_cb))
-		reuseport_detach_sock(sk);
+	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
+		nsk = reuseport_detach_sock(sk);
+		if (nsk)
+			inet_csk_reqsk_queue_migrate(sk, nsk);
+	}
+
 	if (ilb) {
 		inet_unhash2(hashinfo, sk);
 		ilb->count--;
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2020-11-17  9:40 ` [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

A TFO request socket is freed only after BOTH the 3WHS has completed (or
been aborted) AND the child socket has been accepted (or its listener has
been closed). Hence, depending on the order, there can be two kinds of
request sockets in the accept queue.

  3WHS -> accept : TCP_ESTABLISHED
  accept -> 3WHS : TCP_SYN_RECV

Unlike the TCP_ESTABLISHED case, accept() does not free the request socket
of a TCP_SYN_RECV socket; it is freed later in reqsk_fastopen_remove(),
which also accesses request_sock.rsk_listener. So, in order to complete TFO
socket migration, we have to point rsk_listener at the current listener at
accept() time, before reqsk_fastopen_remove() runs.

Moreover, if a TFO request causes an RST before the 3WHS has completed, it
is held in the listener's TFO queue to mitigate DDoS attacks. Thus, we also
have to migrate the requests in the TFO queue.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/ipv4/inet_connection_sock.c | 35 ++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 583db7e2b1da..398c5c708bc5 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 	    tcp_rsk(req)->tfo_listener) {
 		spin_lock_bh(&queue->fastopenq.lock);
 		if (tcp_rsk(req)->tfo_listener) {
+			if (req->rsk_listener != sk) {
+				/* TFO request was migrated to another listener so
+				 * the new listener must be used in reqsk_fastopen_remove()
+				 * to hold requests which cause RST.
+				 */
+				sock_put(req->rsk_listener);
+				sock_hold(sk);
+				req->rsk_listener = sk;
+			}
+
 			/* We are still waiting for the final ACK from 3WHS
 			 * so can't free req now. Instead, we set req->sk to
 			 * NULL to signify that the child socket is taken
@@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req,
 
 	if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) {
 		BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req);
-		BUG_ON(sk != req->rsk_listener);
 
 		/* Paranoid, to prevent race condition if
 		 * an inbound pkt destined for child is
@@ -995,6 +1004,7 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
 {
 	struct request_sock_queue *old_accept_queue, *new_accept_queue;
+	struct fastopen_queue *old_fastopenq, *new_fastopenq;
 
 	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
 	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
@@ -1019,6 +1029,29 @@ void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
 
 	spin_unlock(&new_accept_queue->rskq_lock);
 	spin_unlock(&old_accept_queue->rskq_lock);
+
+	old_fastopenq = &old_accept_queue->fastopenq;
+	new_fastopenq = &new_accept_queue->fastopenq;
+
+	spin_lock_bh(&old_fastopenq->lock);
+	spin_lock_bh(&new_fastopenq->lock);
+
+	new_fastopenq->qlen += old_fastopenq->qlen;
+	old_fastopenq->qlen = 0;
+
+	if (old_fastopenq->rskq_rst_head) {
+		if (new_fastopenq->rskq_rst_head)
+			old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head;
+		else
+			new_fastopenq->rskq_rst_tail = old_fastopenq->rskq_rst_tail;
+
+		new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head;
+		old_fastopenq->rskq_rst_head = NULL;
+		old_fastopenq->rskq_rst_tail = NULL;
+	}
+
+	spin_unlock_bh(&new_fastopenq->lock);
+	spin_unlock_bh(&old_fastopenq->lock);
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
 
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2020-11-17  9:40 ` [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-17  9:40 ` [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
requests when the final ACK is received or the SYN+ACK is retransmitted.
Therefore, this patch changes the code to call reuseport_select_sock() even
if the listening socket is TCP_CLOSE. If we can pick a listening socket out
of the reuseport group, we rewrite request_sock.rsk_listener and resume
processing the request.

Note that we call reuseport_select_sock() with a NULL skb so that it
selects a listener pseudo-randomly by hash. There are two reasons for this.
First, we do not remember which listener received which sockets at close()
or shutdown(), so we simply redistribute the requests evenly. The second
reason is covered in a later commit.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/inet_connection_sock.h | 12 ++++++++++++
 include/net/request_sock.h         | 13 +++++++++++++
 net/ipv4/inet_connection_sock.c    | 12 ++++++++++--
 net/ipv4/tcp_ipv4.c                |  9 +++++++--
 net/ipv6/tcp_ipv6.c                |  9 +++++++--
 5 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ea2d743f8fc..1e0958f5eb21 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
 	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
 }
 
+static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
+						 struct sock *nsk,
+						 struct request_sock *req)
+{
+	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
+			     &inet_csk(nsk)->icsk_accept_queue,
+			     req);
+	sock_put(sk);
+	sock_hold(nsk);
+	req->rsk_listener = nsk;
+}
+
 static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
 {
 	return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..d18ba0b857cc 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue)
 	atomic_inc(&queue->qlen);
 }
 
+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+					struct request_sock_queue *new_accept_queue,
+					const struct request_sock *req)
+{
+	atomic_dec(&old_accept_queue->qlen);
+	atomic_inc(&new_accept_queue->qlen);
+
+	if (req->num_timeout == 0) {
+		atomic_dec(&old_accept_queue->young);
+		atomic_inc(&new_accept_queue->young);
+	}
+}
+
 static inline int reqsk_queue_len(const struct request_sock_queue *queue)
 {
 	return atomic_read(&queue->qlen);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 398c5c708bc5..8be20e3fff4f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -743,8 +743,16 @@ static void reqsk_timer_handler(struct timer_list *t)
 	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
 
-	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
-		goto drop;
+	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
+		sk_listener = reuseport_select_sock(sk_listener, sk_listener->sk_hash, NULL, 0);
+		if (!sk_listener) {
+			sk_listener = req->rsk_listener;
+			goto drop;
+		}
+		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
+		icsk = inet_csk(sk_listener);
+		queue = &icsk->icsk_accept_queue;
+	}
 
 	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
 	/* Normally all the openreqs are young and become mature
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c2d5132c523c..7219984584ae 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1957,8 +1957,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_sock(sk, sk->sk_hash, NULL, 0);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		/* We own a reference on the listener, increase it again
 		 * as we might lose it too soon.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8db59f4e5f13..9a068b69a26e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1619,8 +1619,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_sock(sk, sk->sk_hash, NULL, 0);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		sock_hold(sk);
 		refcounted = true;
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2020-11-17  9:40 ` [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-19  0:11   ` Martin KaFai Lau
  2020-11-17  9:40 ` [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing in order to
select the new listener.

Currently, we can get a unique ID (the socket cookie) for each listener in
userspace by calling bpf_map_lookup_elem() on a
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch exposes that ID to the
eBPF program.
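
For reference, a rough sketch of how userspace obtains this ID today with
libbpf (the map fd and index are illustrative):

    __u64 cookie;
    int index = 0;  /* slot of the listener in the sockarray */

    /* For a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, a syscall-side lookup
     * returns the stored socket's cookie rather than a pointer.
     */
    bpf_map_lookup_elem(reuseport_map_fd, &index, &cookie);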

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/linux/bpf.h            | 1 +
 include/uapi/linux/bpf.h       | 1 +
 net/core/filter.c              | 8 ++++++++
 tools/include/uapi/linux/bpf.h | 1 +
 4 files changed, 11 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 581b2a2e78eb..c0646eceffa2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1897,6 +1897,7 @@ struct sk_reuseport_kern {
 	u32 hash;
 	u32 reuseport_id;
 	bool bind_inany;
+	u64 cookie;
 };
 bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 				  struct bpf_insn_access_aux *info);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 162999b12790..3fcddb032838 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4403,6 +4403,7 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	__u64 cookie;		/* ID of the listener in map */
 };
 
 #define BPF_TAG_SIZE	8
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..01e28f283962 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9862,6 +9862,7 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 	reuse_kern->hash = hash;
 	reuse_kern->reuseport_id = reuse->reuseport_id;
 	reuse_kern->bind_inany = reuse->bind_inany;
+	reuse_kern->cookie = sock_gen_cookie(sk);
 }
 
 struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
@@ -10010,6 +10011,9 @@ sk_reuseport_is_valid_access(int off, int size,
 	case offsetof(struct sk_reuseport_md, hash):
 		return size == size_default;
 
+	case bpf_ctx_range(struct sk_reuseport_md, cookie):
+		return size == sizeof(__u64);
+
 	/* Fields that allow narrowing */
 	case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
 		if (size < sizeof_field(struct sk_buff, protocol))
@@ -10082,6 +10086,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct sk_reuseport_md, bind_inany):
 		SK_REUSEPORT_LOAD_FIELD(bind_inany);
 		break;
+
+	case offsetof(struct sk_reuseport_md, cookie):
+		SK_REUSEPORT_LOAD_FIELD(cookie);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 162999b12790..3fcddb032838 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4403,6 +4403,7 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	__u64 cookie;		/* ID of the listener in map */
 };
 
 #define BPF_TAG_SIZE	8
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2020-11-17  9:40 ` [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-19  1:00   ` Martin KaFai Lau
  2020-11-17  9:40 ` [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT " Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch makes it possible to select a new listener for socket migration
by eBPF.

The noteworthy point is that we select a listening socket in
reuseport_detach_sock() and reuseport_select_sock(), but we do not have a
struct sk_buff in the unhash path.

Since we cannot pass an skb to the eBPF program, we run only a
BPF_PROG_TYPE_SK_REUSEPORT program by calling bpf_run_sk_reuseport() with a
NULL skb. Accordingly, the skb-derived fields are also NULL (or zero) in
the eBPF program.

Moreover, the eBPF program can cancel migration by returning SK_DROP. This
feature is useful when listeners have different settings at the socket API
level or when we want to free resources as soon as possible.
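
A minimal sketch of a program that only cancels migration, leaving the
normal SYN path to the default hash-based selection (see also the selftest
added in the last patch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("sk_reuseport")
    int cancel_migration(struct sk_reuseport_md *reuse_md)
    {
            /* data is NULL only on the migration path (no skb). */
            if (!reuse_md->data)
                    return SK_DROP;  /* cancel migration for this listener */

            return SK_PASS;          /* normal SYN: fall back to hash selection */
    }

    char _license[] SEC("license") = "GPL";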

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/core/filter.c          | 26 +++++++++++++++++++++-----
 net/core/sock_reuseport.c  | 23 ++++++++++++++++++++---
 net/ipv4/inet_hashtables.c |  2 +-
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 01e28f283962..ffc4591878b8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8914,6 +8914,22 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF,		       \
 					     BPF_FIELD_SIZEOF(NS, NF), 0)
 
+#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, SIZE, OFF)	\
+	do {									\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,	\
+				      si->src_reg, offsetof(S, F));		\
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);		\
+		*insn++ = BPF_LDX_MEM(						\
+			SIZE, si->dst_reg, si->dst_reg,				\
+			bpf_target_off(NS, NF, sizeof_field(NS, NF),		\
+				       target_size)				\
+			+ OFF);							\
+	} while (0)
+
+#define SOCK_ADDR_LOAD_NESTED_FIELD_OR_NULL(S, NS, F, NF)			\
+	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF,		\
+						     BPF_FIELD_SIZEOF(NS, NF), 0)
+
 /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
  * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
  *
@@ -9858,7 +9874,7 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 	reuse_kern->skb = skb;
 	reuse_kern->sk = sk;
 	reuse_kern->selected_sk = NULL;
-	reuse_kern->data_end = skb->data + skb_headlen(skb);
+	reuse_kern->data_end = skb ? skb->data + skb_headlen(skb) : NULL;
 	reuse_kern->hash = hash;
 	reuse_kern->reuseport_id = reuse->reuseport_id;
 	reuse_kern->bind_inany = reuse->bind_inany;
@@ -10039,10 +10055,10 @@ sk_reuseport_is_valid_access(int off, int size,
 	})
 
 #define SK_REUSEPORT_LOAD_SKB_FIELD(SKB_FIELD)				\
-	SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern,		\
-				    struct sk_buff,			\
-				    skb,				\
-				    SKB_FIELD)
+	SOCK_ADDR_LOAD_NESTED_FIELD_OR_NULL(struct sk_reuseport_kern,	\
+					    struct sk_buff,		\
+					    skb,			\
+					    SKB_FIELD)
 
 #define SK_REUSEPORT_LOAD_SK_FIELD(SK_FIELD)				\
 	SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern,		\
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 74a46197854b..903f78ab35c3 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -224,6 +224,7 @@ struct sock *reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
 	struct sock *nsk = NULL;
+	struct bpf_prog *prog;
 	int i;
 
 	spin_lock_bh(&reuseport_lock);
@@ -249,8 +250,16 @@ struct sock *reuseport_detach_sock(struct sock *sk)
 		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
 		if (reuse->migrate_req) {
-			if (reuse->num_socks)
-				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+			if (reuse->num_socks) {
+				prog = rcu_dereference(reuse->prog);
+				if (prog && prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
+					nsk = bpf_run_sk_reuseport(reuse, sk, prog,
+								   NULL, sk->sk_hash);
+
+				if (!nsk)
+					nsk = i == reuse->num_socks ?
+						reuse->socks[i - 1] : reuse->socks[i];
+			}
 
 			reuse->num_closed_socks++;
 			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
@@ -340,8 +349,16 @@ struct sock *reuseport_select_sock(struct sock *sk,
 		/* paired with smp_wmb() in reuseport_add_sock() */
 		smp_rmb();
 
-		if (!prog || !skb)
+		if (!prog)
+			goto select_by_hash;
+
+		if (!skb) {
+			if (reuse->migrate_req &&
+			    prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
+				sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
+
 			goto select_by_hash;
+		}
 
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
 			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f35c76cf3365..d981e4876679 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -647,7 +647,7 @@ void inet_unhash(struct sock *sk)
 
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		nsk = reuseport_detach_sock(sk);
-		if (nsk)
+		if (!IS_ERR_OR_NULL(nsk))
 			inet_csk_reqsk_queue_migrate(sk, nsk);
 	}
 
-- 
2.17.2 (Apple Git-113)



* [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT for socket migration.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2020-11-17  9:40 ` [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
@ 2020-11-17  9:40 ` Kuniyuki Iwashima
  2020-11-18  9:18 ` [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT David Laight
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-17  9:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	bpf, netdev, linux-kernel

This patch adds a test for net.ipv4.tcp_migrate_req with eBPF.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 .../bpf/prog_tests/migrate_reuseport.c        | 175 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport_kern.c   |  53 ++++++
 2 files changed, 228 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c

diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
new file mode 100644
index 000000000000..fb182e575371
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ *   1. call listen() for 5 server sockets.
+ *   2. update a map to migrate all child sockets
+ *        to the last server socket (map[cookie] = 4)
+ *   3. call connect() for 25 client sockets.
+ *   4. call close() on the first 4 server sockets.
+ *   5. call receive() for the last server socket.
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <linux/bpf.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define NUM_SOCKS 5
+#define LOCALHOST "127.0.0.1"
+#define err_exit(condition, message)			      \
+	do {						      \
+		if (condition) {			      \
+			perror("ERROR: " message " ");	      \
+			setup_sysctl(0);		      \
+			exit(1);			      \
+		}					      \
+	} while (0)
+
+__u64 server_fds[NUM_SOCKS];
+int prog_fd, map_fd, migrate_map_fd;
+
+void setup_sysctl(int value)
+{
+	FILE *file;
+
+	file = fopen("/proc/sys/net/ipv4/tcp_migrate_req", "w");
+	fprintf(file, "%d", value);
+	fclose(file);
+}
+
+void setup_bpf(void)
+{
+	struct bpf_object *obj;
+	struct bpf_program *prog;
+	struct bpf_map *map, *migrate_map;
+	int err;
+
+	obj = bpf_object__open("test_migrate_reuseport_kern.o");
+	err_exit(libbpf_get_error(obj), "opening BPF object file failed");
+
+	err = bpf_object__load(obj);
+	err_exit(err, "loading BPF object failed");
+
+	prog = bpf_program__next(NULL, obj);
+	err_exit(!prog, "loading BPF program failed");
+
+	map = bpf_object__find_map_by_name(obj, "reuseport_map");
+	err_exit(!map, "loading BPF reuseport_map failed");
+
+	migrate_map = bpf_object__find_map_by_name(obj, "migrate_map");
+	err_exit(!map, "loading BPF migrate_map failed");
+
+	prog_fd = bpf_program__fd(prog);
+	map_fd = bpf_map__fd(map);
+	migrate_map_fd = bpf_map__fd(migrate_map);
+}
+
+void test_listen(void)
+{
+	struct sockaddr_in addr;
+	socklen_t addr_len = sizeof(addr);
+	int i, err, optval = 1, migrated_to = NUM_SOCKS - 1;
+	__u64 value;
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(80);
+	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
+
+	for (i = 0; i < NUM_SOCKS; i++) {
+		server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+		err_exit(server_fds[i] == -1, "socket() for listener sockets failed");
+
+		err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT,
+				 &optval, sizeof(optval));
+		err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed");
+
+		if (i == 0) {
+			err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+					 &prog_fd, sizeof(prog_fd));
+			err_exit(err == -1, "setsockopt() for SO_ATTACH_REUSEPORT_EBPF failed");
+		}
+
+		err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len);
+		err_exit(err == -1, "bind() failed");
+
+		err = listen(server_fds[i], 32);
+		err_exit(err == -1, "listen() failed");
+
+		err = bpf_map_update_elem(map_fd, &i, &server_fds[i], BPF_NOEXIST);
+		err_exit(err == -1, "updating BPF reuseport_map failed");
+
+		err = bpf_map_lookup_elem(map_fd, &i, &value);
+		err_exit(err == -1, "looking up BPF reuseport_map failed");
+
+		printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, migrated_to);
+		err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST);
+		err_exit(err == -1, "updating BPF migrate_map failed");
+	}
+}
+
+void test_connect(void)
+{
+	struct sockaddr_in addr;
+	socklen_t addr_len = sizeof(addr);
+	int i, err, client_fd;
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(80);
+	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
+
+	for (i = 0; i < NUM_SOCKS * 5; i++) {
+		client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+		err_exit(client_fd == -1, "socket() for listener sockets failed");
+
+		err = connect(client_fd, (struct sockaddr *)&addr, addr_len);
+		err_exit(err == -1, "connect() failed");
+
+		close(client_fd);
+	}
+}
+
+void test_close(void)
+{
+	int i;
+
+	for (i = 0; i < NUM_SOCKS - 1; i++)
+		close(server_fds[i]);
+}
+
+void test_receive(void)
+{
+	struct sockaddr_in addr;
+	socklen_t addr_len = sizeof(addr);
+	int cnt, client_fd;
+
+	fcntl(server_fds[NUM_SOCKS - 1], F_SETFL, O_NONBLOCK);
+
+	for (cnt = 0; cnt < NUM_SOCKS * 5; cnt++) {
+		client_fd = accept(server_fds[NUM_SOCKS - 1], (struct sockaddr *)&addr, &addr_len);
+		err_exit(client_fd == -1, "accept() failed");
+	}
+
+	printf("%d accepted, %d is expected\n", cnt, NUM_SOCKS * 5);
+}
+
+int main(void)
+{
+	setup_sysctl(1);
+	setup_bpf();
+	test_listen();
+	test_connect();
+	test_close();
+	test_receive();
+	close(server_fds[NUM_SOCKS - 1]);
+	setup_sysctl(0);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
new file mode 100644
index 000000000000..79f8a3465c20
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ *   1. If data is not NULL (SYN packet),
+ *        return SK_PASS without selecting a listener.
+ *   2. If data is NULL (socket migration),
+ *        select a listener (reuseport_map[map[cookie]])
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+#define NULL ((void *)0)
+
+int _version SEC("version") = 1;
+
+struct bpf_map_def SEC("maps") reuseport_map = {
+	.type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(__u64),
+	.max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") migrate_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__u64),
+	.value_size = sizeof(int),
+	.max_entries = 256,
+};
+
+SEC("sk_reuseport")
+int select_by_skb_data(struct sk_reuseport_md *reuse_md)
+{
+	int *key, flags = 0;
+	void *data = reuse_md->data;
+	__u64 cookie = reuse_md->cookie;
+
+	if (data)
+		return SK_PASS;
+
+	key = bpf_map_lookup_elem(&migrate_map, &cookie);
+	if (key == NULL)
+		return SK_DROP;
+
+	bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags);
+
+	return SK_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.2 (Apple Git-113)



* RE: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2020-11-17  9:40 ` [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT " Kuniyuki Iwashima
@ 2020-11-18  9:18 ` David Laight
  2020-11-19 22:01   ` Kuniyuki Iwashima
  2020-11-18 16:25 ` Eric Dumazet
  2020-11-19  1:49 ` Martin KaFai Lau
  10 siblings, 1 reply; 27+ messages in thread
From: David Laight @ 2020-11-18  9:18 UTC (permalink / raw)
  To: 'Kuniyuki Iwashima',
	David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf, netdev, linux-kernel

From: Kuniyuki Iwashima
> Sent: 17 November 2020 09:40
> 
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.

Can't you do something to stop new connections being queued (like
setting the 'backlog' to zero), then carry on doing accept()s
for a guard time (or until the queue length is zero) before finally
closing the listening socket.

	David




* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2020-11-18  9:18 ` [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT David Laight
@ 2020-11-18 16:25 ` Eric Dumazet
  2020-11-19 22:05   ` Kuniyuki Iwashima
  2020-11-19  1:49 ` Martin KaFai Lau
  10 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2020-11-18 16:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, bpf, netdev, linux-kernel



On 11/17/20 10:40 AM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.
> 

I know some programs are simply removing a listener from the group,
so that they no longer handle new SYN packets,
and wait until all timers or 3WHS have completed before closing them.

They pass fd of newly accepted children to more recent programs using af_unix fd passing,
while in this draining mode.
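
For reference, that fd passing usually looks something like the following
(a minimal sketch only; send_child_fd(), the already-connected unix_fd and
the error handling are illustrative, not taken from any particular program):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one accepted child fd to the newer process over an
 * already-connected AF_UNIX socket.
 */
static int send_child_fd(int unix_fd, int child_fd)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} u = {};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &child_fd, sizeof(int));

	return sendmsg(unix_fd, &msg, 0) == 1 ? 0 : -1;
}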

Quite frankly, mixing eBPF in the picture is distracting.

It seems you want some way to transfer request sockets (and/or not yet accepted established ones)
from fd1 to fd2, isn't it something that should be discussed independently ?



* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-17  9:40 ` [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
@ 2020-11-18 23:50   ` Martin KaFai Lau
  2020-11-19 22:09     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-18 23:50 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, bpf, netdev, linux-kernel

On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> This patch lets reuseport_detach_sock() return a pointer of struct sock,
> which is used only by inet_unhash(). If it is not NULL,
> inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> sockets from the closing listener to the selected one.
> 
> Listening sockets hold incoming connections as a linked list of struct
> request_sock in the accept queue, and each request has reference to a full
> socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> requests from the closing listener's queue and relink them to the head of
> the new listener's queue. We do not process each request, so the migration
> completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> sockets, we will take special care in the next commit.
> 
> By default, we select the last element of socks[] as the new listener.
> This behaviour is based on how the kernel moves sockets in socks[].
> 
> For example, we call listen() for four sockets (A, B, C, D), and close the
> first two by turns. The sockets move in socks[] like below. (See also [1])
> 
>   socks[0] : A <-.      socks[0] : D          socks[0] : D
>   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
>   socks[2] : C   |      socks[2] : C --'
>   socks[3] : D --'
> 
> Then, if C and D have newer settings than A and B, and each socket has a
> request (a, b, c, d) in their accept queue, we can redistribute old
> requests evenly to new listeners.
I don't think it should emphasize/claim there is a specific way that
the kernel-pick here can redistribute the requests evenly.  It depends on
how the application close/listen.  The userspace can not expect the
ordering of socks[] will behave in a certain way.

The primary redistribution policy has to depend on BPF which is the
policy defined by the user based on its application logic (e.g. how
its binary restart work).  The application (and bpf) knows which one
is a dying process and can avoid distributing to it.

The kernel-pick could be an optional fallback but not a must.  If the bpf
prog is attached, I would even go further to call bpf to redistribute
regardless of the sysctl, so I think the sysctl is not necessary.

> 
>   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
>   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
>   socks[2] : C (c)   |      socks[2] : C (c) --'
>   socks[3] : D (d) --'
> 


* Re: [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md.
  2020-11-17  9:40 ` [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md Kuniyuki Iwashima
@ 2020-11-19  0:11   ` Martin KaFai Lau
  2020-11-19 22:10     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-19  0:11 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, bpf, netdev, linux-kernel

On Tue, Nov 17, 2020 at 06:40:21PM +0900, Kuniyuki Iwashima wrote:
> We will call sock_reuseport.prog for socket migration in the next commit,
> so the eBPF program has to know which listener is closing in order to
> select the new listener.
> 
> Currently, we can get a unique ID for each listener in the userspace by
> calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
> This patch exposes the ID to the eBPF program.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/linux/bpf.h            | 1 +
>  include/uapi/linux/bpf.h       | 1 +
>  net/core/filter.c              | 8 ++++++++
>  tools/include/uapi/linux/bpf.h | 1 +
>  4 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 581b2a2e78eb..c0646eceffa2 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1897,6 +1897,7 @@ struct sk_reuseport_kern {
>  	u32 hash;
>  	u32 reuseport_id;
>  	bool bind_inany;
> +	u64 cookie;
>  };
>  bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
>  				  struct bpf_insn_access_aux *info);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 162999b12790..3fcddb032838 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4403,6 +4403,7 @@ struct sk_reuseport_md {
>  	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
>  	__u32 bind_inany;	/* Is sock bound to an INANY address? */
>  	__u32 hash;		/* A hash of the packet 4 tuples */
> +	__u64 cookie;		/* ID of the listener in map */
Instead of only adding the cookie of a sk, lets make the sk pointer available:

	__bpf_md_ptr(struct bpf_sock *, sk);

and then use the BPF_FUNC_get_socket_cookie to get the cookie.

Other fields of the sk can also be directly accessed too once
the sk pointer is available.
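
For example, the selftest prog could then look roughly like this (a rough
sketch reusing the maps and includes from the selftest in patch 8; the sk
field and calling bpf_get_socket_cookie() on it are only the proposal
above, not existing UAPI):

SEC("sk_reuseport")
int select_by_sk_cookie(struct sk_reuseport_md *reuse_md)
{
	__u64 cookie;
	int *key;

	if (reuse_md->data)	/* SYN path: keep the normal selection */
		return SK_PASS;

	if (!reuse_md->sk)	/* proposed field */
		return SK_DROP;

	cookie = bpf_get_socket_cookie(reuse_md->sk);

	key = bpf_map_lookup_elem(&migrate_map, &cookie);
	if (!key)
		return SK_DROP;

	bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, 0);

	return SK_PASS;
}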


* Re: [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration.
  2020-11-17  9:40 ` [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
@ 2020-11-19  1:00   ` Martin KaFai Lau
  2020-11-19 22:13     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-19  1:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, bpf, netdev, linux-kernel

On Tue, Nov 17, 2020 at 06:40:22PM +0900, Kuniyuki Iwashima wrote:
> This patch makes it possible to select a new listener for socket migration
> by eBPF.
> 
> The noteworthy point is that we select a listening socket in
> reuseport_detach_sock() and reuseport_select_sock(), but we do not have
> struct skb in the unhash path.
> 
> Since we cannot pass skb to the eBPF program, we run only the
> BPF_PROG_TYPE_SK_REUSEPORT program by calling bpf_run_sk_reuseport() with
> skb NULL. So, some fields derived from skb are also NULL in the eBPF
> program.
More things need to be considered here when skb is NULL.

Some helpers are probably assuming skb is not NULL.

Also, the sk_lookup in filter.c is actually passing a NULL skb to avoid
doing the reuseport select.

> 
> Moreover, we can cancel migration by returning SK_DROP. This feature is
> useful when listeners have different settings at the socket API level or
> when we want to free resources as soon as possible.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  net/core/filter.c          | 26 +++++++++++++++++++++-----
>  net/core/sock_reuseport.c  | 23 ++++++++++++++++++++---
>  net/ipv4/inet_hashtables.c |  2 +-
>  3 files changed, 42 insertions(+), 9 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 01e28f283962..ffc4591878b8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8914,6 +8914,22 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
>  	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF,		       \
>  					     BPF_FIELD_SIZEOF(NS, NF), 0)
>  
> +#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, SIZE, OFF)	\
> +	do {									\
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,	\
> +				      si->src_reg, offsetof(S, F));		\
> +		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);		\
Although it may not matter much, always doing this check seems not very ideal
considering the fast path will always have skb and only the slow
path (accept-queue migrate) has skb is NULL.  I think the req_sk usually
has the skb also except the timer one.

First thought is to create a temp skb but it has its own issues.
or it may actually belong to a new prog type.  However, lets keep
exploring possible options (including NULL skb).

> +		*insn++ = BPF_LDX_MEM(						\
> +			SIZE, si->dst_reg, si->dst_reg,				\
> +			bpf_target_off(NS, NF, sizeof_field(NS, NF),		\
> +				       target_size)				\
> +			+ OFF);							\
> +	} while (0)


* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2020-11-18 16:25 ` Eric Dumazet
@ 2020-11-19  1:49 ` Martin KaFai Lau
  2020-11-19 22:17   ` Kuniyuki Iwashima
  10 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-19  1:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, bpf, netdev, linux-kernel

On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.
> 
> As a workaround for this issue, we can do connection draining by eBPF:
> 
>   1. Before closing a listener, stop routing SYN packets to it.
>   2. Wait enough time for requests to complete 3WHS.
>   3. Accept connections until EAGAIN, then close the listener.
> 
> Although this approach seems to work well, EAGAIN has nothing to do with
> how many requests are still during 3WHS. Thus, we have to know the number
It sounds like the application can already drain the established socket
by accept()?  To solve the problem that you have,
does it mean migrating req_sk (the in-progress 3WHS) is enough?

Applications can already use the bpf prog to do (1) and divert
the SYN to the newly started process.

If the application cares about service disruption,
it usually needs to drain the fd(s) that it already has and
finishes serving the pending request (e.g. https) on them anyway.
The time taking to finish those could already be longer than it takes
to drain the accept queue or finish off the 3WHS in reasonable time.
or the application that you have does not need to drain the fd(s) 
it already has and it can close them immediately?

> of such requests by counting SYN packets by eBPF to complete connection
> draining.
> 
>   1. Start counting SYN packets and accept syscalls using eBPF map.
>   2. Stop routing SYN packets.
>   3. Accept connections up to the count, then close the listener.


* RE: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-18  9:18 ` [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT David Laight
@ 2020-11-19 22:01   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-19 22:01 UTC (permalink / raw)
  To: david.laight
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   David Laight <David.Laight@ACULAB.COM>
Date:   Wed, 18 Nov 2020 09:18:24 +0000
> From: Kuniyuki Iwashima
> > Sent: 17 November 2020 09:40
> > 
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in
> > in-flight ACK of 3WHS is responded by RST.
> 
> Can't you do something to stop new connections being queued (like
> setting the 'backlog' to zero), then carry on doing accept()s
> for a guard time (or until the queue length is zero) before finally
> closing the listening socket.

Yes, but only with eBPF.
Some ideas were suggested and discussed thoroughly in the thread below,
and as a result, connection draining by eBPF was merged.
https://lore.kernel.org/netdev/1443313848-751-1-git-send-email-tolga.ceylan@gmail.com/
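
The "stop routing SYN packets" step of that draining is typically just
removing the dying listener from the sockarray that the sk_reuseport prog
selects from, e.g. (a minimal sketch; map_fd and dying_index are
illustrative):

#include <stdio.h>
#include <linux/types.h>
#include <bpf/bpf.h>

/* Stop routing new SYN packets to the dying listener by deleting it
 * from the BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, then keep accept()ing on
 * it to drain the queue before close().
 */
static int stop_routing_syn(int map_fd, __u32 dying_index)
{
	int err = bpf_map_delete_elem(map_fd, &dying_index);

	if (err)
		perror("bpf_map_delete_elem");

	return err;
}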


Also, setting the backlog to zero does not work well.
https://lore.kernel.org/netdev/1447262610.17135.114.camel@edumazet-glaptop2.roam.corp.google.com/

---8<---
From: Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as
 drain mode
Date: Wed, 11 Nov 2015 09:23:30 -0800
> Actually listen(fd, 0) is not going to work well :
> 
> For request_sock that were created (by incoming SYN packet) before this
> listen(fd, 0) call, the 3rd packet (ACK coming from client) would not be
> able to create a child attached to this listener.
> 
> sk_acceptq_is_full() test in tcp_v4_syn_recv_sock() would simply drop
> the thing.
---8<---


* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-18 16:25 ` Eric Dumazet
@ 2020-11-19 22:05   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-19 22:05 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed, 18 Nov 2020 17:25:44 +0100
> On 11/17/20 10:40 AM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in
> > in-flight ACK of 3WHS is responded by RST.
> > 
> 
> I know some programs are simply removing a listener from the group,
> so that they no longer handle new SYN packets,
> and wait until all timers or 3WHS have completed before closing them.
> 
> They pass fd of newly accepted children to more recent programs using af_unix fd passing,
> while in this draining mode.

Just out of curiosity, could you tell me which software does this? I would like to study it further.


> Quite frankly, mixing eBPF in the picture is distracting.

I agree.
Also, I think eBPF itself is not necessary in many cases, and I want to
make user programs simpler with this patchset.

The SO_REUSEPORT implementation is excellent for improving scalability. On
the other hand, as a trade-off, users have to understand in depth how the
kernel handles SYN packets and implement connection draining with eBPF.


> It seems you want some way to transfer request sockets (and/or not yet accepted established ones)
> from fd1 to fd2, isn't it something that should be discussed independently ?

I understand you to be asking whether the issue and the way to transfer
sockets should be discussed independently. Please correct me if I have
misunderstood your question.

The kernel handles the 3WHS, and users cannot observe it (without eBPF).
Many users believe SO_REUSEPORT should ideally distribute all connections
across the available listeners, but in fact some connections can be
aborted silently. Such users may think that if the kernel had selected
another listener, those connections would not have been dropped.

The root cause is within the kernel, so the issue should be addressed in
kernel space and should not be visible to userspace. In order not to make
users implement extra machinery, I want to fix the root cause by
transferring sockets automatically so that users do not have to care
about the kernel implementation or connection draining.

Moreover, if possible, I did not want to mix eBPF into the issue. However,
there may be cases where different applications listen on the same port
and eBPF routes packets to each of them by some rule. In such cases,
redistributing sockets against the user's intention would break the
application. This patchset works in many cases, but I added the eBPF part
to handle such cases.


* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-18 23:50   ` Martin KaFai Lau
@ 2020-11-19 22:09     ` Kuniyuki Iwashima
  2020-11-20  1:53       ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-19 22:09 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 18 Nov 2020 15:50:17 -0800
> On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > which is used only by inet_unhash(). If it is not NULL,
> > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > sockets from the closing listener to the selected one.
> > 
> > Listening sockets hold incoming connections as a linked list of struct
> > request_sock in the accept queue, and each request has reference to a full
> > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > requests from the closing listener's queue and relink them to the head of
> > the new listener's queue. We do not process each request, so the migration
> > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > sockets, we will take special care in the next commit.
> > 
> > By default, we select the last element of socks[] as the new listener.
> > This behaviour is based on how the kernel moves sockets in socks[].
> > 
> > For example, we call listen() for four sockets (A, B, C, D), and close the
> > first two by turns. The sockets move in socks[] like below. (See also [1])
> > 
> >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> >   socks[2] : C   |      socks[2] : C --'
> >   socks[3] : D --'
> > 
> > Then, if C and D have newer settings than A and B, and each socket has a
> > request (a, b, c, d) in their accept queue, we can redistribute old
> > requests evenly to new listeners.
> I don't think it should emphasize/claim there is a specific way that
> the kernel-pick here can redistribute the requests evenly.  It depends on
> how the application close/listen.  The userspace can not expect the
> ordering of socks[] will behave in a certain way.

I've expected replacing listeners by generations as a general use case.
But exactly. Users should not expect the undocumented kernel internal.


> The primary redistribution policy has to depend on BPF which is the
> policy defined by the user based on its application logic (e.g. how
> its binary restart work).  The application (and bpf) knows which one
> is a dying process and can avoid distributing to it.
> 
> The kernel-pick could be an optional fallback but not a must.  If the bpf
> prog is attached, I would even go further to call bpf to redistribute
> regardless of the sysctl, so I think the sysctl is not necessary.

I also think it is just an optional fallback, but to pick out a different
listener everytime, choosing the moved socket was reasonable. So the even
redistribution for a specific use case is a side effect of such socket
selection.

But, users should decide to use either way:
  (1) let the kernel select a new listener randomly
  (2) select a particular listener by eBPF

I will update the commit message like:
The kernel selects a new listener randomly, but as the side effect, it can
redistribute packets evenly for a specific case where an application
replaces listeners by generations.


* Re: [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md.
  2020-11-19  0:11   ` Martin KaFai Lau
@ 2020-11-19 22:10     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-19 22:10 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 18 Nov 2020 16:11:54 -0800
> On Tue, Nov 17, 2020 at 06:40:21PM +0900, Kuniyuki Iwashima wrote:
> > We will call sock_reuseport.prog for socket migration in the next commit,
> > so the eBPF program has to know which listener is closing in order to
> > select the new listener.
> > 
> > Currently, we can get a unique ID for each listener in the userspace by
> > calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
> > This patch exposes the ID to the eBPF program.
> > 
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/linux/bpf.h            | 1 +
> >  include/uapi/linux/bpf.h       | 1 +
> >  net/core/filter.c              | 8 ++++++++
> >  tools/include/uapi/linux/bpf.h | 1 +
> >  4 files changed, 11 insertions(+)
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 581b2a2e78eb..c0646eceffa2 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1897,6 +1897,7 @@ struct sk_reuseport_kern {
> >  	u32 hash;
> >  	u32 reuseport_id;
> >  	bool bind_inany;
> > +	u64 cookie;
> >  };
> >  bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
> >  				  struct bpf_insn_access_aux *info);
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 162999b12790..3fcddb032838 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -4403,6 +4403,7 @@ struct sk_reuseport_md {
> >  	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
> >  	__u32 bind_inany;	/* Is sock bound to an INANY address? */
> >  	__u32 hash;		/* A hash of the packet 4 tuples */
> > +	__u64 cookie;		/* ID of the listener in map */
> Instead of only adding the cookie of a sk, lets make the sk pointer available:
> 
> 	__bpf_md_ptr(struct bpf_sock *, sk);
> 
> and then use the BPF_FUNC_get_socket_cookie to get the cookie.
> 
> Other fields of the sk can also be directly accessed too once
> the sk pointer is available.

Oh, I did not know BPF_FUNC_get_socket_cookie.
I will add the sk pointer and use the helper function in the next spin!
Thank you.


* Re: [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration.
  2020-11-19  1:00   ` Martin KaFai Lau
@ 2020-11-19 22:13     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-19 22:13 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 18 Nov 2020 17:00:45 -0800
> On Tue, Nov 17, 2020 at 06:40:22PM +0900, Kuniyuki Iwashima wrote:
> > This patch makes it possible to select a new listener for socket migration
> > by eBPF.
> > 
> > The noteworthy point is that we select a listening socket in
> > reuseport_detach_sock() and reuseport_select_sock(), but we do not have
> > struct skb in the unhash path.
> > 
> > Since we cannot pass skb to the eBPF program, we run only the
> > BPF_PROG_TYPE_SK_REUSEPORT program by calling bpf_run_sk_reuseport() with
> > skb NULL. So, some fields derived from skb are also NULL in the eBPF
> > program.
> More things need to be considered here when skb is NULL.
> 
> Some helpers are probably assuming skb is not NULL.
> 
> Also, the sk_lookup in filter.c is actually passing a NULL skb to avoid
> doing the reuseport select.

Honestly, I had missed this point...
I wanted users to reuse the same eBPF program seamlessly, but it seems unsafe.


> > Moreover, we can cancel migration by returning SK_DROP. This feature is
> > useful when listeners have different settings at the socket API level or
> > when we want to free resources as soon as possible.
> > 
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  net/core/filter.c          | 26 +++++++++++++++++++++-----
> >  net/core/sock_reuseport.c  | 23 ++++++++++++++++++++---
> >  net/ipv4/inet_hashtables.c |  2 +-
> >  3 files changed, 42 insertions(+), 9 deletions(-)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 01e28f283962..ffc4591878b8 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -8914,6 +8914,22 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
> >  	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF,		       \
> >  					     BPF_FIELD_SIZEOF(NS, NF), 0)
> >  
> > +#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, SIZE, OFF)	\
> > +	do {									\
> > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,	\
> > +				      si->src_reg, offsetof(S, F));		\
> > +		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);		\
> Although it may not matter much, always doing this check seems not very ideal
> considering the fast path will always have skb and only the slow
> path (accept-queue migrate) has skb is NULL.  I think the req_sk usually
> has the skb also except the timer one.

Yes, but the migration happens only when/after the listener is closed, so
I think it does not occur so frequently and will not be a problem.


> First thought is to create a temp skb but it has its own issues.
> or it may actually belong to a new prog type.  However, lets keep
> exploring possible options (including NULL skb).

I had also considered both ideas, but the former would be a bit
complicated, and the latter would require users to implement a new eBPF
program. I did not want to put more burden on users, so I selected the
NULL skb. However, it is not safe, so adding a new prog type seems to be
the better way.


* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-19  1:49 ` Martin KaFai Lau
@ 2020-11-19 22:17   ` Kuniyuki Iwashima
  2020-11-20  2:31     ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-19 22:17 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 18 Nov 2020 17:49:13 -0800
> On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in
> > in-flight ACK of 3WHS is responded by RST.
> > 
> > As a workaround for this issue, we can do connection draining by eBPF:
> > 
> >   1. Before closing a listener, stop routing SYN packets to it.
> >   2. Wait enough time for requests to complete 3WHS.
> >   3. Accept connections until EAGAIN, then close the listener.
> > 
> > Although this approach seems to work well, EAGAIN has nothing to do with
> > how many requests are still during 3WHS. Thus, we have to know the number
> It sounds like the application can already drain the established socket
> by accept()?  To solve the problem that you have,
> does it mean migrating req_sk (the in-progress 3WHS) is enough?

Ideally, the application needs to drain only the accepted sockets because
3WHS and tying a connection to a listener are just kernel behaviour. Also,
there are some cases where we want to apply new configurations as soon as
possible such as replacing TLS certificates.

It is possible to drain the established sockets by accept(), but the
sockets in the accept queue have not started application sessions yet. So,
if we do not drain such sockets (or if the kernel happened to select
another listener), we can (could) apply the new settings much earlier.

Moreover, the established sockets may start long-standing connections so
that we cannot complete draining for a long time and may have to
force-close them (and they would have longer lifetime if they are migrated
to a new listener).


> Applications can already use the bpf prog to do (1) and divert
> the SYN to the newly started process.
> 
> If the application cares about service disruption,
> it usually needs to drain the fd(s) that it already has and
> finishes serving the pending request (e.g. https) on them anyway.
> The time taking to finish those could already be longer than it takes
> to drain the accept queue or finish off the 3WHS in reasonable time.
> or the application that you have does not need to drain the fd(s) 
> it already has and it can close them immediately?

In the point of view of service disruption, I agree with you.

However, I think that there are some situations where we want to apply new
configurations rather than to drain sockets with old configurations and
that if the kernel migrates sockets automatically, we can simplify user
programs.



* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-19 22:09     ` Kuniyuki Iwashima
@ 2020-11-20  1:53       ` Martin KaFai Lau
  2020-11-21 10:13         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-20  1:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> Date: Wed, 18 Nov 2020 15:50:17 -0800
> > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > which is used only by inet_unhash(). If it is not NULL,
> > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > sockets from the closing listener to the selected one.
> > > 
> > > Listening sockets hold incoming connections as a linked list of struct
> > > request_sock in the accept queue, and each request has reference to a full
> > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > requests from the closing listener's queue and relink them to the head of
> > > the new listener's queue. We do not process each request, so the migration
> > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > sockets, we will take special care in the next commit.
> > > 
> > > By default, we select the last element of socks[] as the new listener.
> > > This behaviour is based on how the kernel moves sockets in socks[].
> > > 
> > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > 
> > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > >   socks[2] : C   |      socks[2] : C --'
> > >   socks[3] : D --'
> > > 
> > > Then, if C and D have newer settings than A and B, and each socket has a
> > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > requests evenly to new listeners.
> > I don't think it should emphasize/claim there is a specific way that
> > the kernel-pick here can redistribute the requests evenly.  It depends on
> > how the application close/listen.  The userspace can not expect the
> > ordering of socks[] will behave in a certain way.
> 
> I've expected replacing listeners by generations as a general use case.
> But exactly. Users should not expect the undocumented kernel internal.
> 
> 
> > The primary redistribution policy has to depend on BPF which is the
> > policy defined by the user based on its application logic (e.g. how
> > its binary restart work).  The application (and bpf) knows which one
> > is a dying process and can avoid distributing to it.
> > 
> > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > prog is attached, I would even go further to call bpf to redistribute
> > regardless of the sysctl, so I think the sysctl is not necessary.
> 
> I also think it is just an optional fallback, but to pick out a different
> listener everytime, choosing the moved socket was reasonable. So the even
> redistribution for a specific use case is a side effect of such socket
> selection.
> 
> But, users should decide to use either way:
>   (1) let the kernel select a new listener randomly
>   (2) select a particular listener by eBPF
> 
> I will update the commit message like:
> The kernel selects a new listener randomly, but as the side effect, it can
> redistribute packets evenly for a specific case where an application
> replaces listeners by generations.
Since there is no feedback on sysctl, so may be something missed
in the lines.

I don't think this migration logic should depend on a sysctl.
At least not when a bpf prog is attached that is capable of doing
migration, it is too fragile to ask user to remember to turn on
the sysctl before attaching the bpf prog.

Your use case is to primarily based on bpf prog to pick or only based
on kernel to do a random pick?

Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
change it until all the listening sockets are closed which is exactly
the service disruption problem this series is trying to solve here.


* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-19 22:17   ` Kuniyuki Iwashima
@ 2020-11-20  2:31     ` Martin KaFai Lau
  2020-11-21 10:16       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-20  2:31 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Wed, 18 Nov 2020 17:49:13 -0800
> > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation. When a SYN packet is received, the connection is tied to a
> > > listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners could accept such connections.
> > > 
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the new
> > > configuration and closes all listeners on the old workers, resulting in
> > > in-flight ACK of 3WHS is responded by RST.
> > > 
> > > As a workaround for this issue, we can do connection draining by eBPF:
> > > 
> > >   1. Before closing a listener, stop routing SYN packets to it.
> > >   2. Wait enough time for requests to complete 3WHS.
> > >   3. Accept connections until EAGAIN, then close the listener.
> > > 
> > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > how many requests are still during 3WHS. Thus, we have to know the number
> > It sounds like the application can already drain the established socket
> > by accept()?  To solve the problem that you have,
> > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> 
> Ideally, the application needs to drain only the accepted sockets because
> 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> there are some cases where we want to apply new configurations as soon as
> possible such as replacing TLS certificates.
> 
> It is possible to drain the established sockets by accept(), but the
> sockets in the accept queue have not started application sessions yet. So,
> if we do not drain such sockets (or if the kernel happened to select
> another listener), we can (could) apply the new settings much earlier.
> 
> Moreover, the established sockets may start long-standing connections so
> that we cannot complete draining for a long time and may have to
> force-close them (and they would have longer lifetime if they are migrated
> to a new listener).
> 
> 
> > Applications can already use the bpf prog to do (1) and divert
> > the SYN to the newly started process.
> > 
> > If the application cares about service disruption,
> > it usually needs to drain the fd(s) that it already has and
> > finishes serving the pending request (e.g. https) on them anyway.
> > The time taking to finish those could already be longer than it takes
> > to drain the accept queue or finish off the 3WHS in reasonable time.
> > or the application that you have does not need to drain the fd(s) 
> > it already has and it can close them immediately?
> 
> In the point of view of service disruption, I agree with you.
> 
> However, I think that there are some situations where we want to apply new
> configurations rather than to drain sockets with old configurations and
> that if the kernel migrates sockets automatically, we can simplify user
> programs.
This configuration-update(/new-TLS-cert...etc) consideration will be useful
if it is also included in the cover letter.

It sounds like the service that you have is draining the existing
already-accepted fd(s) which are using the old configuration.
Those existing fd(s) could also be long life.  Potentially those
existing fd(s) will be in a much higher number than the
to-be-accepted fd(s)?

or you meant in some cases it wants to migrate to the new configuration
ASAP (e.g. for security reason) even it has to close all the
already-accepted fds() which are using the old configuration??

In either cases, considering the already-accepted fd(s)
is usually in a much more number, does the to-be-accepted
connection make any difference percentage-wise?


* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-20  1:53       ` Martin KaFai Lau
@ 2020-11-21 10:13         ` Kuniyuki Iwashima
  2020-11-23  0:40           ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-21 10:13 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 19 Nov 2020 17:53:46 -0800
> On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <kafai@fb.com>
> > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > which is used only by inet_unhash(). If it is not NULL,
> > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > sockets from the closing listener to the selected one.
> > > > 
> > > > Listening sockets hold incoming connections as a linked list of struct
> > > > request_sock in the accept queue, and each request has reference to a full
> > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > requests from the closing listener's queue and relink them to the head of
> > > > the new listener's queue. We do not process each request, so the migration
> > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > sockets, we will take special care in the next commit.
> > > > 
> > > > By default, we select the last element of socks[] as the new listener.
> > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > 
> > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > 
> > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > >   socks[2] : C   |      socks[2] : C --'
> > > >   socks[3] : D --'
> > > > 
> > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > requests evenly to new listeners.
> > > I don't think it should emphasize/claim there is a specific way that
> > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > how the application close/listen.  The userspace can not expect the
> > > ordering of socks[] will behave in a certain way.
> > 
> > I've expected replacing listeners by generations as a general use case.
> > But exactly. Users should not expect the undocumented kernel internal.
> > 
> > 
> > > The primary redistribution policy has to depend on BPF which is the
> > > policy defined by the user based on its application logic (e.g. how
> > > its binary restart work).  The application (and bpf) knows which one
> > > is a dying process and can avoid distributing to it.
> > > 
> > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > prog is attached, I would even go further to call bpf to redistribute
> > > regardless of the sysctl, so I think the sysctl is not necessary.
> > 
> > I also think it is just an optional fallback, but to pick out a different
> > listener everytime, choosing the moved socket was reasonable. So the even
> > redistribution for a specific use case is a side effect of such socket
> > selection.
> > 
> > But, users should decide to use either way:
> >   (1) let the kernel select a new listener randomly
> >   (2) select a particular listener by eBPF
> > 
> > I will update the commit message like:
> > The kernel selects a new listener randomly, but as the side effect, it can
> > redistribute packets evenly for a specific case where an application
> > replaces listeners by generations.
> Since there is no feedback on sysctl, so may be something missed
> in the lines.

I'm sorry, I have missed this point while thinking about each reply...


> I don't think this migration logic should depend on a sysctl.
> At least not when a bpf prog is attached that is capable of doing
> migration, it is too fragile to ask user to remember to turn on
> the sysctl before attaching the bpf prog.
> 
> Your use case is to primarily based on bpf prog to pick or only based
> on kernel to do a random pick?

I think we have to care about both cases.

I think we can always enable the migration feature if eBPF prog is not
attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
to select a listener by some rules, along updating the kernel,
redistributing requests without user intention can break the application.
So, there is something needed to confirm user intension at least if eBPF
prog is attached.

But honestly, I believe such eBPF users could follow this change and
implement a migration eBPF prog even if we introduced it as a breaking
change.


> Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> change it until all the listening sockets are closed which is exactly
> the service disruption problem this series is trying to solve here.

Oh, exactly...
If we apply this series by live patching, we cannot enable the feature
without service disruption.

To enable the migration feature dynamically, how about this logic?
In this logic, we do not save the sysctl value at *reuse but check it each time.

  1. no eBPF prog attached -> ON
  2. eBPF prog attached and sysctl is 0 -> OFF
  3. eBPF prog attached and sysctl is 1 -> ON

So, replacing 

  if (reuse->migrate_req)

with

  if (!reuse->prog || net->ipv4.sysctl_tcp_migrate_req)
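
i.e. something like this hypothetical helper (only reuse->prog and the
sysctl from this series are real; the helper itself is just to illustrate
the check):

static bool reuseport_should_migrate_req(struct sock_reuseport *reuse,
					 struct net *net)
{
	/* 1. no eBPF prog attached          -> ON
	 * 2. eBPF prog attached, sysctl = 0 -> OFF
	 * 3. eBPF prog attached, sysctl = 1 -> ON
	 */
	return !rcu_access_pointer(reuse->prog) ||
	       net->ipv4.sysctl_tcp_migrate_req;
}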


* Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
  2020-11-20  2:31     ` Martin KaFai Lau
@ 2020-11-21 10:16       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-21 10:16 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 19 Nov 2020 18:31:57 -0800
> On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Wed, 18 Nov 2020 17:49:13 -0800
> > > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > > accept connections evenly. However, there is a defect in the current
> > > > implementation. When a SYN packet is received, the connection is tied to a
> > > > listening socket. Accordingly, when the listener is closed, in-flight
> > > > requests during the three-way handshake and child sockets in the accept
> > > > queue are dropped even if other listeners could accept such connections.
> > > > 
> > > > This situation can happen when various server management tools restart
> > > > server (such as nginx) processes. For instance, when we change nginx
> > > > configurations and restart it, it spins up new workers that respect the new
> > > > configuration and closes all listeners on the old workers, resulting in
> > > > in-flight ACK of 3WHS is responded by RST.
> > > > 
> > > > As a workaround for this issue, we can do connection draining by eBPF:
> > > > 
> > > >   1. Before closing a listener, stop routing SYN packets to it.
> > > >   2. Wait enough time for requests to complete 3WHS.
> > > >   3. Accept connections until EAGAIN, then close the listener.
> > > > 
> > > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > > how many requests are still during 3WHS. Thus, we have to know the number
> > > It sounds like the application can already drain the established socket
> > > by accept()?  To solve the problem that you have,
> > > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> > 
> > Ideally, the application needs to drain only the accepted sockets because
> > 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> > there are some cases where we want to apply new configurations as soon as
> > possible such as replacing TLS certificates.
> > 
> > It is possible to drain the established sockets by accept(), but the
> > sockets in the accept queue have not started application sessions yet. So,
> > if we do not drain such sockets (or if the kernel happened to select
> > another listener), we can (could) apply the new settings much earlier.
> > 
> > Moreover, the established sockets may start long-standing connections so
> > that we cannot complete draining for a long time and may have to
> > force-close them (and they would have longer lifetime if they are migrated
> > to a new listener).
> > 
> > 
> > > Applications can already use the bpf prog to do (1) and divert
> > > the SYN to the newly started process.
> > > 
> > > If the application cares about service disruption,
> > > it usually needs to drain the fd(s) that it already has and
> > > finishes serving the pending request (e.g. https) on them anyway.
> > > The time taking to finish those could already be longer than it takes
> > > to drain the accept queue or finish off the 3WHS in reasonable time.
> > > or the application that you have does not need to drain the fd(s) 
> > > it already has and it can close them immediately?
> > 
> > In the point of view of service disruption, I agree with you.
> > 
> > However, I think that there are some situations where we want to apply new
> > configurations rather than to drain sockets with old configurations and
> > that if the kernel migrates sockets automatically, we can simplify user
> > programs.
> This configuration-update(/new-TLS-cert...etc) consideration will be useful
> if it is also included in the cover letter.

I will add this to the next cover letter.


> It sounds like the service that you have is draining the existing
> already-accepted fd(s) which are using the old configuration.
> Those existing fd(s) could also be long life.  Potentially those
> existing fd(s) will be in a much higher number than the
> to-be-accepted fd(s)?

In many cases, yes.


> or you meant in some cases it wants to migrate to the new configuration
> ASAP (e.g. for security reason) even it has to close all the
> already-accepted fds() which are using the old configuration??

And sometimes, yes.
As you expected, for reasons including security, there are cases where we
have to prioritize closing connections over completing them.

For example, HTTP/1.1 connections are often short-lived, so we can
complete draining quickly. However, a connection can become long-lived by
upgrading to WebSocket, and then we may not be able to wait for draining
to finish.


> In either cases, considering the already-accepted fd(s)
> is usually in a much more number, does the to-be-accepted
> connection make any difference percentage-wise?

It is difficult to drain all connections in every case, but migration can
reduce the number of such aborted connections. In that sense, I think
migration is always better than draining.


* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-21 10:13         ` Kuniyuki Iwashima
@ 2020-11-23  0:40           ` Martin KaFai Lau
  2020-11-24  9:24             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2020-11-23  0:40 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > From: Martin KaFai Lau <kafai@fb.com>
> > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > sockets from the closing listener to the selected one.
> > > > > 
> > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > > requests from the closing listener's queue and relink them to the head of
> > > > > the new listener's queue. We do not process each request, so the migration
> > > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > > sockets, we will take special care in the next commit.
> > > > > 
> > > > > By default, we select the last element of socks[] as the new listener.
> > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > 
> > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > > 
> > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > >   socks[2] : C   |      socks[2] : C --'
> > > > >   socks[3] : D --'
> > > > > 
> > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > requests evenly to new listeners.
> > > > I don't think it should emphasize/claim there is a specific way that
> > > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > > how the application close/listen.  The userspace can not expect the
> > > > ordering of socks[] will behave in a certain way.
> > > 
> > > I've expected replacing listeners by generations as a general use case.
> > > But exactly. Users should not expect the undocumented kernel internal.
> > > 
> > > 
> > > > The primary redistribution policy has to depend on BPF which is the
> > > > policy defined by the user based on its application logic (e.g. how
> > > > its binary restart work).  The application (and bpf) knows which one
> > > > is a dying process and can avoid distributing to it.
> > > > 
> > > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > > prog is attached, I would even go further to call bpf to redistribute
> > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > 
> > > I also think it is just an optional fallback, but to pick out a different
> > > listener everytime, choosing the moved socket was reasonable. So the even
> > > redistribution for a specific use case is a side effect of such socket
> > > selection.
> > > 
> > > But, users should decide to use either way:
> > >   (1) let the kernel select a new listener randomly
> > >   (2) select a particular listener by eBPF
> > > 
> > > I will update the commit message like:
> > > The kernel selects a new listener randomly, but as the side effect, it can
> > > redistribute packets evenly for a specific case where an application
> > > replaces listeners by generations.
> > Since there is no feedback on sysctl, so may be something missed
> > in the lines.
> 
> I'm sorry, I have missed this point while thinking about each reply...
> 
> 
> > I don't think this migration logic should depend on a sysctl.
> > At least not when a bpf prog is attached that is capable of doing
> > migration, it is too fragile to ask user to remember to turn on
> > the sysctl before attaching the bpf prog.
> > 
> > Your use case is to primarily based on bpf prog to pick or only based
> > on kernel to do a random pick?
Again, what is your primary use case?

> 
> I think we have to care about both cases.
> 
> I think we can always enable the migration feature if eBPF prog is not
> attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
> to select a listener by some rules, along updating the kernel,
> redistributing requests without user intention can break the application.
> So, there is something needed to confirm user intention at least if eBPF
> prog is attached.
Right, something that can tell whether the bpf prog can do migration
would confirm the user intention here.  However, this will not be a
sysctl.

A new bpf_attach_type "BPF_SK_REUSEPORT_SELECT_OR_MIGRATE" can be added.
"prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE"
can be used to decide if migration can be done by the bpf prog.
Although the prog->expected_attach_type has not been checked for
BPF_PROG_TYPE_SK_REUSEPORT, there was an earlier discussion
that the risk of breaking is very small and is acceptable.

Instead of depending on !reuse_md->data to decide if it
is doing migration or not, a clearer signal should be given
to the bpf prog.  A "u8 migration" can be added to "struct sk_reuseport_kern"
(and to "struct sk_reuseport_md" accordingly).  It can tell
the bpf prog that it is doing migration.  It should also tell if it is
migrating a list of established sk(s) or an individual req_sk.
Accessing "reuse_md->migration" should only be allowed for
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE during is_valid_access().

During migration, if an skb is not available, an empty skb can be used.
Migration is a slow path and does not happen very often, so it will
be fine even if it has to create a temp skb (or maybe a static const skb
can be used; not sure, but these are implementation details).
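
To make the idea concrete, a rough sketch of such a prog could look like
below.  The "migration" field, the section name, and the map layout are
only illustrative here (not a final uapi); bpf_sk_select_reuseport() is
the existing helper:

/* Rough sketch only: assumes the proposed "migration" field in
 * struct sk_reuseport_md and the SELECT_OR_MIGRATE attach type.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, __u64);
} target_listeners SEC(".maps");

SEC("sk_reuseport/migrate")	/* illustrative section name */
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
	__u32 key = 0;	/* slot holding a listener with the new config */

	if (reuse_md->migration) {
		/* Closing listener: move the req_sk (or the accept
		 * queue, depending on the migration value) to the
		 * listener stored in target_listeners[0].  On failure,
		 * drop; returning SK_PASS with no selection would fall
		 * back to the kernel random pick instead.
		 */
		if (bpf_sk_select_reuseport(reuse_md, &target_listeners,
					    &key, 0))
			return SK_DROP;
		return SK_PASS;
	}

	/* Normal SYN-time path: try to pick a listener; SK_PASS with
	 * no selection keeps the kernel default behaviour.
	 */
	bpf_sk_select_reuseport(reuse_md, &target_listeners, &key, 0);
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";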

> 
> But honestly, I believe such eBPF users can follow this change and
> implement migration eBPF prog if we introduce such a breaking change.
> 
> 
> > Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> > change it until all the listening sockets are closed which is exactly
> > the service disruption problem this series is trying to solve here.
> 
> Oh, exactly...
> If we apply this series by live patching, we cannot enable the feature
> without service disruption.
> 
> To enable the migration feature dynamically, how about this logic?
> In this logic, we do not save the sysctl value and check it at each time.
> 
>   1. no eBPF prog attached -> ON
>   2. eBPF prog attached and sysctl is 0 -> OFF
No.  When a bpf prog is attached and it clearly signals (via expected_attach_type
here) that it can do migration, it should not depend on anything else.  That
would be very confusing to use.  When a prog is successfully loaded, verified,
and attached, it is expected to run.

This sysctl essentially only disables the bpf prog with
type == BPF_PROG_TYPE_SK_REUSEPORT from running at a particular point.
This goes down the path of adding another sysctl in the future
to disable another bpf prog type.  If there were a need to disable
bpf progs on a type-by-type basis, it would need a more
generic solution on the bpf side, done in a consistent way
for all prog types.  That needs a separate and longer discussion.

None of the behaviors of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE bpf prog
should depend on this sysctl at all.

/* Pseudo code to show the idea only.
 * Actual implementation should try to fit
 * better into the current code and should look
 * quite different from here.
 */

if ((prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
	/* call bpf to migrate */
	action = BPF_PROG_RUN(prog, &reuse_kern);

	if (action == SK_PASS) {
		if (!reuse_kern.selected_sk)
			/* fallback to kernel random pick */
		else
			/* migrate to reuse_kern.selected_sk */
	} else {
		/* action == SK_DROP. don't do migration at all and
		 * don't fallback to kernel random pick.
		 */ 
	}
}

Going back to the sysctl, with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
do you still have a need to add sysctl_tcp_migrate_req?
Regardless, if there is still a need,
the document for sysctl_tcp_migrate_req should be something like:
"the kernel will do a random pick when there is no bpf prog
 attached to the reuseport group...."

[ ps, my reply will be slow this week. ]

>   3. eBPF prog attached and sysctl is 1 -> ON
> 
> So, replacing 
> 
>   if (reuse->migrate_req)
> 
> to 
> 
>   if (!reuse->prog || net->ipv4.sysctl_tcp_migrate_req)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-11-23  0:40           ` Martin KaFai Lau
@ 2020-11-24  9:24             ` Kuniyuki Iwashima
  0 siblings, 0 replies; 27+ messages in thread
From: Kuniyuki Iwashima @ 2020-11-24  9:24 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Sun, 22 Nov 2020 16:40:20 -0800
> On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > > From: Martin KaFai Lau <kafai@fb.com>
> > > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > sockets from the closing listener to the selected one.
> > > > > > 
> > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > > > requests from the closing listener's queue and relink them to the head of
> > > > > > the new listener's queue. We do not process each request, so the migration
> > > > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > > > sockets, we will take special care in the next commit.
> > > > > > 
> > > > > > By default, we select the last element of socks[] as the new listener.
> > > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > > 
> > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > > > 
> > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > >   socks[3] : D --'
> > > > > > 
> > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > requests evenly to new listeners.
> > > > > I don't think it should emphasize/claim there is a specific way that
> > > > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > > > how the application close/listen.  The userspace can not expect the
> > > > > ordering of socks[] will behave in a certain way.
> > > > 
> > > > I've expected replacing listeners by generations as a general use case.
> > > > But exactly. Users should not expect the undocumented kernel internal.
> > > > 
> > > > 
> > > > > The primary redistribution policy has to depend on BPF which is the
> > > > > policy defined by the user based on its application logic (e.g. how
> > > > > its binary restart work).  The application (and bpf) knows which one
> > > > > is a dying process and can avoid distributing to it.
> > > > > 
> > > > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > > > prog is attached, I would even go further to call bpf to redistribute
> > > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > > 
> > > > I also think it is just an optional fallback, but to pick out a different
> > > > listener everytime, choosing the moved socket was reasonable. So the even
> > > > redistribution for a specific use case is a side effect of such socket
> > > > selection.
> > > > 
> > > > But, users should decide to use either way:
> > > >   (1) let the kernel select a new listener randomly
> > > >   (2) select a particular listener by eBPF
> > > > 
> > > > I will update the commit message like:
> > > > The kernel selects a new listener randomly, but as the side effect, it can
> > > > redistribute packets evenly for a specific case where an application
> > > > replaces listeners by generations.
> > > Since there is no feedback on sysctl, so may be something missed
> > > in the lines.
> > 
> > I'm sorry, I have missed this point while thinking about each reply...
> > 
> > 
> > > I don't think this migration logic should depend on a sysctl.
> > > At least not when a bpf prog is attached that is capable of doing
> > > migration, it is too fragile to ask user to remember to turn on
> > > the sysctl before attaching the bpf prog.
> > > 
> > > Your use case is to primarily based on bpf prog to pick or only based
> > > on kernel to do a random pick?
> Again, what is your primary use case?

We have so many services and components that I cannot grasp all of their
implementations, but I started this series because a service component that
relies on the kernel's random pick suffered from this issue.


> > I think we have to care about both cases.
> > 
> > I think we can always enable the migration feature if eBPF prog is not
> > attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
> > to select a listener by some rules, along updating the kernel,
> > redistributing requests without user intention can break the application.
> > So, there is something needed to confirm user intention at least if eBPF
> > prog is attached.
> Right, something that can tell whether the bpf prog can do migration
> would confirm the user intention here.  However, this will not be a
> sysctl.
> 
> A new bpf_attach_type "BPF_SK_REUSEPORT_SELECT_OR_MIGRATE" can be added.
> "prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE"
> can be used to decide if migration can be done by the bpf prog.
> Although the prog->expected_attach_type has not been checked for
> BPF_PROG_TYPE_SK_REUSEPORT, there was an earlier discussion
> that the risk of breaking is very small and is acceptable.
> 
> Instead of depending on !reuse_md->data to decide if it
> is doing migration or not, a clearer signal should be given
> to the bpf prog.  A "u8 migration" can be added to "struct sk_reuseport_kern"
> (and to "struct sk_reuseport_md" accordingly).  It can tell
> the bpf prog that it is doing migration.  It should also tell if it is
> migrating a list of established sk(s) or an individual req_sk.
> Accessing "reuse_md->migration" should only be allowed for
> BPF_SK_REUSEPORT_SELECT_OR_MIGRATE during is_valid_access().
> 
> During migration, if an skb is not available, an empty skb can be used.
> Migration is a slow path and does not happen very often, so it will
> be fine even if it has to create a temp skb (or maybe a static const skb
> can be used; not sure, but these are implementation details).

I greatly appreciate your detailed idea and explanation!
I will try to implement this.


> > But honestly, I believe such eBPF users can follow this change and
> > implement migration eBPF prog if we introduce such a breaking change.
> > 
> > 
> > > Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> > > change it until all the listening sockets are closed which is exactly
> > > the service disruption problem this series is trying to solve here.
> > 
> > Oh, exactly...
> > If we apply this series by live patching, we cannot enable the feature
> > without service disruption.
> > 
> > To enable the migration feature dynamically, how about this logic?
> > In this logic, we do not save the sysctl value and check it at each time.
> > 
> >   1. no eBPF prog attached -> ON
> >   2. eBPF prog attached and sysctl is 0 -> OFF
> No.  When a bpf prog is attached and it clearly signals (via expected_attach_type
> here) that it can do migration, it should not depend on anything else.  That
> would be very confusing to use.  When a prog is successfully loaded, verified,
> and attached, it is expected to run.
> 
> This sysctl essentially only disables the bpf prog with
> type == BPF_PROG_TYPE_SK_REUSEPORT from running at a particular point.
> This goes down the path of adding another sysctl in the future
> to disable another bpf prog type.  If there were a need to disable
> bpf progs on a type-by-type basis, it would need a more
> generic solution on the bpf side, done in a consistent way
> for all prog types.  That needs a separate and longer discussion.
> 
> None of the behaviors of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE bpf prog
> should depend on this sysctl at all.
> 
> /* Pseudo code to show the idea only.
>  * Actual implementation should try to fit
>  * better into the current code and should look
>  * quite different from here.
>  */
> 
> if ((prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
> 	/* call bpf to migrate */
> 	action = BPF_PROG_RUN(prog, &reuse_kern);
> 
> 	if (action == SK_PASS) {
> 		if (!reuse_kern.selected_sk)
> 			/* fallback to kernel random pick */
> 		else
> 			/* migrate to reuse_kern.selected_sk */
> 	} else {
> 		/* action == SK_DROP. don't do migration at all and
> 		 * don't fallback to kernel random pick.
> 		 */ 
> 	}
> }
> 
> Going back to the sysctl, with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> do you still have a need to add sysctl_tcp_migrate_req?

No, I no longer think the option should be a sysctl.
It will be BPF_SK_REUSEPORT_SELECT_OR_MIGRATE in the next series.
Thank you!
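
For reference, a rough sketch of how an application could load and attach
such a prog might look like below (assuming the attach type is exposed as
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE; the object file and prog names are only
placeholders following the sketch earlier in the thread):

/* Rough sketch only: load a SELECT_OR_MIGRATE prog and attach it to a
 * listening socket.  "select_or_migrate.bpf.o" is a hypothetical object.
 */
#include <bpf/libbpf.h>
#include <sys/socket.h>

static int attach_migrate_prog(int listen_fd)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int prog_fd;

	obj = bpf_object__open_file("select_or_migrate.bpf.o", NULL);
	if (!obj)
		return -1;

	prog = bpf_object__find_program_by_name(obj, "select_or_migrate");
	if (!prog)
		return -1;

	/* Signal that this prog is capable of doing migration. */
	bpf_program__set_expected_attach_type(prog,
				BPF_SK_REUSEPORT_SELECT_OR_MIGRATE);

	if (bpf_object__load(obj))
		return -1;

	prog_fd = bpf_program__fd(prog);

	/* Attach to the reuseport group via the existing sockopt. */
	if (setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
		       &prog_fd, sizeof(prog_fd)))
		return -1;

	return 0;
}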


> Regardless, if there is still a need,
> the document for sysctl_tcp_migrate_req should be something like:
> "the kernel will do a random pick when there is no bpf prog
>  attached to the reuseport group...."
> 
> [ ps, my reply will be slow this week. ]

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-11-24  9:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2020-11-18 23:50   ` Martin KaFai Lau
2020-11-19 22:09     ` Kuniyuki Iwashima
2020-11-20  1:53       ` Martin KaFai Lau
2020-11-21 10:13         ` Kuniyuki Iwashima
2020-11-23  0:40           ` Martin KaFai Lau
2020-11-24  9:24             ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md Kuniyuki Iwashima
2020-11-19  0:11   ` Martin KaFai Lau
2020-11-19 22:10     ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
2020-11-19  1:00   ` Martin KaFai Lau
2020-11-19 22:13     ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT " Kuniyuki Iwashima
2020-11-18  9:18 ` [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT David Laight
2020-11-19 22:01   ` Kuniyuki Iwashima
2020-11-18 16:25 ` Eric Dumazet
2020-11-19 22:05   ` Kuniyuki Iwashima
2020-11-19  1:49 ` Martin KaFai Lau
2020-11-19 22:17   ` Kuniyuki Iwashima
2020-11-20  2:31     ` Martin KaFai Lau
2020-11-21 10:16       ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).