linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT.
@ 2020-12-01 14:44 Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
                   ` (10 more replies)
  0 siblings, 11 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation[1]. When a SYN packet is received, the connection is tied to
a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.

This situation can happen when various server management tools restart
server (such as nginx) processes. For instance, when we change nginx
configurations and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, resulting in the
in-flight ACK of 3WHS is responded by RST.

The SO_REUSEPORT option is excellent to improve scalability. On the other
hand, as a trade-off, users have to know deeply how the kernel handles SYN
packets and implement connection draining by eBPF[2]:

  1. Stop routing SYN packets to the listener by eBPF.
  2. Wait for all timers to expire to complete requests
  3. Accept connections until EAGAIN, then close the listener.
  
or

  1. Start counting SYN packets and accept syscalls using eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.

In either way, we cannot close a listener immediately. However, ideally,
the application need not drain the not yet accepted sockets because 3WHS
and tying a connection to a listener are just the kernel behaviour. The
root cause is within the kernel, so the issue should be addressed in kernel
space and should not be visible to user space. This patchset fixes it so
that users need not take care of kernel implementation and connection
draining. With this patchset, the kernel redistributes requests and
connections from a listener to others in the same reuseport group at/after
close() or shutdown() syscalls.

Although some software does connection draining, there are still merits in
migration. For some security reasons such as replacing TLS certificates, we
may want to apply new settings as soon as possible and/or we may not be
able to wait for connection draining. The sockets in the accept queue have
not started application sessions yet. So, if we do not drain such sockets,
they can be handled by the newer listeners and could have a longer
lifetime. It is difficult to drain all connections in every case, but we
can decrease such aborted connections by migration. In that sense,
migration is always better than draining. 

Moreover, auto-migration simplifies userspace logic and also works well in
a case where we cannot modify and build a server program to implement the
workaround.

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
cause errors. In such a case, we have to use eBPF program to select a
specific listener or to cancel migration.


Link:

 [1] The SO_REUSEPORT socket option
 https://lwn.net/Articles/542629/

 [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
 https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/


Changelog:

 v1:
  * Remove the sysctl option
  * Enable migration if eBPF progam is not attached
  * Add expected_attach_type to check if eBPF program can migrate sockets
  * Add a field to tell migration type to eBPF program
  * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
  * Allocate an empty skb if skb is NULL
  * Pass req_to_sk(req)->sk_hash because listener's hash is zero
  * Update commit messages and coverletter

 RFC v0:
 https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/


Kuniyuki Iwashima (11):
  tcp: Keep TCP_CLOSE sockets in the reuseport group.
  bpf: Define migration types for SO_REUSEPORT.
  tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  tcp: Migrate TCP_NEW_SYN_RECV requests.
  bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
  bpf: Add migration to sk_reuseport_(kern|md).
  bpf: Support bpf_get_socket_cookie_sock() for
    BPF_PROG_TYPE_SK_REUSEPORT.
  bpf: Call bpf_run_sk_reuseport() for socket migration.
  bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

 include/linux/bpf.h                           |   1 +
 include/linux/filter.h                        |   4 +-
 include/net/inet_connection_sock.h            |  13 ++
 include/net/request_sock.h                    |  13 ++
 include/net/sock_reuseport.h                  |  15 +-
 include/uapi/linux/bpf.h                      |  25 +++
 kernel/bpf/syscall.c                          |   8 +
 net/core/filter.c                             |  46 ++++-
 net/core/sock_reuseport.c                     | 128 +++++++++++---
 net/ipv4/inet_connection_sock.c               |  85 ++++++++-
 net/ipv4/inet_hashtables.c                    |   9 +-
 net/ipv4/tcp_ipv4.c                           |   9 +-
 net/ipv6/tcp_ipv6.c                           |   9 +-
 tools/include/uapi/linux/bpf.h                |  25 +++
 tools/lib/bpf/libbpf.c                        |   5 +-
 .../bpf/prog_tests/migrate_reuseport.c        | 164 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport_kern.c   |  54 ++++++
 17 files changed, 565 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c

-- 
2.17.2 (Apple Git-113)


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-05  1:31   ` Martin KaFai Lau
  2020-12-01 14:44 ` [PATCH v1 bpf-next 02/11] bpf: Define migration types for SO_REUSEPORT Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This patch is a preparation patch to migrate incoming connections in the
later commits and adds a field (num_closed_socks) to the struct
sock_reuseport to keep TCP_CLOSE sockets in the reuseport group.

When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.

The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So, we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, it is impossible because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and to
have access to it while any child socket references to them. The point is
that reuseport_detach_sock() is called twice from inet_unhash() and
sk_destruct(). At first, it moves the socket backwards in socks[] and
increments num_closed_socks. Later, when all migrated connections are
accepted, it removes the socket from socks[], decrements num_closed_socks,
and sets NULL to sk_reuseport_cb.

By this change, closed sockets can keep sk_reuseport_cb until all child
requests have been freed or accepted. Consequently calling listen() after
shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or
inet_csk_bind_conflict() which expect that such sockets should not have the
reuseport group. Therefore, this patch also loosens such validation rules
so that the socket can listen again if it has the same reuseport group with
other listening sockets.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/sock_reuseport.h    |  5 ++-
 net/core/sock_reuseport.c       | 79 +++++++++++++++++++++++++++------
 net/ipv4/inet_connection_sock.c |  7 ++-
 3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 505f1e18e9bf..0e558ca7afbf 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
 struct sock_reuseport {
 	struct rcu_head		rcu;
 
-	u16			max_socks;	/* length of socks */
-	u16			num_socks;	/* elements in socks */
+	u16			max_socks;		/* length of socks */
+	u16			num_socks;		/* elements in socks */
+	u16			num_closed_socks;	/* closed elements in socks */
 	/* The last synq overflow event timestamp of this
 	 * reuse->socks[] group.
 	 */
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index bbdd3c7b6cb5..fd133516ac0e 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -98,16 +98,21 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 		return NULL;
 
 	more_reuse->num_socks = reuse->num_socks;
+	more_reuse->num_closed_socks = reuse->num_closed_socks;
 	more_reuse->prog = reuse->prog;
 	more_reuse->reuseport_id = reuse->reuseport_id;
 	more_reuse->bind_inany = reuse->bind_inany;
 	more_reuse->has_conns = reuse->has_conns;
+	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
 
 	memcpy(more_reuse->socks, reuse->socks,
 	       reuse->num_socks * sizeof(struct sock *));
-	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
+	memcpy(more_reuse->socks +
+	       (more_reuse->max_socks - more_reuse->num_closed_socks),
+	       reuse->socks + reuse->num_socks,
+	       reuse->num_closed_socks * sizeof(struct sock *));
 
-	for (i = 0; i < reuse->num_socks; ++i)
+	for (i = 0; i < reuse->max_socks; ++i)
 		rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
 				   more_reuse);
 
@@ -129,6 +134,25 @@ static void reuseport_free_rcu(struct rcu_head *head)
 	kfree(reuse);
 }
 
+static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk,
+				bool closed)
+{
+	int left, right;
+
+	if (!closed) {
+		left = 0;
+		right = reuse->num_socks;
+	} else {
+		left = reuse->max_socks - reuse->num_closed_socks;
+		right = reuse->max_socks;
+	}
+
+	for (; left < right; left++)
+		if (reuse->socks[left] == sk)
+			return left;
+	return -1;
+}
+
 /**
  *  reuseport_add_sock - Add a socket to the reuseport group of another.
  *  @sk:  New socket to add to the group.
@@ -153,12 +177,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 					  lockdep_is_held(&reuseport_lock));
 	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					     lockdep_is_held(&reuseport_lock));
-	if (old_reuse && old_reuse->num_socks != 1) {
+
+	if (old_reuse == reuse) {
+		int i = reuseport_sock_index(reuse, sk, true);
+
+		if (i == -1) {
+			spin_unlock_bh(&reuseport_lock);
+			return -EBUSY;
+		}
+
+		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+		reuse->num_closed_socks--;
+	} else if (old_reuse && old_reuse->num_socks != 1) {
 		spin_unlock_bh(&reuseport_lock);
 		return -EBUSY;
 	}
 
-	if (reuse->num_socks == reuse->max_socks) {
+	if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
 		reuse = reuseport_grow(reuse);
 		if (!reuse) {
 			spin_unlock_bh(&reuseport_lock);
@@ -174,8 +209,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 
 	spin_unlock_bh(&reuseport_lock);
 
-	if (old_reuse)
+	if (old_reuse && old_reuse != reuse)
 		call_rcu(&old_reuse->rcu, reuseport_free_rcu);
+
 	return 0;
 }
 EXPORT_SYMBOL(reuseport_add_sock);
@@ -199,17 +235,34 @@ void reuseport_detach_sock(struct sock *sk)
 	 */
 	bpf_sk_reuseport_detach(sk);
 
-	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
+	if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state == TCP_LISTEN) {
+		i = reuseport_sock_index(reuse, sk, false);
+		if (i == -1)
+			goto out;
+
+		reuse->num_socks--;
+		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
-	for (i = 0; i < reuse->num_socks; i++) {
-		if (reuse->socks[i] == sk) {
-			reuse->socks[i] = reuse->socks[reuse->num_socks - 1];
-			reuse->num_socks--;
-			if (reuse->num_socks == 0)
-				call_rcu(&reuse->rcu, reuseport_free_rcu);
-			break;
+		if (sk->sk_protocol == IPPROTO_TCP) {
+			reuse->num_closed_socks++;
+			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
+		} else {
+			rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 		}
+	} else {
+		i = reuseport_sock_index(reuse, sk, true);
+		if (i == -1)
+			goto out;
+
+		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+		reuse->num_closed_socks--;
+
+		rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 	}
+
+	if (reuse->num_socks + reuse->num_closed_socks == 0)
+		call_rcu(&reuse->rcu, reuseport_free_rcu);
+out:
 	spin_unlock_bh(&reuseport_lock);
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f60869acbef0..1451aa9712b0 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 	bool reuse = sk->sk_reuse;
 	bool reuseport = !!sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
+	struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
 
 	/*
 	 * Unlike other sk lookup places we do not check
@@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 				if ((!relax ||
 				     (!reuseport_ok &&
 				      reuseport && sk2->sk_reuseport &&
-				      !rcu_access_pointer(sk->sk_reuseport_cb) &&
+				      (!reuseport_cb ||
+				       reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) &&
 				      (sk2->sk_state == TCP_TIME_WAIT ||
 				       uid_eq(uid, sock_i_uid(sk2))))) &&
 				    inet_rcv_saddr_equal(sk, sk2, true))
 					break;
 			} else if (!reuseport_ok ||
 				   !reuseport || !sk2->sk_reuseport ||
-				   rcu_access_pointer(sk->sk_reuseport_cb) ||
+				   (reuseport_cb &&
+				    reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) ||
 				   (sk2->sk_state != TCP_TIME_WAIT &&
 				    !uid_eq(uid, sock_i_uid(sk2)))) {
 				if (inet_rcv_saddr_equal(sk, sk2, true))
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 02/11] bpf: Define migration types for SO_REUSEPORT.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

As noted in the preceding commit, there are two migration types. In
addition to that, the kernel will run the same eBPF program to select a
listener for SYN packets.

This patch defines three types to signal the kernel and the eBPF program if
it is receiving a new request or migrating ESTABLISHED/SYN_RECV sockets in
the accept queue or NEW_SYN_RECV socket during 3WHS.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/uapi/linux/bpf.h       | 14 ++++++++++++++
 tools/include/uapi/linux/bpf.h | 14 ++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 162999b12790..85278deff439 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4380,6 +4380,20 @@ struct sk_msg_md {
 	__bpf_md_ptr(struct bpf_sock *, sk); /* current socket */
 };
 
+/* Migration type for SO_REUSEPORT enabled TCP sockets.
+ *
+ * BPF_SK_REUSEPORT_MIGRATE_NO      : Select a listener for SYN packets.
+ * BPF_SK_REUSEPORT_MIGRATE_QUEUE   : Migrate ESTABLISHED and SYN_RECV sockets in
+ *                                    the accept queue at close() or shutdown().
+ * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the
+ *                                    final ACK of 3WHS or retransmitting SYN+ACKs.
+ */
+enum {
+	BPF_SK_REUSEPORT_MIGRATE_NO,
+	BPF_SK_REUSEPORT_MIGRATE_QUEUE,
+	BPF_SK_REUSEPORT_MIGRATE_REQUEST,
+};
+
 struct sk_reuseport_md {
 	/*
 	 * Start of directly accessible data. It begins from
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 162999b12790..85278deff439 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4380,6 +4380,20 @@ struct sk_msg_md {
 	__bpf_md_ptr(struct bpf_sock *, sk); /* current socket */
 };
 
+/* Migration type for SO_REUSEPORT enabled TCP sockets.
+ *
+ * BPF_SK_REUSEPORT_MIGRATE_NO      : Select a listener for SYN packets.
+ * BPF_SK_REUSEPORT_MIGRATE_QUEUE   : Migrate ESTABLISHED and SYN_RECV sockets in
+ *                                    the accept queue at close() or shutdown().
+ * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the
+ *                                    final ACK of 3WHS or retransmitting SYN+ACKs.
+ */
+enum {
+	BPF_SK_REUSEPORT_MIGRATE_NO,
+	BPF_SK_REUSEPORT_MIGRATE_QUEUE,
+	BPF_SK_REUSEPORT_MIGRATE_REQUEST,
+};
+
 struct sk_reuseport_md {
 	/*
 	 * Start of directly accessible data. It begins from
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 02/11] bpf: Define migration types for SO_REUSEPORT Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 15:25   ` Eric Dumazet
                     ` (2 more replies)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This patch lets reuseport_detach_sock() return a pointer of struct sock,
which is used only by inet_unhash(). If it is not NULL,
inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
sockets from the closing listener to the selected one.

Listening sockets hold incoming connections as a linked list of struct
request_sock in the accept queue, and each request has reference to a full
socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
the requests from the closing listener's queue and relink them to the head
of the new listener's queue. We do not process each request and its
reference to the listener, so the migration completes in O(1) time
complexity. However, in the case of TCP_SYN_RECV sockets, we take special
care in the next commit.

By default, the kernel selects a new listener randomly. In order to pick
out a different socket every time, we select the last element of socks[] as
the new listener. This behaviour is based on how the kernel moves sockets
in socks[]. (See also [1])

Basically, in order to redistribute sockets evenly, we have to use an eBPF
program called in the later commit, but as the side effect of such default
selection, the kernel can redistribute old requests evenly to new listeners
for a specific case where the application replaces listeners by
generations.

For example, we call listen() for four sockets (A, B, C, D), and close the
first two by turns. The sockets move in socks[] like below.

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a
request (a, b, c, d) in their accept queue, we can redistribute old
requests evenly to new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they
MUST have the same settings at the socket API level; otherwise, unexpected
error may happen. For instance, if only the new listeners have
TCP_SAVE_SYN, old requests do not have SYN data, so the application will
face inconsistency and cause an error.

Therefore, if there are different kinds of sockets, we must attach an eBPF
program described in later commits.

Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/inet_connection_sock.h |  1 +
 include/net/sock_reuseport.h       |  2 +-
 net/core/sock_reuseport.c          | 10 +++++++++-
 net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c         |  9 +++++++--
 5 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..2ea2d743f8fc 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 				      struct request_sock *req,
 				      struct sock *child);
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
 				   unsigned long timeout);
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 0e558ca7afbf..09a1b1539d4c 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -31,7 +31,7 @@ struct sock_reuseport {
 extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
-extern void reuseport_detach_sock(struct sock *sk);
+extern struct sock *reuseport_detach_sock(struct sock *sk);
 extern struct sock *reuseport_select_sock(struct sock *sk,
 					  u32 hash,
 					  struct sk_buff *skb,
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index fd133516ac0e..60d7c1f28809 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 }
 EXPORT_SYMBOL(reuseport_add_sock);
 
-void reuseport_detach_sock(struct sock *sk)
+struct sock *reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
+	struct bpf_prog *prog;
+	struct sock *nsk = NULL;
 	int i;
 
 	spin_lock_bh(&reuseport_lock);
@@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
 
 		reuse->num_socks--;
 		reuse->socks[i] = reuse->socks[reuse->num_socks];
+		prog = rcu_dereference(reuse->prog);
 
 		if (sk->sk_protocol == IPPROTO_TCP) {
+			if (reuse->num_socks && !prog)
+				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+
 			reuse->num_closed_socks++;
 			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
 		} else {
@@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
 		call_rcu(&reuse->rcu, reuseport_free_rcu);
 out:
 	spin_unlock_bh(&reuseport_lock);
+
+	return nsk;
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1451aa9712b0..b27241ea96bd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
+{
+	struct request_sock_queue *old_accept_queue, *new_accept_queue;
+
+	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
+	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
+
+	spin_lock(&old_accept_queue->rskq_lock);
+	spin_lock(&new_accept_queue->rskq_lock);
+
+	if (old_accept_queue->rskq_accept_head) {
+		if (new_accept_queue->rskq_accept_head)
+			old_accept_queue->rskq_accept_tail->dl_next =
+				new_accept_queue->rskq_accept_head;
+		else
+			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
+
+		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
+		old_accept_queue->rskq_accept_head = NULL;
+		old_accept_queue->rskq_accept_tail = NULL;
+
+		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
+		WRITE_ONCE(sk->sk_ack_backlog, 0);
+	}
+
+	spin_unlock(&new_accept_queue->rskq_lock);
+	spin_unlock(&old_accept_queue->rskq_lock);
+}
+EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
+
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 					 struct request_sock *req, bool own_req)
 {
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 45fb450b4522..545538a6bfac 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 	struct inet_listen_hashbucket *ilb = NULL;
+	struct sock *nsk;
 	spinlock_t *lock;
 
 	if (sk_unhashed(sk))
@@ -696,8 +697,12 @@ void inet_unhash(struct sock *sk)
 	if (sk_unhashed(sk))
 		goto unlock;
 
-	if (rcu_access_pointer(sk->sk_reuseport_cb))
-		reuseport_detach_sock(sk);
+	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
+		nsk = reuseport_detach_sock(sk);
+		if (nsk)
+			inet_csk_reqsk_queue_migrate(sk, nsk);
+	}
+
 	if (ilb) {
 		inet_unhash2(hashinfo, sk);
 		ilb->count--;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 15:30   ` Eric Dumazet
  2020-12-01 14:44 ` [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

A TFO request socket is only freed after BOTH 3WHS has completed (or
aborted) and the child socket has been accepted (or its listener has been
closed). Hence, depending on the order, there can be two kinds of request
sockets in the accept queue.

  3WHS -> accept : TCP_ESTABLISHED
  accept -> 3WHS : TCP_SYN_RECV

Unlike TCP_ESTABLISHED socket, accept() does not free the request socket
for TCP_SYN_RECV socket. It is freed later at reqsk_fastopen_remove().
Also, it accesses request_sock.rsk_listener. So, in order to complete TFO
socket migration, we have to set the current listener to it at accept()
before reqsk_fastopen_remove().

Moreover, if TFO request caused RST before 3WHS has completed, it is held
in the listener's TFO queue to prevent DDoS attack. Thus, we also have to
migrate the requests in TFO queue.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/ipv4/inet_connection_sock.c | 35 ++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b27241ea96bd..361efe55b1ad 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 	    tcp_rsk(req)->tfo_listener) {
 		spin_lock_bh(&queue->fastopenq.lock);
 		if (tcp_rsk(req)->tfo_listener) {
+			if (req->rsk_listener != sk) {
+				/* TFO request was migrated to another listener so
+				 * the new listener must be used in reqsk_fastopen_remove()
+				 * to hold requests which cause RST.
+				 */
+				sock_put(req->rsk_listener);
+				sock_hold(sk);
+				req->rsk_listener = sk;
+			}
+
 			/* We are still waiting for the final ACK from 3WHS
 			 * so can't free req now. Instead, we set req->sk to
 			 * NULL to signify that the child socket is taken
@@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req,
 
 	if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) {
 		BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req);
-		BUG_ON(sk != req->rsk_listener);
 
 		/* Paranoid, to prevent race condition if
 		 * an inbound pkt destined for child is
@@ -995,6 +1004,7 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
 {
 	struct request_sock_queue *old_accept_queue, *new_accept_queue;
+	struct fastopen_queue *old_fastopenq, *new_fastopenq;
 
 	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
 	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
@@ -1019,6 +1029,29 @@ void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
 
 	spin_unlock(&new_accept_queue->rskq_lock);
 	spin_unlock(&old_accept_queue->rskq_lock);
+
+	old_fastopenq = &old_accept_queue->fastopenq;
+	new_fastopenq = &new_accept_queue->fastopenq;
+
+	spin_lock_bh(&old_fastopenq->lock);
+	spin_lock_bh(&new_fastopenq->lock);
+
+	new_fastopenq->qlen += old_fastopenq->qlen;
+	old_fastopenq->qlen = 0;
+
+	if (old_fastopenq->rskq_rst_head) {
+		if (new_fastopenq->rskq_rst_head)
+			old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head;
+		else
+			old_fastopenq->rskq_rst_tail = new_fastopenq->rskq_rst_tail;
+
+		new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head;
+		old_fastopenq->rskq_rst_head = NULL;
+		old_fastopenq->rskq_rst_tail = NULL;
+	}
+
+	spin_unlock_bh(&new_fastopenq->lock);
+	spin_unlock_bh(&old_fastopenq->lock);
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
 
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 15:13   ` Eric Dumazet
  2020-12-10  0:07   ` Martin KaFai Lau
  2020-12-01 14:44 ` [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This patch renames reuseport_select_sock() to __reuseport_select_sock() and
adds two wrapper function of it to pass the migration type defined in the
previous commit.

  reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
  reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST

As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
patch also changes the code to call reuseport_select_migrated_sock() even
if the listening socket is TCP_CLOSE. If we can pick out a listening socket
from the reuseport group, we rewrite request_sock.rsk_listener and resume
processing the request.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/inet_connection_sock.h | 12 +++++++++++
 include/net/request_sock.h         | 13 ++++++++++++
 include/net/sock_reuseport.h       |  8 +++----
 net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
 net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
 net/ipv4/tcp_ipv4.c                |  9 ++++++--
 net/ipv6/tcp_ipv6.c                |  9 ++++++--
 7 files changed, 81 insertions(+), 17 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ea2d743f8fc..1e0958f5eb21 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
 	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
 }
 
+static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
+						 struct sock *nsk,
+						 struct request_sock *req)
+{
+	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
+			     &inet_csk(nsk)->icsk_accept_queue,
+			     req);
+	sock_put(sk);
+	sock_hold(nsk);
+	req->rsk_listener = nsk;
+}
+
 static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
 {
 	return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..d18ba0b857cc 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue)
 	atomic_inc(&queue->qlen);
 }
 
+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+					struct request_sock_queue *new_accept_queue,
+					const struct request_sock *req)
+{
+	atomic_dec(&old_accept_queue->qlen);
+	atomic_inc(&new_accept_queue->qlen);
+
+	if (req->num_timeout == 0) {
+		atomic_dec(&old_accept_queue->young);
+		atomic_inc(&new_accept_queue->young);
+	}
+}
+
 static inline int reqsk_queue_len(const struct request_sock_queue *queue)
 {
 	return atomic_read(&queue->qlen);
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 09a1b1539d4c..a48259a974be 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,10 +32,10 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
 extern struct sock *reuseport_detach_sock(struct sock *sk);
-extern struct sock *reuseport_select_sock(struct sock *sk,
-					  u32 hash,
-					  struct sk_buff *skb,
-					  int hdr_len);
+extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+					  struct sk_buff *skb, int hdr_len);
+extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+						   struct sk_buff *skb);
 extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
 extern int reuseport_detach_prog(struct sock *sk);
 
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 60d7c1f28809..b4fe0829c9ab 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -202,7 +202,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 	}
 
 	reuse->socks[reuse->num_socks] = sk;
-	/* paired with smp_rmb() in reuseport_select_sock() */
+	/* paired with smp_rmb() in __reuseport_select_sock() */
 	smp_wmb();
 	reuse->num_socks++;
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
@@ -313,12 +313,13 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
  *  @hdr_len: BPF filter expects skb data pointer at payload data.  If
  *    the skb does not yet point at the payload, this parameter represents
  *    how far the pointer needs to advance to reach the payload.
+ *  @migration: represents if it is selecting a listener for SYN or
+ *    migrating ESTABLISHED/SYN_RECV sockets or NEW_SYN_RECV socket.
  *  Returns a socket that should receive the packet (or NULL on error).
  */
-struct sock *reuseport_select_sock(struct sock *sk,
-				   u32 hash,
-				   struct sk_buff *skb,
-				   int hdr_len)
+struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
+				     struct sk_buff *skb, int hdr_len,
+				     u8 migration)
 {
 	struct sock_reuseport *reuse;
 	struct bpf_prog *prog;
@@ -332,13 +333,19 @@ struct sock *reuseport_select_sock(struct sock *sk,
 	if (!reuse)
 		goto out;
 
-	prog = rcu_dereference(reuse->prog);
 	socks = READ_ONCE(reuse->num_socks);
 	if (likely(socks)) {
 		/* paired with smp_wmb() in reuseport_add_sock() */
 		smp_rmb();
 
-		if (!prog || !skb)
+		prog = rcu_dereference(reuse->prog);
+		if (!prog)
+			goto select_by_hash;
+
+		if (migration)
+			goto out;
+
+		if (!skb)
 			goto select_by_hash;
 
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
@@ -367,8 +374,21 @@ struct sock *reuseport_select_sock(struct sock *sk,
 	rcu_read_unlock();
 	return sk2;
 }
+
+struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+				   struct sk_buff *skb, int hdr_len)
+{
+	return __reuseport_select_sock(sk, hash, skb, hdr_len, BPF_SK_REUSEPORT_MIGRATE_NO);
+}
 EXPORT_SYMBOL(reuseport_select_sock);
 
+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+					    struct sk_buff *skb)
+{
+	return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
+
 int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
 {
 	struct sock_reuseport *reuse;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 361efe55b1ad..e71653c6eae2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
 	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
 
-	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
-		goto drop;
+	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
+		sk_listener = reuseport_select_migrated_sock(sk_listener,
+							     req_to_sk(req)->sk_hash, NULL);
+		if (!sk_listener) {
+			sk_listener = req->rsk_listener;
+			goto drop;
+		}
+		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
+		icsk = inet_csk(sk_listener);
+		queue = &icsk->icsk_accept_queue;
+	}
 
 	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
 	/* Normally all the openreqs are young and become mature
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e4b31e70bd30..9a9aa27c6069 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		/* We own a reference on the listener, increase it again
 		 * as we might lose it too soon.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 992cbf3eb9e3..ff11f3c0cb96 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		sock_hold(sk);
 		refcounted = true;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-02  2:04   ` Andrii Nakryiko
  2020-12-01 14:44 ` [PATCH v1 bpf-next 07/11] libbpf: Set expected_attach_type " Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
check if the attached eBPF program is capable of migrating sockets.

When the eBPF program is attached, the kernel runs it for socket migration
only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
The kernel will change the behaviour depending on the returned value:

  - SK_PASS with selected_sk, select it as a new listener
  - SK_PASS with selected_sk NULL, fall back to the random selection
  - SK_DROP, cancel the migration

Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/uapi/linux/bpf.h       | 2 ++
 kernel/bpf/syscall.c           | 8 ++++++++
 tools/include/uapi/linux/bpf.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 85278deff439..cfc207ae7782 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -241,6 +241,8 @@ enum bpf_attach_type {
 	BPF_XDP_CPUMAP,
 	BPF_SK_LOOKUP,
 	BPF_XDP,
+	BPF_SK_REUSEPORT_SELECT,
+	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f3fe9f53f93c..a0796a8de5ea 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		if (expected_attach_type == BPF_SK_LOOKUP)
 			return 0;
 		return -EINVAL;
+	case BPF_PROG_TYPE_SK_REUSEPORT:
+		switch (expected_attach_type) {
+		case BPF_SK_REUSEPORT_SELECT:
+		case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
 			return -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 85278deff439..cfc207ae7782 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -241,6 +241,8 @@ enum bpf_attach_type {
 	BPF_XDP_CPUMAP,
 	BPF_SK_LOOKUP,
 	BPF_XDP,
+	BPF_SK_REUSEPORT_SELECT,
+	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 07/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 08/11] bpf: Add migration to sk_reuseport_(kern|md) Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This commit introduces a new section (sk_reuseport/migrate) and sets
expected_attach_type to two each section in BPF_PROG_TYPE_SK_REUSEPORT
program.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 tools/lib/bpf/libbpf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 28baee7ba1ca..bbb3902a0e41 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8237,7 +8237,10 @@ static struct bpf_link *attach_iter(const struct bpf_sec_def *sec,
 
 static const struct bpf_sec_def section_defs[] = {
 	BPF_PROG_SEC("socket",			BPF_PROG_TYPE_SOCKET_FILTER),
-	BPF_PROG_SEC("sk_reuseport",		BPF_PROG_TYPE_SK_REUSEPORT),
+	BPF_EAPROG_SEC("sk_reuseport/migrate",	BPF_PROG_TYPE_SK_REUSEPORT,
+						BPF_SK_REUSEPORT_SELECT_OR_MIGRATE),
+	BPF_EAPROG_SEC("sk_reuseport",		BPF_PROG_TYPE_SK_REUSEPORT,
+						BPF_SK_REUSEPORT_SELECT),
 	SEC_DEF("kprobe/", KPROBE,
 		.attach_fn = attach_kprobe),
 	BPF_PROG_SEC("uprobe/",			BPF_PROG_TYPE_KPROBE),
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 08/11] bpf: Add migration to sk_reuseport_(kern|md).
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 07/11] libbpf: Set expected_attach_type " Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This patch adds u8 migration field to sk_reuseport_kern and sk_reuseport_md
to signal the eBPF program if the kernel calls it for selecting a listener
for SYN or migrating sockets in the accept queue or an immature socket
during 3WHS.

Note that this field is accessible only if the attached type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/linux/bpf.h            |  1 +
 include/linux/filter.h         |  4 ++--
 include/uapi/linux/bpf.h       |  1 +
 net/core/filter.c              | 15 ++++++++++++---
 net/core/sock_reuseport.c      |  2 +-
 tools/include/uapi/linux/bpf.h |  1 +
 6 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 581b2a2e78eb..244f823f1f84 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1897,6 +1897,7 @@ struct sk_reuseport_kern {
 	u32 hash;
 	u32 reuseport_id;
 	bool bind_inany;
+	u8 migration;
 };
 bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 				  struct bpf_insn_access_aux *info);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1b62397bd124..15d5bf13a905 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -967,12 +967,12 @@ void bpf_warn_invalid_xdp_action(u32 act);
 #ifdef CONFIG_INET
 struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 				  struct bpf_prog *prog, struct sk_buff *skb,
-				  u32 hash);
+				  u32 hash, u8 migration);
 #else
 static inline struct sock *
 bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 		     struct bpf_prog *prog, struct sk_buff *skb,
-		     u32 hash)
+		     u32 hash, u8 migration)
 {
 	return NULL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cfc207ae7782..efe342bf3dbc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4419,6 +4419,7 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	__u8 migration;		/* Migration type */
 };
 
 #define BPF_TAG_SIZE	8
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..0a0634787bb4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9853,7 +9853,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 				    struct sock_reuseport *reuse,
 				    struct sock *sk, struct sk_buff *skb,
-				    u32 hash)
+				    u32 hash, u8 migration)
 {
 	reuse_kern->skb = skb;
 	reuse_kern->sk = sk;
@@ -9862,16 +9862,17 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 	reuse_kern->hash = hash;
 	reuse_kern->reuseport_id = reuse->reuseport_id;
 	reuse_kern->bind_inany = reuse->bind_inany;
+	reuse_kern->migration = migration;
 }
 
 struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 				  struct bpf_prog *prog, struct sk_buff *skb,
-				  u32 hash)
+				  u32 hash, u8 migration)
 {
 	struct sk_reuseport_kern reuse_kern;
 	enum sk_action action;
 
-	bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash);
+	bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration);
 	action = BPF_PROG_RUN(prog, &reuse_kern);
 
 	if (action == SK_PASS)
@@ -10010,6 +10011,10 @@ sk_reuseport_is_valid_access(int off, int size,
 	case offsetof(struct sk_reuseport_md, hash):
 		return size == size_default;
 
+	case bpf_ctx_range(struct sk_reuseport_md, migration):
+		return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE &&
+			size == sizeof(__u8);
+
 	/* Fields that allow narrowing */
 	case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
 		if (size < sizeof_field(struct sk_buff, protocol))
@@ -10082,6 +10087,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct sk_reuseport_md, bind_inany):
 		SK_REUSEPORT_LOAD_FIELD(bind_inany);
 		break;
+
+	case offsetof(struct sk_reuseport_md, migration):
+		SK_REUSEPORT_LOAD_FIELD(migration);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index b4fe0829c9ab..96d65b4c6974 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -349,7 +349,7 @@ struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
 			goto select_by_hash;
 
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
-			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
+			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration);
 		else
 			sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len);
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index cfc207ae7782..efe342bf3dbc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4419,6 +4419,7 @@ struct sk_reuseport_md {
 	__u32 ip_protocol;	/* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
+	__u8 migration;		/* Migration type */
 };
 
 #define BPF_TAG_SIZE	8
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 08/11] bpf: Add migration to sk_reuseport_(kern|md) Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-04 19:58   ` Martin KaFai Lau
  2020-12-01 14:44 ` [PATCH v1 bpf-next 10/11] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
  10 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing in order to
select the new listener.

Currently, we can get a unique ID for each listener in the userspace by
calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.

This patch makes the sk pointer available in sk_reuseport_md so that we can
get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.

Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/uapi/linux/bpf.h       |  8 ++++++++
 net/core/filter.c              | 12 +++++++++++-
 tools/include/uapi/linux/bpf.h |  8 ++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index efe342bf3dbc..3e9b8bd42b4e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1650,6 +1650,13 @@ union bpf_attr {
  * 		A 8-byte long non-decreasing number on success, or 0 if the
  * 		socket field is missing inside *skb*.
  *
+ * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
+ * 	Description
+ * 		Equivalent to bpf_get_socket_cookie() helper that accepts
+ * 		*skb*, but gets socket from **struct bpf_sock** context.
+ * 	Return
+ * 		A 8-byte long non-decreasing number.
+ *
  * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
  * 	Description
  * 		Equivalent to bpf_get_socket_cookie() helper that accepts
@@ -4420,6 +4427,7 @@ struct sk_reuseport_md {
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
 	__u8 migration;		/* Migration type */
+	__bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */
 };
 
 #define BPF_TAG_SIZE	8
diff --git a/net/core/filter.c b/net/core/filter.c
index 0a0634787bb4..1059d31847ef 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4628,7 +4628,7 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = {
 	.func		= bpf_get_socket_cookie_sock,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg1_type	= ARG_PTR_TO_SOCKET,
 };
 
 BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx)
@@ -9982,6 +9982,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id,
 		return &sk_reuseport_load_bytes_proto;
 	case BPF_FUNC_skb_load_bytes_relative:
 		return &sk_reuseport_load_bytes_relative_proto;
+	case BPF_FUNC_get_socket_cookie:
+		return &bpf_get_socket_cookie_sock_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -10015,6 +10017,10 @@ sk_reuseport_is_valid_access(int off, int size,
 		return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE &&
 			size == sizeof(__u8);
 
+	case offsetof(struct sk_reuseport_md, sk):
+		info->reg_type = PTR_TO_SOCKET;
+		return size == sizeof(__u64);
+
 	/* Fields that allow narrowing */
 	case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
 		if (size < sizeof_field(struct sk_buff, protocol))
@@ -10091,6 +10097,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct sk_reuseport_md, migration):
 		SK_REUSEPORT_LOAD_FIELD(migration);
 		break;
+
+	case offsetof(struct sk_reuseport_md, sk):
+		SK_REUSEPORT_LOAD_FIELD(sk);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index efe342bf3dbc..3e9b8bd42b4e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1650,6 +1650,13 @@ union bpf_attr {
  * 		A 8-byte long non-decreasing number on success, or 0 if the
  * 		socket field is missing inside *skb*.
  *
+ * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
+ * 	Description
+ * 		Equivalent to bpf_get_socket_cookie() helper that accepts
+ * 		*skb*, but gets socket from **struct bpf_sock** context.
+ * 	Return
+ * 		A 8-byte long non-decreasing number.
+ *
  * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
  * 	Description
  * 		Equivalent to bpf_get_socket_cookie() helper that accepts
@@ -4420,6 +4427,7 @@ struct sk_reuseport_md {
 	__u32 bind_inany;	/* Is sock bound to an INANY address? */
 	__u32 hash;		/* A hash of the packet 4 tuples */
 	__u8 migration;		/* Migration type */
+	__bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */
 };
 
 #define BPF_TAG_SIZE	8
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 10/11] bpf: Call bpf_run_sk_reuseport() for socket migration.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-01 14:44 ` [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
  10 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This patch supports socket migration by eBPF. If the attached type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.

There are two noteworthy points. The first is that we select a listening
socket in reuseport_detach_sock() and __reuseport_select_sock(), but we do
not have struct skb at closing a listener or retransmitting a SYN+ACK.
However, some helper functions do not expect skb is NULL (e.g.
skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in
BPF_FUNC_skb_load_bytes_relative()). So, we allocate an empty skb
temporarily before running the eBPF program. The second is that we do not
have struct request_sock in unhash path, and the sk_hash of the listener is
always zero. Thus, we pass zero as hash to bpf_run_sk_reuseport().

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/core/filter.c          | 19 +++++++++++++++++++
 net/core/sock_reuseport.c  | 19 ++++++++++---------
 net/ipv4/inet_hashtables.c |  2 +-
 3 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 1059d31847ef..2f2fb77cdb72 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9871,10 +9871,29 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 {
 	struct sk_reuseport_kern reuse_kern;
 	enum sk_action action;
+	bool allocated = false;
+
+	if (migration) {
+		/* cancel migration for possibly incapable eBPF program */
+		if (prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)
+			return ERR_PTR(-ENOTSUPP);
+
+		if (!skb) {
+			allocated = true;
+			skb = alloc_skb(0, GFP_ATOMIC);
+			if (!skb)
+				return ERR_PTR(-ENOMEM);
+		}
+	} else if (!skb) {
+		return NULL; /* fall back to select by hash */
+	}
 
 	bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration);
 	action = BPF_PROG_RUN(prog, &reuse_kern);
 
+	if (allocated)
+		kfree_skb(skb);
+
 	if (action == SK_PASS)
 		return reuse_kern.selected_sk;
 	else
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 96d65b4c6974..6b475897b496 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -247,8 +247,15 @@ struct sock *reuseport_detach_sock(struct sock *sk)
 		prog = rcu_dereference(reuse->prog);
 
 		if (sk->sk_protocol == IPPROTO_TCP) {
-			if (reuse->num_socks && !prog)
-				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+			if (reuse->num_socks) {
+				if (prog)
+					nsk = bpf_run_sk_reuseport(reuse, sk, prog, NULL, 0,
+								   BPF_SK_REUSEPORT_MIGRATE_QUEUE);
+
+				if (!nsk)
+					nsk = i == reuse->num_socks ?
+						reuse->socks[i - 1] : reuse->socks[i];
+			}
 
 			reuse->num_closed_socks++;
 			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
@@ -342,15 +349,9 @@ struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
 		if (!prog)
 			goto select_by_hash;
 
-		if (migration)
-			goto out;
-
-		if (!skb)
-			goto select_by_hash;
-
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
 			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration);
-		else
+		else if (!skb)
 			sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len);
 
 select_by_hash:
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 545538a6bfac..59f58740c20d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -699,7 +699,7 @@ void inet_unhash(struct sock *sk)
 
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		nsk = reuseport_detach_sock(sk);
-		if (nsk)
+		if (!IS_ERR_OR_NULL(nsk))
 			inet_csk_reqsk_queue_migrate(sk, nsk);
 	}
 
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
  2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2020-12-01 14:44 ` [PATCH v1 bpf-next 10/11] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
@ 2020-12-01 14:44 ` Kuniyuki Iwashima
  2020-12-05  1:50   ` Martin KaFai Lau
  10 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-01 14:44 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, Kuniyuki Iwashima,
	osa-contribution-log, bpf, netdev, linux-kernel

This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 .../bpf/prog_tests/migrate_reuseport.c        | 164 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport_kern.c   |  54 ++++++
 2 files changed, 218 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c

diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
new file mode 100644
index 000000000000..87c72d9ccadd
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ *   1. call listen() for 5 server sockets.
+ *   2. update a map to migrate all child socket
+ *        to the last server socket (migrate_map[cookie] = 4)
+ *   3. call connect() for 25 client sockets.
+ *   4. call close() for first 4 server sockets.
+ *   5. call accept() for the last server socket.
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <linux/bpf.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define NUM_SOCKS 5
+#define LOCALHOST "127.0.0.1"
+#define err_exit(condition, message)			      \
+	do {						      \
+		if (condition) {			      \
+			perror("ERROR: " message " ");	      \
+			exit(1);			      \
+		}					      \
+	} while (0)
+
+__u64 server_fds[NUM_SOCKS];
+int prog_fd, reuseport_map_fd, migrate_map_fd;
+
+
+void setup_bpf(void)
+{
+	struct bpf_object *obj;
+	struct bpf_program *prog;
+	struct bpf_map *reuseport_map, *migrate_map;
+	int err;
+
+	obj = bpf_object__open("test_migrate_reuseport_kern.o");
+	err_exit(libbpf_get_error(obj), "opening BPF object file failed");
+
+	err = bpf_object__load(obj);
+	err_exit(err, "loading BPF object failed");
+
+	prog = bpf_program__next(NULL, obj);
+	err_exit(!prog, "loading BPF program failed");
+
+	reuseport_map = bpf_object__find_map_by_name(obj, "reuseport_map");
+	err_exit(!reuseport_map, "loading BPF reuseport_map failed");
+
+	migrate_map = bpf_object__find_map_by_name(obj, "migrate_map");
+	err_exit(!migrate_map, "loading BPF migrate_map failed");
+
+	prog_fd = bpf_program__fd(prog);
+	reuseport_map_fd = bpf_map__fd(reuseport_map);
+	migrate_map_fd = bpf_map__fd(migrate_map);
+}
+
+void test_listen(void)
+{
+	struct sockaddr_in addr;
+	socklen_t addr_len = sizeof(addr);
+	int i, err, optval = 1, migrated_to = NUM_SOCKS - 1;
+	__u64 value;
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(80);
+	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
+
+	for (i = 0; i < NUM_SOCKS; i++) {
+		server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+		err_exit(server_fds[i] == -1, "socket() for listener sockets failed");
+
+		err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT,
+				 &optval, sizeof(optval));
+		err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed");
+
+		if (i == 0) {
+			err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+					 &prog_fd, sizeof(prog_fd));
+			err_exit(err == -1, "setsockopt() for SO_ATTACH_REUSEPORT_EBPF failed");
+		}
+
+		err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len);
+		err_exit(err == -1, "bind() failed");
+
+		err = listen(server_fds[i], 32);
+		err_exit(err == -1, "listen() failed");
+
+		err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST);
+		err_exit(err == -1, "updating BPF reuseport_map failed");
+
+		err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value);
+		err_exit(err == -1, "looking up BPF reuseport_map failed");
+
+		printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, migrated_to);
+		err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST);
+		err_exit(err == -1, "updating BPF migrate_map failed");
+	}
+}
+
+void test_connect(void)
+{
+	struct sockaddr_in addr;
+	socklen_t addr_len = sizeof(addr);
+	int i, err, client_fd;
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(80);
+	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
+
+	for (i = 0; i < NUM_SOCKS * 5; i++) {
+		client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+		err_exit(client_fd == -1, "socket() for listener sockets failed");
+
+		err = connect(client_fd, (struct sockaddr *)&addr, addr_len);
+		err_exit(err == -1, "connect() failed");
+
+		close(client_fd);
+	}
+}
+
+void test_close(void)
+{
+	int i;
+
+	for (i = 0; i < NUM_SOCKS - 1; i++)
+		close(server_fds[i]);
+}
+
+void test_accept(void)
+{
+	struct sockaddr_in addr;
+	socklen_t addr_len = sizeof(addr);
+	int cnt, client_fd;
+
+	fcntl(server_fds[NUM_SOCKS - 1], F_SETFL, O_NONBLOCK);
+
+	for (cnt = 0; cnt < NUM_SOCKS * 5; cnt++) {
+		client_fd = accept(server_fds[NUM_SOCKS - 1], (struct sockaddr *)&addr, &addr_len);
+		err_exit(client_fd == -1, "accept() failed");
+	}
+
+	printf("%d accepted, %d is expected\n", cnt, NUM_SOCKS * 5);
+}
+
+int main(void)
+{
+	setup_bpf();
+	test_listen();
+	test_connect();
+	test_close();
+	test_accept();
+	close(server_fds[NUM_SOCKS - 1]);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
new file mode 100644
index 000000000000..28d007b3a7a7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ *   1. If reuse_md->migration is 0 (SYN packet),
+ *        return SK_PASS without selecting a listener.
+ *   2. If reuse_md->migration is not 0 (socket migration),
+ *        select a listener (reuseport_map[migrate_map[cookie]])
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+#define NULL ((void *)0)
+
+int _version SEC("version") = 1;
+
+struct bpf_map_def SEC("maps") reuseport_map = {
+	.type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(__u64),
+	.max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") migrate_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__u64),
+	.value_size = sizeof(int),
+	.max_entries = 256,
+};
+
+SEC("sk_reuseport/migrate")
+int select_by_skb_data(struct sk_reuseport_md *reuse_md)
+{
+	int *key, flags = 0;
+	__u64 cookie;
+
+	if (!reuse_md->migration)
+		return SK_PASS;
+
+	cookie = bpf_get_socket_cookie(reuse_md->sk);
+
+	key = bpf_map_lookup_elem(&migrate_map, &cookie);
+	if (key == NULL)
+		return SK_DROP;
+
+	bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags);
+
+	return SK_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
@ 2020-12-01 15:13   ` Eric Dumazet
  2020-12-03 14:12     ` Kuniyuki Iwashima
  2020-12-10  0:07   ` Martin KaFai Lau
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2020-12-01 15:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, osa-contribution-log,
	bpf, netdev, linux-kernel



On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> adds two wrapper function of it to pass the migration type defined in the
> previous commit.
> 
>   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
>   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> 
> As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> patch also changes the code to call reuseport_select_migrated_sock() even
> if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> from the reuseport group, we rewrite request_sock.rsk_listener and resume
> processing the request.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/net/inet_connection_sock.h | 12 +++++++++++
>  include/net/request_sock.h         | 13 ++++++++++++
>  include/net/sock_reuseport.h       |  8 +++----
>  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
>  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
>  net/ipv4/tcp_ipv4.c                |  9 ++++++--
>  net/ipv6/tcp_ipv6.c                |  9 ++++++--
>  7 files changed, 81 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 2ea2d743f8fc..1e0958f5eb21 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
>  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
>  }
>  
> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +						 struct sock *nsk,
> +						 struct request_sock *req)
> +{
> +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +			     &inet_csk(nsk)->icsk_accept_queue,
> +			     req);
> +	sock_put(sk);
> +	sock_hold(nsk);

This looks racy to me. nsk refcount might be zero at this point.

If you think it can _not_ be zero, please add a big comment here,
because this would mean something has been done before reaching this function,
and this sock_hold() would be not needed in the first place.

There is a good reason reqsk_alloc() is using refcount_inc_not_zero().

> +	req->rsk_listener = nsk;
> +}
> +

Honestly, this patch series looks quite complex, and finding a bug in the
very first function I am looking at is not really a good sign...




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
@ 2020-12-01 15:25   ` Eric Dumazet
  2020-12-03 14:14     ` Kuniyuki Iwashima
  2020-12-05  1:42   ` Martin KaFai Lau
  2020-12-08  6:54   ` Martin KaFai Lau
  2 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2020-12-01 15:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, osa-contribution-log,
	bpf, netdev, linux-kernel



On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> This patch lets reuseport_detach_sock() return a pointer of struct sock,
> which is used only by inet_unhash(). If it is not NULL,
> inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> sockets from the closing listener to the selected one.
> 
> Listening sockets hold incoming connections as a linked list of struct
> request_sock in the accept queue, and each request has reference to a full
> socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> the requests from the closing listener's queue and relink them to the head
> of the new listener's queue. We do not process each request and its
> reference to the listener, so the migration completes in O(1) time
> complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> care in the next commit.
> 
> By default, the kernel selects a new listener randomly. In order to pick
> out a different socket every time, we select the last element of socks[] as
> the new listener. This behaviour is based on how the kernel moves sockets
> in socks[]. (See also [1])
> 
> Basically, in order to redistribute sockets evenly, we have to use an eBPF
> program called in the later commit, but as the side effect of such default
> selection, the kernel can redistribute old requests evenly to new listeners
> for a specific case where the application replaces listeners by
> generations.
> 
> For example, we call listen() for four sockets (A, B, C, D), and close the
> first two by turns. The sockets move in socks[] like below.
> 
>   socks[0] : A <-.      socks[0] : D          socks[0] : D
>   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
>   socks[2] : C   |      socks[2] : C --'
>   socks[3] : D --'
> 
> Then, if C and D have newer settings than A and B, and each socket has a
> request (a, b, c, d) in their accept queue, we can redistribute old
> requests evenly to new listeners.
> 
>   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
>   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
>   socks[2] : C (c)   |      socks[2] : C (c) --'
>   socks[3] : D (d) --'
> 
> Here, (A, D) or (B, C) can have different application settings, but they
> MUST have the same settings at the socket API level; otherwise, unexpected
> error may happen. For instance, if only the new listeners have
> TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> face inconsistency and cause an error.
> 
> Therefore, if there are different kinds of sockets, we must attach an eBPF
> program described in later commits.
> 
> Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/net/inet_connection_sock.h |  1 +
>  include/net/sock_reuseport.h       |  2 +-
>  net/core/sock_reuseport.c          | 10 +++++++++-
>  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
>  net/ipv4/inet_hashtables.c         |  9 +++++++--
>  5 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7338b3865a2a..2ea2d743f8fc 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
>  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  				      struct request_sock *req,
>  				      struct sock *child);
> +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
>  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>  				   unsigned long timeout);
>  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> index 0e558ca7afbf..09a1b1539d4c 100644
> --- a/include/net/sock_reuseport.h
> +++ b/include/net/sock_reuseport.h
> @@ -31,7 +31,7 @@ struct sock_reuseport {
>  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
>  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
>  			      bool bind_inany);
> -extern void reuseport_detach_sock(struct sock *sk);
> +extern struct sock *reuseport_detach_sock(struct sock *sk);
>  extern struct sock *reuseport_select_sock(struct sock *sk,
>  					  u32 hash,
>  					  struct sk_buff *skb,
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index fd133516ac0e..60d7c1f28809 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
>  }
>  EXPORT_SYMBOL(reuseport_add_sock);
>  
> -void reuseport_detach_sock(struct sock *sk)
> +struct sock *reuseport_detach_sock(struct sock *sk)
>  {
>  	struct sock_reuseport *reuse;
> +	struct bpf_prog *prog;
> +	struct sock *nsk = NULL;
>  	int i;
>  
>  	spin_lock_bh(&reuseport_lock);
> @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
>  
>  		reuse->num_socks--;
>  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> +		prog = rcu_dereference(reuse->prog);
>  
>  		if (sk->sk_protocol == IPPROTO_TCP) {
> +			if (reuse->num_socks && !prog)
> +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> +
>  			reuse->num_closed_socks++;
>  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
>  		} else {
> @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
>  		call_rcu(&reuse->rcu, reuseport_free_rcu);
>  out:
>  	spin_unlock_bh(&reuseport_lock);
> +
> +	return nsk;
>  }
>  EXPORT_SYMBOL(reuseport_detach_sock);
>  
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 1451aa9712b0..b27241ea96bd 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  }
>  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>  
> +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> +{
> +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> +
> +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> +
> +	spin_lock(&old_accept_queue->rskq_lock);
> +	spin_lock(&new_accept_queue->rskq_lock);

Are you sure lockdep is happy with this ?

I would guess it should complain, because :

lock(A);
lock(B);
...
unlock(B);
unlock(A);

will fail when the opposite action happens eventually

lock(B);
lock(A);
...
unlock(A);
unlock(B);


> +
> +	if (old_accept_queue->rskq_accept_head) {
> +		if (new_accept_queue->rskq_accept_head)
> +			old_accept_queue->rskq_accept_tail->dl_next =
> +				new_accept_queue->rskq_accept_head;
> +		else
> +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> +
> +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> +		old_accept_queue->rskq_accept_head = NULL;
> +		old_accept_queue->rskq_accept_tail = NULL;
> +
> +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> +	}
> +
> +	spin_unlock(&new_accept_queue->rskq_lock);
> +	spin_unlock(&old_accept_queue->rskq_lock);
> +}
> +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);

I fail to understand how the kernel can run fine right after this patch, before following patches are merged.

All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
this is how we designed things (each request socket has a reference taken on the listener)

We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.

Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.

I feel the order of your patches is not correct.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
@ 2020-12-01 15:30   ` Eric Dumazet
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2020-12-01 15:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski,
	Eric Dumazet, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau
  Cc: Benjamin Herrenschmidt, Kuniyuki Iwashima, osa-contribution-log,
	bpf, netdev, linux-kernel



On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> A TFO request socket is only freed after BOTH 3WHS has completed (or
> aborted) and the child socket has been accepted (or its listener has been
> closed). Hence, depending on the order, there can be two kinds of request
> sockets in the accept queue.
> 
>   3WHS -> accept : TCP_ESTABLISHED
>   accept -> 3WHS : TCP_SYN_RECV
> 
> Unlike TCP_ESTABLISHED socket, accept() does not free the request socket
> for TCP_SYN_RECV socket. It is freed later at reqsk_fastopen_remove().
> Also, it accesses request_sock.rsk_listener. So, in order to complete TFO
> socket migration, we have to set the current listener to it at accept()
> before reqsk_fastopen_remove().
> 
> Moreover, if TFO request caused RST before 3WHS has completed, it is held
> in the listener's TFO queue to prevent DDoS attack. Thus, we also have to
> migrate the requests in TFO queue.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  net/ipv4/inet_connection_sock.c | 35 ++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index b27241ea96bd..361efe55b1ad 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
>  	    tcp_rsk(req)->tfo_listener) {
>  		spin_lock_bh(&queue->fastopenq.lock);
>  		if (tcp_rsk(req)->tfo_listener) {
> +			if (req->rsk_listener != sk) {
> +				/* TFO request was migrated to another listener so
> +				 * the new listener must be used in reqsk_fastopen_remove()
> +				 * to hold requests which cause RST.
> +				 */
> +				sock_put(req->rsk_listener);
> +				sock_hold(sk);
> +				req->rsk_listener = sk;
> +			}
> +
>  			/* We are still waiting for the final ACK from 3WHS
>  			 * so can't free req now. Instead, we set req->sk to
>  			 * NULL to signify that the child socket is taken
> @@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req,
>  
>  	if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) {
>  		BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req);
> -		BUG_ON(sk != req->rsk_listener);

>  
>  		/* Paranoid, to prevent race condition if
>  		 * an inbound pkt destined for child is
> @@ -995,6 +1004,7 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>  void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
>  {
>  	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> +	struct fastopen_queue *old_fastopenq, *new_fastopenq;
>  
>  	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
>  	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> @@ -1019,6 +1029,29 @@ void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
>  
>  	spin_unlock(&new_accept_queue->rskq_lock);
>  	spin_unlock(&old_accept_queue->rskq_lock);
> +
> +	old_fastopenq = &old_accept_queue->fastopenq;
> +	new_fastopenq = &new_accept_queue->fastopenq;
> +
> +	spin_lock_bh(&old_fastopenq->lock);
> +	spin_lock_bh(&new_fastopenq->lock);


Same remark about lockdep being not happy with this (I guess)

> +
> +	new_fastopenq->qlen += old_fastopenq->qlen;
> +	old_fastopenq->qlen = 0;
> +
> +	if (old_fastopenq->rskq_rst_head) {
> +		if (new_fastopenq->rskq_rst_head)
> +			old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head;
> +		else
> +			old_fastopenq->rskq_rst_tail = new_fastopenq->rskq_rst_tail;
> +
> +		new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head;
> +		old_fastopenq->rskq_rst_head = NULL;
> +		old_fastopenq->rskq_rst_tail = NULL;
> +	}
> +
> +	spin_unlock_bh(&new_fastopenq->lock);
> +	spin_unlock_bh(&old_fastopenq->lock);
>  }
>  EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
>  
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
@ 2020-12-02  2:04   ` Andrii Nakryiko
  2020-12-02 19:19     ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Andrii Nakryiko @ 2020-12-02  2:04 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Benjamin Herrenschmidt, Kuniyuki Iwashima, osa-contribution-log,
	bpf, Networking, open list

On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>
> This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> check if the attached eBPF program is capable of migrating sockets.
>
> When the eBPF program is attached, the kernel runs it for socket migration
> only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> The kernel will change the behaviour depending on the returned value:
>
>   - SK_PASS with selected_sk, select it as a new listener
>   - SK_PASS with selected_sk NULL, fall back to the random selection
>   - SK_DROP, cancel the migration
>
> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
> Suggested-by: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/uapi/linux/bpf.h       | 2 ++
>  kernel/bpf/syscall.c           | 8 ++++++++
>  tools/include/uapi/linux/bpf.h | 2 ++
>  3 files changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 85278deff439..cfc207ae7782 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -241,6 +241,8 @@ enum bpf_attach_type {
>         BPF_XDP_CPUMAP,
>         BPF_SK_LOOKUP,
>         BPF_XDP,
> +       BPF_SK_REUSEPORT_SELECT,
> +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index f3fe9f53f93c..a0796a8de5ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
>                 if (expected_attach_type == BPF_SK_LOOKUP)
>                         return 0;
>                 return -EINVAL;
> +       case BPF_PROG_TYPE_SK_REUSEPORT:
> +               switch (expected_attach_type) {
> +               case BPF_SK_REUSEPORT_SELECT:
> +               case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> +                       return 0;
> +               default:
> +                       return -EINVAL;
> +               }

this is a kernel regression, previously expected_attach_type wasn't
enforced, so user-space could have provided any number without an
error.

>         case BPF_PROG_TYPE_EXT:
>                 if (expected_attach_type)
>                         return -EINVAL;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 85278deff439..cfc207ae7782 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -241,6 +241,8 @@ enum bpf_attach_type {
>         BPF_XDP_CPUMAP,
>         BPF_SK_LOOKUP,
>         BPF_XDP,
> +       BPF_SK_REUSEPORT_SELECT,
> +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> --
> 2.17.2 (Apple Git-113)
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-02  2:04   ` Andrii Nakryiko
@ 2020-12-02 19:19     ` Martin KaFai Lau
  2020-12-03  4:24       ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-02 19:19 UTC (permalink / raw)
  To: Andrii Nakryiko, Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, Networking,
	open list

On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > check if the attached eBPF program is capable of migrating sockets.
> >
> > When the eBPF program is attached, the kernel runs it for socket migration
> > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > The kernel will change the behaviour depending on the returned value:
> >
> >   - SK_PASS with selected_sk, select it as a new listener
> >   - SK_PASS with selected_sk NULL, fall back to the random selection
> >   - SK_DROP, cancel the migration
> >
> > Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
> > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/uapi/linux/bpf.h       | 2 ++
> >  kernel/bpf/syscall.c           | 8 ++++++++
> >  tools/include/uapi/linux/bpf.h | 2 ++
> >  3 files changed, 12 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 85278deff439..cfc207ae7782 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> >         BPF_XDP_CPUMAP,
> >         BPF_SK_LOOKUP,
> >         BPF_XDP,
> > +       BPF_SK_REUSEPORT_SELECT,
> > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index f3fe9f53f93c..a0796a8de5ea 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
> >                 if (expected_attach_type == BPF_SK_LOOKUP)
> >                         return 0;
> >                 return -EINVAL;
> > +       case BPF_PROG_TYPE_SK_REUSEPORT:
> > +               switch (expected_attach_type) {
> > +               case BPF_SK_REUSEPORT_SELECT:
> > +               case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > +                       return 0;
> > +               default:
> > +                       return -EINVAL;
> > +               }
> 
> this is a kernel regression, previously expected_attach_type wasn't
> enforced, so user-space could have provided any number without an
> error.
I also think this change alone will break things like when the usual
attr->expected_attach_type == 0 case.  At least changes is needed in
bpf_prog_load_fixup_attach_type() which is also handling a
similar situation for BPF_PROG_TYPE_CGROUP_SOCK.

I now think there is no need to expose new bpf_attach_type to the UAPI.
Since the prog->expected_attach_type is not used, it can be cleared at load time
and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
internally at filter.[c|h]) in the is_valid_access() when "migration"
is accessed.  When "migration" is accessed, the bpf prog can handle
migration (and the original not-migration) case.

> 
> >         case BPF_PROG_TYPE_EXT:
> >                 if (expected_attach_type)
> >                         return -EINVAL;
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 85278deff439..cfc207ae7782 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> >         BPF_XDP_CPUMAP,
> >         BPF_SK_LOOKUP,
> >         BPF_XDP,
> > +       BPF_SK_REUSEPORT_SELECT,
> > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > --
> > 2.17.2 (Apple Git-113)
> >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-02 19:19     ` Martin KaFai Lau
@ 2020-12-03  4:24       ` Martin KaFai Lau
  2020-12-03 14:16         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-03  4:24 UTC (permalink / raw)
  To: Andrii Nakryiko, Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, Networking,
	open list

On Wed, Dec 02, 2020 at 11:19:02AM -0800, Martin KaFai Lau wrote:
> On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> > On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > >
> > > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > > check if the attached eBPF program is capable of migrating sockets.
> > >
> > > When the eBPF program is attached, the kernel runs it for socket migration
> > > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > > The kernel will change the behaviour depending on the returned value:
> > >
> > >   - SK_PASS with selected_sk, select it as a new listener
> > >   - SK_PASS with selected_sk NULL, fall back to the random selection
> > >   - SK_DROP, cancel the migration
> > >
> > > Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
> > > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > ---
> > >  include/uapi/linux/bpf.h       | 2 ++
> > >  kernel/bpf/syscall.c           | 8 ++++++++
> > >  tools/include/uapi/linux/bpf.h | 2 ++
> > >  3 files changed, 12 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 85278deff439..cfc207ae7782 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > >         BPF_XDP_CPUMAP,
> > >         BPF_SK_LOOKUP,
> > >         BPF_XDP,
> > > +       BPF_SK_REUSEPORT_SELECT,
> > > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > >         __MAX_BPF_ATTACH_TYPE
> > >  };
> > >
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index f3fe9f53f93c..a0796a8de5ea 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
> > >                 if (expected_attach_type == BPF_SK_LOOKUP)
> > >                         return 0;
> > >                 return -EINVAL;
> > > +       case BPF_PROG_TYPE_SK_REUSEPORT:
> > > +               switch (expected_attach_type) {
> > > +               case BPF_SK_REUSEPORT_SELECT:
> > > +               case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > > +                       return 0;
> > > +               default:
> > > +                       return -EINVAL;
> > > +               }
> > 
> > this is a kernel regression, previously expected_attach_type wasn't
> > enforced, so user-space could have provided any number without an
> > error.
> I also think this change alone will break things like when the usual
> attr->expected_attach_type == 0 case.  At least changes is needed in
> bpf_prog_load_fixup_attach_type() which is also handling a
> similar situation for BPF_PROG_TYPE_CGROUP_SOCK.
> 
> I now think there is no need to expose new bpf_attach_type to the UAPI.
> Since the prog->expected_attach_type is not used, it can be cleared at load time
> and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
> internally at filter.[c|h]) in the is_valid_access() when "migration"
> is accessed.  When "migration" is accessed, the bpf prog can handle
> migration (and the original not-migration) case.
Scrap this internal only BPF_SK_REUSEPORT_SELECT_OR_MIGRATE idea.
I think there will be cases that bpf prog wants to do both
without accessing any field from sk_reuseport_md.

Lets go back to the discussion on using a similar
idea as BPF_PROG_TYPE_CGROUP_SOCK in bpf_prog_load_fixup_attach_type().
I am not aware there is loader setting a random number
in expected_attach_type, so the chance of breaking
is very low.  There was a similar discussion earlier [0].

[0]: https://lore.kernel.org/netdev/20200126045443.f47dzxdglazzchfm@ast-mbp/

> 
> > 
> > >         case BPF_PROG_TYPE_EXT:
> > >                 if (expected_attach_type)
> > >                         return -EINVAL;
> > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > index 85278deff439..cfc207ae7782 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > >         BPF_XDP_CPUMAP,
> > >         BPF_SK_LOOKUP,
> > >         BPF_XDP,
> > > +       BPF_SK_REUSEPORT_SELECT,
> > > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > >         __MAX_BPF_ATTACH_TYPE
> > >  };
> > >
> > > --
> > > 2.17.2 (Apple Git-113)
> > >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-01 15:13   ` Eric Dumazet
@ 2020-12-03 14:12     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-03 14:12 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ast, benh, bpf, daniel, davem, edumazet, kafai, kuba, kuni1840,
	kuniyu, linux-kernel, netdev

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Tue, 1 Dec 2020 16:13:39 +0100
> On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > adds two wrapper function of it to pass the migration type defined in the
> > previous commit.
> > 
> >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > 
> > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > patch also changes the code to call reuseport_select_migrated_sock() even
> > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > processing the request.
> > 
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/net/inet_connection_sock.h | 12 +++++++++++
> >  include/net/request_sock.h         | 13 ++++++++++++
> >  include/net/sock_reuseport.h       |  8 +++----
> >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> >  7 files changed, 81 insertions(+), 17 deletions(-)
> > 
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index 2ea2d743f8fc..1e0958f5eb21 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> >  }
> >  
> > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > +						 struct sock *nsk,
> > +						 struct request_sock *req)
> > +{
> > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > +			     &inet_csk(nsk)->icsk_accept_queue,
> > +			     req);
> > +	sock_put(sk);
> > +	sock_hold(nsk);
> 
> This looks racy to me. nsk refcount might be zero at this point.
> 
> If you think it can _not_ be zero, please add a big comment here,
> because this would mean something has been done before reaching this function,
> and this sock_hold() would be not needed in the first place.
> 
> There is a good reason reqsk_alloc() is using refcount_inc_not_zero().

Exactly, I will fix this in the next spin like below.
Thank you.

---8<---
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 1e0958f5eb21..d8c3be31e987 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -280,7 +280,6 @@ static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
                             &inet_csk(nsk)->icsk_accept_queue,
                             req);
        sock_put(sk);
-       sock_hold(nsk);
        req->rsk_listener = nsk;
 }
 
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 6b475897b496..4d07bddcf678 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -386,7 +386,14 @@ EXPORT_SYMBOL(reuseport_select_sock);
 struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
                                            struct sk_buff *skb)
 {
-       return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+       struct sock *nsk;
+
+       nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+       if (IS_ERR_OR_NULL(nsk) ||
+           unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt)))
+               return NULL;
+
+       return nsk;
 }
 EXPORT_SYMBOL(reuseport_select_migrated_sock);
 
---8<---


> > +	req->rsk_listener = nsk;
> > +}
> > +
> 
> Honestly, this patch series looks quite complex, and finding a bug in the
> very first function I am looking at is not really a good sign...

I also think this issue is quite complex, but it might be easier to fix
than it was disscussed in 2015 [1] thanks to your some refactoring.

https://lore.kernel.org/netdev/1443313848-751-1-git-send-email-tolga.ceylan@gmail.com/

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-01 15:25   ` Eric Dumazet
@ 2020-12-03 14:14     ` Kuniyuki Iwashima
  2020-12-03 14:31       ` Eric Dumazet
  2020-12-07 20:33       ` Martin KaFai Lau
  0 siblings, 2 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-03 14:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ast, benh, bpf, daniel, davem, edumazet, kafai, kuba, kuni1840,
	kuniyu, linux-kernel, netdev

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Tue, 1 Dec 2020 16:25:51 +0100
> On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > which is used only by inet_unhash(). If it is not NULL,
> > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > sockets from the closing listener to the selected one.
> > 
> > Listening sockets hold incoming connections as a linked list of struct
> > request_sock in the accept queue, and each request has reference to a full
> > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > the requests from the closing listener's queue and relink them to the head
> > of the new listener's queue. We do not process each request and its
> > reference to the listener, so the migration completes in O(1) time
> > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > care in the next commit.
> > 
> > By default, the kernel selects a new listener randomly. In order to pick
> > out a different socket every time, we select the last element of socks[] as
> > the new listener. This behaviour is based on how the kernel moves sockets
> > in socks[]. (See also [1])
> > 
> > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > program called in the later commit, but as the side effect of such default
> > selection, the kernel can redistribute old requests evenly to new listeners
> > for a specific case where the application replaces listeners by
> > generations.
> > 
> > For example, we call listen() for four sockets (A, B, C, D), and close the
> > first two by turns. The sockets move in socks[] like below.
> > 
> >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> >   socks[2] : C   |      socks[2] : C --'
> >   socks[3] : D --'
> > 
> > Then, if C and D have newer settings than A and B, and each socket has a
> > request (a, b, c, d) in their accept queue, we can redistribute old
> > requests evenly to new listeners.
> > 
> >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> >   socks[2] : C (c)   |      socks[2] : C (c) --'
> >   socks[3] : D (d) --'
> > 
> > Here, (A, D) or (B, C) can have different application settings, but they
> > MUST have the same settings at the socket API level; otherwise, unexpected
> > error may happen. For instance, if only the new listeners have
> > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > face inconsistency and cause an error.
> > 
> > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > program described in later commits.
> > 
> > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/net/inet_connection_sock.h |  1 +
> >  include/net/sock_reuseport.h       |  2 +-
> >  net/core/sock_reuseport.c          | 10 +++++++++-
> >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> >  5 files changed, 48 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index 7338b3865a2a..2ea2d743f8fc 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> >  				      struct request_sock *req,
> >  				      struct sock *child);
> > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> >  				   unsigned long timeout);
> >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > index 0e558ca7afbf..09a1b1539d4c 100644
> > --- a/include/net/sock_reuseport.h
> > +++ b/include/net/sock_reuseport.h
> > @@ -31,7 +31,7 @@ struct sock_reuseport {
> >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> >  			      bool bind_inany);
> > -extern void reuseport_detach_sock(struct sock *sk);
> > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> >  extern struct sock *reuseport_select_sock(struct sock *sk,
> >  					  u32 hash,
> >  					  struct sk_buff *skb,
> > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > index fd133516ac0e..60d7c1f28809 100644
> > --- a/net/core/sock_reuseport.c
> > +++ b/net/core/sock_reuseport.c
> > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> >  }
> >  EXPORT_SYMBOL(reuseport_add_sock);
> >  
> > -void reuseport_detach_sock(struct sock *sk)
> > +struct sock *reuseport_detach_sock(struct sock *sk)
> >  {
> >  	struct sock_reuseport *reuse;
> > +	struct bpf_prog *prog;
> > +	struct sock *nsk = NULL;
> >  	int i;
> >  
> >  	spin_lock_bh(&reuseport_lock);
> > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> >  
> >  		reuse->num_socks--;
> >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > +		prog = rcu_dereference(reuse->prog);
> >  
> >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > +			if (reuse->num_socks && !prog)
> > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > +
> >  			reuse->num_closed_socks++;
> >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> >  		} else {
> > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> >  out:
> >  	spin_unlock_bh(&reuseport_lock);
> > +
> > +	return nsk;
> >  }
> >  EXPORT_SYMBOL(reuseport_detach_sock);
> >  
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 1451aa9712b0..b27241ea96bd 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> >  }
> >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> >  
> > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > +{
> > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > +
> > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > +
> > +	spin_lock(&old_accept_queue->rskq_lock);
> > +	spin_lock(&new_accept_queue->rskq_lock);
> 
> Are you sure lockdep is happy with this ?
> 
> I would guess it should complain, because :
> 
> lock(A);
> lock(B);
> ...
> unlock(B);
> unlock(A);
> 
> will fail when the opposite action happens eventually
> 
> lock(B);
> lock(A);
> ...
> unlock(A);
> unlock(B);

I enabled lockdep and did not see warnings of lockdep.

Also, the inversion deadlock does not happen in this case.
In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
from the eBPF map, so the old listener will not be selected as the new
listener.


> > +
> > +	if (old_accept_queue->rskq_accept_head) {
> > +		if (new_accept_queue->rskq_accept_head)
> > +			old_accept_queue->rskq_accept_tail->dl_next =
> > +				new_accept_queue->rskq_accept_head;
> > +		else
> > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > +
> > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > +		old_accept_queue->rskq_accept_head = NULL;
> > +		old_accept_queue->rskq_accept_tail = NULL;
> > +
> > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > +	}
> > +
> > +	spin_unlock(&new_accept_queue->rskq_lock);
> > +	spin_unlock(&old_accept_queue->rskq_lock);
> > +}
> > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> 
> I fail to understand how the kernel can run fine right after this patch, before following patches are merged.

I will squash the two or reorganize them into definition part and migration
part.


> All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> this is how we designed things (each request socket has a reference taken on the listener)
> 
> We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> 
> Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> 
> I feel the order of your patches is not correct.

I understand this series is against the design.
But once the requests sockets are added in the queue, they are accessed
from the accept queue, and then we have the correct listener and can
rewirte rsk_listener. Otherwise, their full socket are accessed instead.

Also, as far as I know, such BUG_ON was only in inet_child_forget().

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-03  4:24       ` Martin KaFai Lau
@ 2020-12-03 14:16         ` Kuniyuki Iwashima
  2020-12-04  5:56           ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-03 14:16 UTC (permalink / raw)
  To: kafai
  Cc: andrii.nakryiko, ast, benh, bpf, daniel, davem, edumazet, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 2 Dec 2020 20:24:02 -0800
> On Wed, Dec 02, 2020 at 11:19:02AM -0800, Martin KaFai Lau wrote:
> > On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> > > On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > > >
> > > > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > > > check if the attached eBPF program is capable of migrating sockets.
> > > >
> > > > When the eBPF program is attached, the kernel runs it for socket migration
> > > > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > > > The kernel will change the behaviour depending on the returned value:
> > > >
> > > >   - SK_PASS with selected_sk, select it as a new listener
> > > >   - SK_PASS with selected_sk NULL, fall back to the random selection
> > > >   - SK_DROP, cancel the migration
> > > >
> > > > Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
> > > > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > ---
> > > >  include/uapi/linux/bpf.h       | 2 ++
> > > >  kernel/bpf/syscall.c           | 8 ++++++++
> > > >  tools/include/uapi/linux/bpf.h | 2 ++
> > > >  3 files changed, 12 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 85278deff439..cfc207ae7782 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > >         BPF_XDP_CPUMAP,
> > > >         BPF_SK_LOOKUP,
> > > >         BPF_XDP,
> > > > +       BPF_SK_REUSEPORT_SELECT,
> > > > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > >         __MAX_BPF_ATTACH_TYPE
> > > >  };
> > > >
> > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > index f3fe9f53f93c..a0796a8de5ea 100644
> > > > --- a/kernel/bpf/syscall.c
> > > > +++ b/kernel/bpf/syscall.c
> > > > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
> > > >                 if (expected_attach_type == BPF_SK_LOOKUP)
> > > >                         return 0;
> > > >                 return -EINVAL;
> > > > +       case BPF_PROG_TYPE_SK_REUSEPORT:
> > > > +               switch (expected_attach_type) {
> > > > +               case BPF_SK_REUSEPORT_SELECT:
> > > > +               case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > > > +                       return 0;
> > > > +               default:
> > > > +                       return -EINVAL;
> > > > +               }
> > > 
> > > this is a kernel regression, previously expected_attach_type wasn't
> > > enforced, so user-space could have provided any number without an
> > > error.
> > I also think this change alone will break things like when the usual
> > attr->expected_attach_type == 0 case.  At least changes is needed in
> > bpf_prog_load_fixup_attach_type() which is also handling a
> > similar situation for BPF_PROG_TYPE_CGROUP_SOCK.
> > 
> > I now think there is no need to expose new bpf_attach_type to the UAPI.
> > Since the prog->expected_attach_type is not used, it can be cleared at load time
> > and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
> > internally at filter.[c|h]) in the is_valid_access() when "migration"
> > is accessed.  When "migration" is accessed, the bpf prog can handle
> > migration (and the original not-migration) case.
> Scrap this internal only BPF_SK_REUSEPORT_SELECT_OR_MIGRATE idea.
> I think there will be cases that bpf prog wants to do both
> without accessing any field from sk_reuseport_md.
> 
> Lets go back to the discussion on using a similar
> idea as BPF_PROG_TYPE_CGROUP_SOCK in bpf_prog_load_fixup_attach_type().
> I am not aware there is loader setting a random number
> in expected_attach_type, so the chance of breaking
> is very low.  There was a similar discussion earlier [0].
> 
> [0]: https://lore.kernel.org/netdev/20200126045443.f47dzxdglazzchfm@ast-mbp/

Thank you for the idea and reference.

I will remove the change in bpf_prog_load_check_attach() and set the
default value (BPF_SK_REUSEPORT_SELECT) in bpf_prog_load_fixup_attach_type()
for backward compatibility if expected_attach_type is 0.


> > > >         case BPF_PROG_TYPE_EXT:
> > > >                 if (expected_attach_type)
> > > >                         return -EINVAL;
> > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > index 85278deff439..cfc207ae7782 100644
> > > > --- a/tools/include/uapi/linux/bpf.h
> > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > >         BPF_XDP_CPUMAP,
> > > >         BPF_SK_LOOKUP,
> > > >         BPF_XDP,
> > > > +       BPF_SK_REUSEPORT_SELECT,
> > > > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > >         __MAX_BPF_ATTACH_TYPE
> > > >  };
> > > >
> > > > --
> > > > 2.17.2 (Apple Git-113)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-03 14:14     ` Kuniyuki Iwashima
@ 2020-12-03 14:31       ` Eric Dumazet
  2020-12-03 15:41         ` Kuniyuki Iwashima
  2020-12-07 20:33       ` Martin KaFai Lau
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2020-12-03 14:31 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Eric Dumazet, Alexei Starovoitov, Benjamin Herrenschmidt, bpf,
	Daniel Borkmann, David Miller, Martin KaFai Lau, Jakub Kicinski,
	Kuniyuki Iwashima, LKML, netdev

On Thu, Dec 3, 2020 at 3:14 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>
> From:   Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > which is used only by inet_unhash(). If it is not NULL,
> > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > sockets from the closing listener to the selected one.
> > >
> > > Listening sockets hold incoming connections as a linked list of struct
> > > request_sock in the accept queue, and each request has reference to a full
> > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > the requests from the closing listener's queue and relink them to the head
> > > of the new listener's queue. We do not process each request and its
> > > reference to the listener, so the migration completes in O(1) time
> > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > care in the next commit.
> > >
> > > By default, the kernel selects a new listener randomly. In order to pick
> > > out a different socket every time, we select the last element of socks[] as
> > > the new listener. This behaviour is based on how the kernel moves sockets
> > > in socks[]. (See also [1])
> > >
> > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > program called in the later commit, but as the side effect of such default
> > > selection, the kernel can redistribute old requests evenly to new listeners
> > > for a specific case where the application replaces listeners by
> > > generations.
> > >
> > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > first two by turns. The sockets move in socks[] like below.
> > >
> > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > >   socks[2] : C   |      socks[2] : C --'
> > >   socks[3] : D --'
> > >
> > > Then, if C and D have newer settings than A and B, and each socket has a
> > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > requests evenly to new listeners.
> > >
> > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > >   socks[3] : D (d) --'
> > >
> > > Here, (A, D) or (B, C) can have different application settings, but they
> > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > error may happen. For instance, if only the new listeners have
> > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > face inconsistency and cause an error.
> > >
> > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > program described in later commits.
> > >
> > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > ---
> > >  include/net/inet_connection_sock.h |  1 +
> > >  include/net/sock_reuseport.h       |  2 +-
> > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >                                   struct request_sock *req,
> > >                                   struct sock *child);
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > >                                unsigned long timeout);
> > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > --- a/include/net/sock_reuseport.h
> > > +++ b/include/net/sock_reuseport.h
> > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > >                           bool bind_inany);
> > > -extern void reuseport_detach_sock(struct sock *sk);
> > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > >                                       u32 hash,
> > >                                       struct sk_buff *skb,
> > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > index fd133516ac0e..60d7c1f28809 100644
> > > --- a/net/core/sock_reuseport.c
> > > +++ b/net/core/sock_reuseport.c
> > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > >  }
> > >  EXPORT_SYMBOL(reuseport_add_sock);
> > >
> > > -void reuseport_detach_sock(struct sock *sk)
> > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > >  {
> > >     struct sock_reuseport *reuse;
> > > +   struct bpf_prog *prog;
> > > +   struct sock *nsk = NULL;
> > >     int i;
> > >
> > >     spin_lock_bh(&reuseport_lock);
> > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > >
> > >             reuse->num_socks--;
> > >             reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > +           prog = rcu_dereference(reuse->prog);
> > >
> > >             if (sk->sk_protocol == IPPROTO_TCP) {
> > > +                   if (reuse->num_socks && !prog)
> > > +                           nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > +
> > >                     reuse->num_closed_socks++;
> > >                     reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > >             } else {
> > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > >             call_rcu(&reuse->rcu, reuseport_free_rcu);
> > >  out:
> > >     spin_unlock_bh(&reuseport_lock);
> > > +
> > > +   return nsk;
> > >  }
> > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > >
> > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > index 1451aa9712b0..b27241ea96bd 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >  }
> > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > >
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > +{
> > > +   struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > +
> > > +   old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > +   new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > +
> > > +   spin_lock(&old_accept_queue->rskq_lock);
> > > +   spin_lock(&new_accept_queue->rskq_lock);
> >
> > Are you sure lockdep is happy with this ?
> >
> > I would guess it should complain, because :
> >
> > lock(A);
> > lock(B);
> > ...
> > unlock(B);
> > unlock(A);
> >
> > will fail when the opposite action happens eventually
> >
> > lock(B);
> > lock(A);
> > ...
> > unlock(A);
> > unlock(B);
>
> I enabled lockdep and did not see warnings of lockdep.
>
> Also, the inversion deadlock does not happen in this case.
> In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> from the eBPF map, so the old listener will not be selected as the new
> listener.

Until the socket is closed, reallocated and used again. LOCKDEP has no
idea about soreuseport logic.

If you run your tests long enough, lockdep should complain at some point.

git grep -n double_lock

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-03 14:31       ` Eric Dumazet
@ 2020-12-03 15:41         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-03 15:41 UTC (permalink / raw)
  To: edumazet
  Cc: ast, benh, bpf, daniel, davem, eric.dumazet, kafai, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Eric Dumazet <edumazet@google.com>
Date:   Thu, 3 Dec 2020 15:31:53 +0100
> On Thu, Dec 3, 2020 at 3:14 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > which is used only by inet_unhash(). If it is not NULL,
> > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > sockets from the closing listener to the selected one.
> > > >
> > > > Listening sockets hold incoming connections as a linked list of struct
> > > > request_sock in the accept queue, and each request has reference to a full
> > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > the requests from the closing listener's queue and relink them to the head
> > > > of the new listener's queue. We do not process each request and its
> > > > reference to the listener, so the migration completes in O(1) time
> > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > care in the next commit.
> > > >
> > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > out a different socket every time, we select the last element of socks[] as
> > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > in socks[]. (See also [1])
> > > >
> > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > program called in the later commit, but as the side effect of such default
> > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > for a specific case where the application replaces listeners by
> > > > generations.
> > > >
> > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > first two by turns. The sockets move in socks[] like below.
> > > >
> > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > >   socks[2] : C   |      socks[2] : C --'
> > > >   socks[3] : D --'
> > > >
> > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > requests evenly to new listeners.
> > > >
> > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > >   socks[3] : D (d) --'
> > > >
> > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > error may happen. For instance, if only the new listeners have
> > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > face inconsistency and cause an error.
> > > >
> > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > program described in later commits.
> > > >
> > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > ---
> > > >  include/net/inet_connection_sock.h |  1 +
> > > >  include/net/sock_reuseport.h       |  2 +-
> > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > --- a/include/net/inet_connection_sock.h
> > > > +++ b/include/net/inet_connection_sock.h
> > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > >                                   struct request_sock *req,
> > > >                                   struct sock *child);
> > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > >                                unsigned long timeout);
> > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > --- a/include/net/sock_reuseport.h
> > > > +++ b/include/net/sock_reuseport.h
> > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > >                           bool bind_inany);
> > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > >                                       u32 hash,
> > > >                                       struct sk_buff *skb,
> > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > index fd133516ac0e..60d7c1f28809 100644
> > > > --- a/net/core/sock_reuseport.c
> > > > +++ b/net/core/sock_reuseport.c
> > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > >  }
> > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > >
> > > > -void reuseport_detach_sock(struct sock *sk)
> > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > >  {
> > > >     struct sock_reuseport *reuse;
> > > > +   struct bpf_prog *prog;
> > > > +   struct sock *nsk = NULL;
> > > >     int i;
> > > >
> > > >     spin_lock_bh(&reuseport_lock);
> > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > >
> > > >             reuse->num_socks--;
> > > >             reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > +           prog = rcu_dereference(reuse->prog);
> > > >
> > > >             if (sk->sk_protocol == IPPROTO_TCP) {
> > > > +                   if (reuse->num_socks && !prog)
> > > > +                           nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > +
> > > >                     reuse->num_closed_socks++;
> > > >                     reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > >             } else {
> > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > >             call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > >  out:
> > > >     spin_unlock_bh(&reuseport_lock);
> > > > +
> > > > +   return nsk;
> > > >  }
> > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > >
> > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > --- a/net/ipv4/inet_connection_sock.c
> > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > >  }
> > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > >
> > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > +{
> > > > +   struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > +
> > > > +   old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > +   new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > +
> > > > +   spin_lock(&old_accept_queue->rskq_lock);
> > > > +   spin_lock(&new_accept_queue->rskq_lock);
> > >
> > > Are you sure lockdep is happy with this ?
> > >
> > > I would guess it should complain, because :
> > >
> > > lock(A);
> > > lock(B);
> > > ...
> > > unlock(B);
> > > unlock(A);
> > >
> > > will fail when the opposite action happens eventually
> > >
> > > lock(B);
> > > lock(A);
> > > ...
> > > unlock(A);
> > > unlock(B);
> >
> > I enabled lockdep and did not see warnings of lockdep.
> >
> > Also, the inversion deadlock does not happen in this case.
> > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > from the eBPF map, so the old listener will not be selected as the new
> > listener.
> 
> Until the socket is closed, reallocated and used again. LOCKDEP has no
> idea about soreuseport logic.
> 
> If you run your tests long enough, lockdep should complain at some point.
> 
> git grep -n double_lock

Thank you, I will change the code like double_lock().
And I will try to continue testing lockdep without this change for
curiosity!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-03 14:16         ` Kuniyuki Iwashima
@ 2020-12-04  5:56           ` Martin KaFai Lau
  2020-12-06  4:32             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-04  5:56 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii.nakryiko, ast, benh, bpf, daniel, davem, edumazet, kuba,
	kuni1840, linux-kernel, netdev

On Thu, Dec 03, 2020 at 11:16:08PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Wed, 2 Dec 2020 20:24:02 -0800
> > On Wed, Dec 02, 2020 at 11:19:02AM -0800, Martin KaFai Lau wrote:
> > > On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > > > >
> > > > > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > > > > check if the attached eBPF program is capable of migrating sockets.
> > > > >
> > > > > When the eBPF program is attached, the kernel runs it for socket migration
> > > > > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > > > > The kernel will change the behaviour depending on the returned value:
> > > > >
> > > > >   - SK_PASS with selected_sk, select it as a new listener
> > > > >   - SK_PASS with selected_sk NULL, fall back to the random selection
> > > > >   - SK_DROP, cancel the migration
> > > > >
> > > > > Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
> > > > > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > ---
> > > > >  include/uapi/linux/bpf.h       | 2 ++
> > > > >  kernel/bpf/syscall.c           | 8 ++++++++
> > > > >  tools/include/uapi/linux/bpf.h | 2 ++
> > > > >  3 files changed, 12 insertions(+)
> > > > >
> > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > index 85278deff439..cfc207ae7782 100644
> > > > > --- a/include/uapi/linux/bpf.h
> > > > > +++ b/include/uapi/linux/bpf.h
> > > > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > > >         BPF_XDP_CPUMAP,
> > > > >         BPF_SK_LOOKUP,
> > > > >         BPF_XDP,
> > > > > +       BPF_SK_REUSEPORT_SELECT,
> > > > > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > > >         __MAX_BPF_ATTACH_TYPE
> > > > >  };
> > > > >
> > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > index f3fe9f53f93c..a0796a8de5ea 100644
> > > > > --- a/kernel/bpf/syscall.c
> > > > > +++ b/kernel/bpf/syscall.c
> > > > > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
> > > > >                 if (expected_attach_type == BPF_SK_LOOKUP)
> > > > >                         return 0;
> > > > >                 return -EINVAL;
> > > > > +       case BPF_PROG_TYPE_SK_REUSEPORT:
> > > > > +               switch (expected_attach_type) {
> > > > > +               case BPF_SK_REUSEPORT_SELECT:
> > > > > +               case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > > > > +                       return 0;
> > > > > +               default:
> > > > > +                       return -EINVAL;
> > > > > +               }
> > > > 
> > > > this is a kernel regression, previously expected_attach_type wasn't
> > > > enforced, so user-space could have provided any number without an
> > > > error.
> > > I also think this change alone will break things like when the usual
> > > attr->expected_attach_type == 0 case.  At least changes is needed in
> > > bpf_prog_load_fixup_attach_type() which is also handling a
> > > similar situation for BPF_PROG_TYPE_CGROUP_SOCK.
> > > 
> > > I now think there is no need to expose new bpf_attach_type to the UAPI.
> > > Since the prog->expected_attach_type is not used, it can be cleared at load time
> > > and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
> > > internally at filter.[c|h]) in the is_valid_access() when "migration"
> > > is accessed.  When "migration" is accessed, the bpf prog can handle
> > > migration (and the original not-migration) case.
> > Scrap this internal only BPF_SK_REUSEPORT_SELECT_OR_MIGRATE idea.
> > I think there will be cases that bpf prog wants to do both
> > without accessing any field from sk_reuseport_md.
> > 
> > Lets go back to the discussion on using a similar
> > idea as BPF_PROG_TYPE_CGROUP_SOCK in bpf_prog_load_fixup_attach_type().
> > I am not aware there is loader setting a random number
> > in expected_attach_type, so the chance of breaking
> > is very low.  There was a similar discussion earlier [0].
> > 
> > [0]: https://lore.kernel.org/netdev/20200126045443.f47dzxdglazzchfm@ast-mbp/
> 
> Thank you for the idea and reference.
> 
> I will remove the change in bpf_prog_load_check_attach() and set the
> default value (BPF_SK_REUSEPORT_SELECT) in bpf_prog_load_fixup_attach_type()
> for backward compatibility if expected_attach_type is 0.
check_attach_type() can be kept.  You can refer to
commit aac3fc320d94 for a similar situation.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
@ 2020-12-04 19:58   ` Martin KaFai Lau
  2020-12-06  4:36     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-04 19:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, netdev,
	linux-kernel

On Tue, Dec 01, 2020 at 11:44:16PM +0900, Kuniyuki Iwashima wrote:
> We will call sock_reuseport.prog for socket migration in the next commit,
> so the eBPF program has to know which listener is closing in order to
> select the new listener.
> 
> Currently, we can get a unique ID for each listener in the userspace by
> calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
> 
> This patch makes the sk pointer available in sk_reuseport_md so that we can
> get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.
> 
> Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
> Suggested-by: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/uapi/linux/bpf.h       |  8 ++++++++
>  net/core/filter.c              | 12 +++++++++++-
>  tools/include/uapi/linux/bpf.h |  8 ++++++++
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index efe342bf3dbc..3e9b8bd42b4e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1650,6 +1650,13 @@ union bpf_attr {
>   * 		A 8-byte long non-decreasing number on success, or 0 if the
>   * 		socket field is missing inside *skb*.
>   *
> + * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
> + * 	Description
> + * 		Equivalent to bpf_get_socket_cookie() helper that accepts
> + * 		*skb*, but gets socket from **struct bpf_sock** context.
> + * 	Return
> + * 		A 8-byte long non-decreasing number.
> + *
>   * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
>   * 	Description
>   * 		Equivalent to bpf_get_socket_cookie() helper that accepts
> @@ -4420,6 +4427,7 @@ struct sk_reuseport_md {
>  	__u32 bind_inany;	/* Is sock bound to an INANY address? */
>  	__u32 hash;		/* A hash of the packet 4 tuples */
>  	__u8 migration;		/* Migration type */
> +	__bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */
>  };
>  
>  #define BPF_TAG_SIZE	8
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0a0634787bb4..1059d31847ef 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4628,7 +4628,7 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = {
>  	.func		= bpf_get_socket_cookie_sock,
>  	.gpl_only	= false,
>  	.ret_type	= RET_INTEGER,
> -	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg1_type	= ARG_PTR_TO_SOCKET,
This will break existing bpf prog (BPF_PROG_TYPE_CGROUP_SOCK)
using this proto.  A new proto is needed and there is
an on-going patch doing this [0].

[0]: https://lore.kernel.org/bpf/20201203213330.1657666-1-revest@google.com/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
@ 2020-12-05  1:31   ` Martin KaFai Lau
  2020-12-06  4:38     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-05  1:31 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, netdev,
	linux-kernel

On Tue, Dec 01, 2020 at 11:44:08PM +0900, Kuniyuki Iwashima wrote:
> This patch is a preparation patch to migrate incoming connections in the
> later commits and adds a field (num_closed_socks) to the struct
> sock_reuseport to keep TCP_CLOSE sockets in the reuseport group.
> 
> When we close a listening socket, to migrate its connections to another
> listener in the same reuseport group, we have to handle two kinds of child
> sockets. One is that a listening socket has a reference to, and the other
> is not.
> 
> The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
> accept queue of their listening socket. So, we can pop them out and push
> them into another listener's queue at close() or shutdown() syscalls. On
> the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
> three-way handshake and not in the accept queue. Thus, we cannot access
> such sockets at close() or shutdown() syscalls. Accordingly, we have to
> migrate immature sockets after their listening socket has been closed.
> 
> Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
> sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
> that time, if we could select a new listener from the same reuseport group,
> no connection would be aborted. However, it is impossible because
> reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
> the reuseport group from closed sockets.
> 
> This patch allows TCP_CLOSE sockets to remain in the reuseport group and to
> have access to it while any child socket references to them. The point is
> that reuseport_detach_sock() is called twice from inet_unhash() and
> sk_destruct(). At first, it moves the socket backwards in socks[] and
> increments num_closed_socks. Later, when all migrated connections are
> accepted, it removes the socket from socks[], decrements num_closed_socks,
> and sets NULL to sk_reuseport_cb.
> 
> By this change, closed sockets can keep sk_reuseport_cb until all child
> requests have been freed or accepted. Consequently calling listen() after
> shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or
> inet_csk_bind_conflict() which expect that such sockets should not have the
> reuseport group. Therefore, this patch also loosens such validation rules
> so that the socket can listen again if it has the same reuseport group with
> other listening sockets.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/net/sock_reuseport.h    |  5 ++-
>  net/core/sock_reuseport.c       | 79 +++++++++++++++++++++++++++------
>  net/ipv4/inet_connection_sock.c |  7 ++-
>  3 files changed, 74 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> index 505f1e18e9bf..0e558ca7afbf 100644
> --- a/include/net/sock_reuseport.h
> +++ b/include/net/sock_reuseport.h
> @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
>  struct sock_reuseport {
>  	struct rcu_head		rcu;
>  
> -	u16			max_socks;	/* length of socks */
> -	u16			num_socks;	/* elements in socks */
> +	u16			max_socks;		/* length of socks */
> +	u16			num_socks;		/* elements in socks */
> +	u16			num_closed_socks;	/* closed elements in socks */
>  	/* The last synq overflow event timestamp of this
>  	 * reuse->socks[] group.
>  	 */
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index bbdd3c7b6cb5..fd133516ac0e 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -98,16 +98,21 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
>  		return NULL;
>  
>  	more_reuse->num_socks = reuse->num_socks;
> +	more_reuse->num_closed_socks = reuse->num_closed_socks;
>  	more_reuse->prog = reuse->prog;
>  	more_reuse->reuseport_id = reuse->reuseport_id;
>  	more_reuse->bind_inany = reuse->bind_inany;
>  	more_reuse->has_conns = reuse->has_conns;
> +	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
>  
>  	memcpy(more_reuse->socks, reuse->socks,
>  	       reuse->num_socks * sizeof(struct sock *));
> -	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
> +	memcpy(more_reuse->socks +
> +	       (more_reuse->max_socks - more_reuse->num_closed_socks),
> +	       reuse->socks + reuse->num_socks,
> +	       reuse->num_closed_socks * sizeof(struct sock *));
>  
> -	for (i = 0; i < reuse->num_socks; ++i)
> +	for (i = 0; i < reuse->max_socks; ++i)
>  		rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
>  				   more_reuse);
>  
> @@ -129,6 +134,25 @@ static void reuseport_free_rcu(struct rcu_head *head)
>  	kfree(reuse);
>  }
>  
> +static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk,
> +				bool closed)
> +{
> +	int left, right;
> +
> +	if (!closed) {
> +		left = 0;
> +		right = reuse->num_socks;
> +	} else {
> +		left = reuse->max_socks - reuse->num_closed_socks;
> +		right = reuse->max_socks;
> +	}
> +
> +	for (; left < right; left++)
> +		if (reuse->socks[left] == sk)
> +			return left;
> +	return -1;
> +}
> +
>  /**
>   *  reuseport_add_sock - Add a socket to the reuseport group of another.
>   *  @sk:  New socket to add to the group.
> @@ -153,12 +177,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
>  					  lockdep_is_held(&reuseport_lock));
>  	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
>  					     lockdep_is_held(&reuseport_lock));
> -	if (old_reuse && old_reuse->num_socks != 1) {
> +
> +	if (old_reuse == reuse) {
> +		int i = reuseport_sock_index(reuse, sk, true);
> +
> +		if (i == -1) {
When will this happen?

I found the new logic in the closed sk shuffling within socks[] quite
complicated to read.  I can see why the closed sk wants to keep its
sk->sk_reuseport_cb.  However, does it need to stay
in socks[]?


> +			spin_unlock_bh(&reuseport_lock);
> +			return -EBUSY;
> +		}
> +
> +		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
> +		reuse->num_closed_socks--;
> +	} else if (old_reuse && old_reuse->num_socks != 1) {
>  		spin_unlock_bh(&reuseport_lock);
>  		return -EBUSY;
>  	}
>  
> -	if (reuse->num_socks == reuse->max_socks) {
> +	if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
>  		reuse = reuseport_grow(reuse);
>  		if (!reuse) {
>  			spin_unlock_bh(&reuseport_lock);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
  2020-12-01 15:25   ` Eric Dumazet
@ 2020-12-05  1:42   ` Martin KaFai Lau
  2020-12-06  4:41     ` Kuniyuki Iwashima
       [not found]     ` <20201205160307.91179-1-kuniyu@amazon.co.jp>
  2020-12-08  6:54   ` Martin KaFai Lau
  2 siblings, 2 replies; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-05  1:42 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, netdev,
	linux-kernel

On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index fd133516ac0e..60d7c1f28809 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
>  }
>  EXPORT_SYMBOL(reuseport_add_sock);
>  
> -void reuseport_detach_sock(struct sock *sk)
> +struct sock *reuseport_detach_sock(struct sock *sk)
>  {
>  	struct sock_reuseport *reuse;
> +	struct bpf_prog *prog;
> +	struct sock *nsk = NULL;
>  	int i;
>  
>  	spin_lock_bh(&reuseport_lock);
> @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
>  
>  		reuse->num_socks--;
>  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> +		prog = rcu_dereference(reuse->prog);
Is it under rcu_read_lock() here?

>  
>  		if (sk->sk_protocol == IPPROTO_TCP) {
> +			if (reuse->num_socks && !prog)
> +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> +
>  			reuse->num_closed_socks++;
>  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
>  		} else {
> @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
>  		call_rcu(&reuse->rcu, reuseport_free_rcu);
>  out:
>  	spin_unlock_bh(&reuseport_lock);
> +
> +	return nsk;
>  }
>  EXPORT_SYMBOL(reuseport_detach_sock);
>  
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 1451aa9712b0..b27241ea96bd 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  }
>  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>  
> +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> +{
> +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> +
> +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> +
> +	spin_lock(&old_accept_queue->rskq_lock);
> +	spin_lock(&new_accept_queue->rskq_lock);
I am also not very thrilled on this double spin_lock.
Can this be done in (or like) inet_csk_listen_stop() instead?

> +
> +	if (old_accept_queue->rskq_accept_head) {
> +		if (new_accept_queue->rskq_accept_head)
> +			old_accept_queue->rskq_accept_tail->dl_next =
> +				new_accept_queue->rskq_accept_head;
> +		else
> +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> +
> +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> +		old_accept_queue->rskq_accept_head = NULL;
> +		old_accept_queue->rskq_accept_tail = NULL;
> +
> +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> +	}
> +
> +	spin_unlock(&new_accept_queue->rskq_lock);
> +	spin_unlock(&old_accept_queue->rskq_lock);
> +}
> +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> +
>  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
>  					 struct request_sock *req, bool own_req)
>  {
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 45fb450b4522..545538a6bfac 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk)
>  {
>  	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
>  	struct inet_listen_hashbucket *ilb = NULL;
> +	struct sock *nsk;
>  	spinlock_t *lock;
>  
>  	if (sk_unhashed(sk))
> @@ -696,8 +697,12 @@ void inet_unhash(struct sock *sk)
>  	if (sk_unhashed(sk))
>  		goto unlock;
>  
> -	if (rcu_access_pointer(sk->sk_reuseport_cb))
> -		reuseport_detach_sock(sk);
> +	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
> +		nsk = reuseport_detach_sock(sk);
> +		if (nsk)
> +			inet_csk_reqsk_queue_migrate(sk, nsk);
> +	}
> +
>  	if (ilb) {
>  		inet_unhash2(hashinfo, sk);
>  		ilb->count--;
> -- 
> 2.17.2 (Apple Git-113)
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
@ 2020-12-05  1:50   ` Martin KaFai Lau
  2020-12-06  4:43     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-05  1:50 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, netdev,
	linux-kernel

On Tue, Dec 01, 2020 at 11:44:18PM +0900, Kuniyuki Iwashima wrote:
> This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  .../bpf/prog_tests/migrate_reuseport.c        | 164 ++++++++++++++++++
>  .../bpf/progs/test_migrate_reuseport_kern.c   |  54 ++++++
>  2 files changed, 218 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> new file mode 100644
> index 000000000000..87c72d9ccadd
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> @@ -0,0 +1,164 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Check if we can migrate child sockets.
> + *
> + *   1. call listen() for 5 server sockets.
> + *   2. update a map to migrate all child socket
> + *        to the last server socket (migrate_map[cookie] = 4)
> + *   3. call connect() for 25 client sockets.
> + *   4. call close() for first 4 server sockets.
> + *   5. call accept() for the last server socket.
> + *
> + * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> + */
> +
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <fcntl.h>
> +#include <netinet/in.h>
> +#include <arpa/inet.h>
> +#include <linux/bpf.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#define NUM_SOCKS 5
> +#define LOCALHOST "127.0.0.1"
> +#define err_exit(condition, message)			      \
> +	do {						      \
> +		if (condition) {			      \
> +			perror("ERROR: " message " ");	      \
> +			exit(1);			      \
> +		}					      \
> +	} while (0)
> +
> +__u64 server_fds[NUM_SOCKS];
> +int prog_fd, reuseport_map_fd, migrate_map_fd;
> +
> +
> +void setup_bpf(void)
> +{
> +	struct bpf_object *obj;
> +	struct bpf_program *prog;
> +	struct bpf_map *reuseport_map, *migrate_map;
> +	int err;
> +
> +	obj = bpf_object__open("test_migrate_reuseport_kern.o");
> +	err_exit(libbpf_get_error(obj), "opening BPF object file failed");
> +
> +	err = bpf_object__load(obj);
> +	err_exit(err, "loading BPF object failed");
> +
> +	prog = bpf_program__next(NULL, obj);
> +	err_exit(!prog, "loading BPF program failed");
> +
> +	reuseport_map = bpf_object__find_map_by_name(obj, "reuseport_map");
> +	err_exit(!reuseport_map, "loading BPF reuseport_map failed");
> +
> +	migrate_map = bpf_object__find_map_by_name(obj, "migrate_map");
> +	err_exit(!migrate_map, "loading BPF migrate_map failed");
> +
> +	prog_fd = bpf_program__fd(prog);
> +	reuseport_map_fd = bpf_map__fd(reuseport_map);
> +	migrate_map_fd = bpf_map__fd(migrate_map);
> +}
> +
> +void test_listen(void)
> +{
> +	struct sockaddr_in addr;
> +	socklen_t addr_len = sizeof(addr);
> +	int i, err, optval = 1, migrated_to = NUM_SOCKS - 1;
> +	__u64 value;
> +
> +	addr.sin_family = AF_INET;
> +	addr.sin_port = htons(80);
> +	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
> +
> +	for (i = 0; i < NUM_SOCKS; i++) {
> +		server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> +		err_exit(server_fds[i] == -1, "socket() for listener sockets failed");
> +
> +		err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT,
> +				 &optval, sizeof(optval));
> +		err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed");
> +
> +		if (i == 0) {
> +			err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
> +					 &prog_fd, sizeof(prog_fd));
> +			err_exit(err == -1, "setsockopt() for SO_ATTACH_REUSEPORT_EBPF failed");
> +		}
> +
> +		err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len);
> +		err_exit(err == -1, "bind() failed");
> +
> +		err = listen(server_fds[i], 32);
> +		err_exit(err == -1, "listen() failed");
> +
> +		err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST);
> +		err_exit(err == -1, "updating BPF reuseport_map failed");
> +
> +		err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value);
> +		err_exit(err == -1, "looking up BPF reuseport_map failed");
> +
> +		printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, migrated_to);
> +		err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST);
> +		err_exit(err == -1, "updating BPF migrate_map failed");
> +	}
> +}
> +
> +void test_connect(void)
> +{
> +	struct sockaddr_in addr;
> +	socklen_t addr_len = sizeof(addr);
> +	int i, err, client_fd;
> +
> +	addr.sin_family = AF_INET;
> +	addr.sin_port = htons(80);
> +	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
> +
> +	for (i = 0; i < NUM_SOCKS * 5; i++) {
> +		client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> +		err_exit(client_fd == -1, "socket() for listener sockets failed");
> +
> +		err = connect(client_fd, (struct sockaddr *)&addr, addr_len);
> +		err_exit(err == -1, "connect() failed");
> +
> +		close(client_fd);
> +	}
> +}
> +
> +void test_close(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < NUM_SOCKS - 1; i++)
> +		close(server_fds[i]);
> +}
> +
> +void test_accept(void)
> +{
> +	struct sockaddr_in addr;
> +	socklen_t addr_len = sizeof(addr);
> +	int cnt, client_fd;
> +
> +	fcntl(server_fds[NUM_SOCKS - 1], F_SETFL, O_NONBLOCK);
> +
> +	for (cnt = 0; cnt < NUM_SOCKS * 5; cnt++) {
> +		client_fd = accept(server_fds[NUM_SOCKS - 1], (struct sockaddr *)&addr, &addr_len);
> +		err_exit(client_fd == -1, "accept() failed");
> +	}
> +
> +	printf("%d accepted, %d is expected\n", cnt, NUM_SOCKS * 5);
> +}
> +
> +int main(void)
I am pretty sure "make -C tools/testing/selftests/bpf"
will not compile here because of double main() with
the test_progs.c.

Please take a look at how other tests are written in
tools/testing/selftests/bpf/prog_tests/. e.g.
the test function in tcp_hdr_options.c is
test_tcp_hdr_options().

Also, instead of bpf_object__open(), please use skeleton
like most of the tests do.


> +{
> +	setup_bpf();
> +	test_listen();
> +	test_connect();
> +	test_close();
> +	test_accept();
> +	close(server_fds[NUM_SOCKS - 1]);
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-04  5:56           ` Martin KaFai Lau
@ 2020-12-06  4:32             ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-06  4:32 UTC (permalink / raw)
  To: kuniyu; +Cc: kuni1840, netdev, bpf, linux-kernel

I'm sending this mail just for logging because I failed to send mails only
to LKML, netdev, and bpf yesterday.


From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 3 Dec 2020 21:56:53 -0800
> On Thu, Dec 03, 2020 at 11:16:08PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Wed, 2 Dec 2020 20:24:02 -0800
> > > On Wed, Dec 02, 2020 at 11:19:02AM -0800, Martin KaFai Lau wrote:
> > > > On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> > > > > On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > > > > >
> > > > > > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > > > > > check if the attached eBPF program is capable of migrating sockets.
> > > > > >
> > > > > > When the eBPF program is attached, the kernel runs it for socket migration
> > > > > > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > > > > > The kernel will change the behaviour depending on the returned value:
> > > > > >
> > > > > >   - SK_PASS with selected_sk, select it as a new listener
> > > > > >   - SK_PASS with selected_sk NULL, fall back to the random selection
> > > > > >   - SK_DROP, cancel the migration
> > > > > >
> > > > > > Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
> > > > > > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > > ---
> > > > > >  include/uapi/linux/bpf.h       | 2 ++
> > > > > >  kernel/bpf/syscall.c           | 8 ++++++++
> > > > > >  tools/include/uapi/linux/bpf.h | 2 ++
> > > > > >  3 files changed, 12 insertions(+)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > index 85278deff439..cfc207ae7782 100644
> > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > > > >         BPF_XDP_CPUMAP,
> > > > > >         BPF_SK_LOOKUP,
> > > > > >         BPF_XDP,
> > > > > > +       BPF_SK_REUSEPORT_SELECT,
> > > > > > +       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > > > >         __MAX_BPF_ATTACH_TYPE
> > > > > >  };
> > > > > >
> > > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > > index f3fe9f53f93c..a0796a8de5ea 100644
> > > > > > --- a/kernel/bpf/syscall.c
> > > > > > +++ b/kernel/bpf/syscall.c
> > > > > > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
> > > > > >                 if (expected_attach_type == BPF_SK_LOOKUP)
> > > > > >                         return 0;
> > > > > >                 return -EINVAL;
> > > > > > +       case BPF_PROG_TYPE_SK_REUSEPORT:
> > > > > > +               switch (expected_attach_type) {
> > > > > > +               case BPF_SK_REUSEPORT_SELECT:
> > > > > > +               case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > > > > > +                       return 0;
> > > > > > +               default:
> > > > > > +                       return -EINVAL;
> > > > > > +               }
> > > > > 
> > > > > this is a kernel regression, previously expected_attach_type wasn't
> > > > > enforced, so user-space could have provided any number without an
> > > > > error.
> > > > I also think this change alone will break things like when the usual
> > > > attr->expected_attach_type == 0 case.  At least changes is needed in
> > > > bpf_prog_load_fixup_attach_type() which is also handling a
> > > > similar situation for BPF_PROG_TYPE_CGROUP_SOCK.
> > > > 
> > > > I now think there is no need to expose new bpf_attach_type to the UAPI.
> > > > Since the prog->expected_attach_type is not used, it can be cleared at load time
> > > > and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
> > > > internally at filter.[c|h]) in the is_valid_access() when "migration"
> > > > is accessed.  When "migration" is accessed, the bpf prog can handle
> > > > migration (and the original not-migration) case.
> > > Scrap this internal only BPF_SK_REUSEPORT_SELECT_OR_MIGRATE idea.
> > > I think there will be cases that bpf prog wants to do both
> > > without accessing any field from sk_reuseport_md.
> > > 
> > > Lets go back to the discussion on using a similar
> > > idea as BPF_PROG_TYPE_CGROUP_SOCK in bpf_prog_load_fixup_attach_type().
> > > I am not aware there is loader setting a random number
> > > in expected_attach_type, so the chance of breaking
> > > is very low.  There was a similar discussion earlier [0].
> > > 
> > > [0]: https://lore.kernel.org/netdev/20200126045443.f47dzxdglazzchfm@ast-mbp/
> > 
> > Thank you for the idea and reference.
> > 
> > I will remove the change in bpf_prog_load_check_attach() and set the
> > default value (BPF_SK_REUSEPORT_SELECT) in bpf_prog_load_fixup_attach_type()
> > for backward compatibility if expected_attach_type is 0.
> check_attach_type() can be kept.  You can refer to
> commit aac3fc320d94 for a similar situation.

I confirmed bpf_prog_load_fixup_attach_type() is called just before
bpf_prog_load_check_attach(), so I will add the fixup code to this patch.
Thank you.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT.
  2020-12-04 19:58   ` Martin KaFai Lau
@ 2020-12-06  4:36     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-06  4:36 UTC (permalink / raw)
  To: kuniyu; +Cc: kuni1840, netdev, bpf, linux-kernel

I'm sending this mail just for logging because I failed to send mails only 
to LKML, netdev, and bpf yesterday.


From:   Martin KaFai Lau <kafai@fb.com>
Date:   Fri, 4 Dec 2020 11:58:07 -0800
> On Tue, Dec 01, 2020 at 11:44:16PM +0900, Kuniyuki Iwashima wrote:
> > We will call sock_reuseport.prog for socket migration in the next commit,
> > so the eBPF program has to know which listener is closing in order to
> > select the new listener.
> > 
> > Currently, we can get a unique ID for each listener in the userspace by
> > calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
> > 
> > This patch makes the sk pointer available in sk_reuseport_md so that we can
> > get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.
> > 
> > Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
> > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/uapi/linux/bpf.h       |  8 ++++++++
> >  net/core/filter.c              | 12 +++++++++++-
> >  tools/include/uapi/linux/bpf.h |  8 ++++++++
> >  3 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index efe342bf3dbc..3e9b8bd42b4e 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1650,6 +1650,13 @@ union bpf_attr {
> >   * 		A 8-byte long non-decreasing number on success, or 0 if the
> >   * 		socket field is missing inside *skb*.
> >   *
> > + * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
> > + * 	Description
> > + * 		Equivalent to bpf_get_socket_cookie() helper that accepts
> > + * 		*skb*, but gets socket from **struct bpf_sock** context.
> > + * 	Return
> > + * 		A 8-byte long non-decreasing number.
> > + *
> >   * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
> >   * 	Description
> >   * 		Equivalent to bpf_get_socket_cookie() helper that accepts
> > @@ -4420,6 +4427,7 @@ struct sk_reuseport_md {
> >  	__u32 bind_inany;	/* Is sock bound to an INANY address? */
> >  	__u32 hash;		/* A hash of the packet 4 tuples */
> >  	__u8 migration;		/* Migration type */
> > +	__bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */
> >  };
> >  
> >  #define BPF_TAG_SIZE	8
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 0a0634787bb4..1059d31847ef 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4628,7 +4628,7 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = {
> >  	.func		= bpf_get_socket_cookie_sock,
> >  	.gpl_only	= false,
> >  	.ret_type	= RET_INTEGER,
> > -	.arg1_type	= ARG_PTR_TO_CTX,
> > +	.arg1_type	= ARG_PTR_TO_SOCKET,
> This will break existing bpf prog (BPF_PROG_TYPE_CGROUP_SOCK)
> using this proto.  A new proto is needed and there is
> an on-going patch doing this [0].
> 
> [0]: https://lore.kernel.org/bpf/20201203213330.1657666-1-revest@google.com/

Thank you for notifying me of this patch!
I will define another proto, but may drop the part if the above patch is
already merged then.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group.
  2020-12-05  1:31   ` Martin KaFai Lau
@ 2020-12-06  4:38     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-06  4:38 UTC (permalink / raw)
  To: kuniyu; +Cc: kuni1840, netdev, bpf, linux-kernel

I'm sending this mail just for logging because I failed to send mails only 
to LKML, netdev, and bpf yesterday.


From:   Martin KaFai Lau <kafai@fb.com>
Date:   Fri, 4 Dec 2020 17:31:03 -0800
> On Tue, Dec 01, 2020 at 11:44:08PM +0900, Kuniyuki Iwashima wrote:
> > This patch is a preparation patch to migrate incoming connections in the
> > later commits and adds a field (num_closed_socks) to the struct
> > sock_reuseport to keep TCP_CLOSE sockets in the reuseport group.
> > 
> > When we close a listening socket, to migrate its connections to another
> > listener in the same reuseport group, we have to handle two kinds of child
> > sockets. One is that a listening socket has a reference to, and the other
> > is not.
> > 
> > The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
> > accept queue of their listening socket. So, we can pop them out and push
> > them into another listener's queue at close() or shutdown() syscalls. On
> > the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
> > three-way handshake and not in the accept queue. Thus, we cannot access
> > such sockets at close() or shutdown() syscalls. Accordingly, we have to
> > migrate immature sockets after their listening socket has been closed.
> > 
> > Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
> > sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
> > that time, if we could select a new listener from the same reuseport group,
> > no connection would be aborted. However, it is impossible because
> > reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
> > the reuseport group from closed sockets.
> > 
> > This patch allows TCP_CLOSE sockets to remain in the reuseport group and to
> > have access to it while any child socket references to them. The point is
> > that reuseport_detach_sock() is called twice from inet_unhash() and
> > sk_destruct(). At first, it moves the socket backwards in socks[] and
> > increments num_closed_socks. Later, when all migrated connections are
> > accepted, it removes the socket from socks[], decrements num_closed_socks,
> > and sets NULL to sk_reuseport_cb.
> > 
> > By this change, closed sockets can keep sk_reuseport_cb until all child
> > requests have been freed or accepted. Consequently calling listen() after
> > shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or
> > inet_csk_bind_conflict() which expect that such sockets should not have the
> > reuseport group. Therefore, this patch also loosens such validation rules
> > so that the socket can listen again if it has the same reuseport group with
> > other listening sockets.
> > 
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/net/sock_reuseport.h    |  5 ++-
> >  net/core/sock_reuseport.c       | 79 +++++++++++++++++++++++++++------
> >  net/ipv4/inet_connection_sock.c |  7 ++-
> >  3 files changed, 74 insertions(+), 17 deletions(-)
> > 
> > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > index 505f1e18e9bf..0e558ca7afbf 100644
> > --- a/include/net/sock_reuseport.h
> > +++ b/include/net/sock_reuseport.h
> > @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
> >  struct sock_reuseport {
> >  	struct rcu_head		rcu;
> >  
> > -	u16			max_socks;	/* length of socks */
> > -	u16			num_socks;	/* elements in socks */
> > +	u16			max_socks;		/* length of socks */
> > +	u16			num_socks;		/* elements in socks */
> > +	u16			num_closed_socks;	/* closed elements in socks */
> >  	/* The last synq overflow event timestamp of this
> >  	 * reuse->socks[] group.
> >  	 */
> > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > index bbdd3c7b6cb5..fd133516ac0e 100644
> > --- a/net/core/sock_reuseport.c
> > +++ b/net/core/sock_reuseport.c
> > @@ -98,16 +98,21 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
> >  		return NULL;
> >  
> >  	more_reuse->num_socks = reuse->num_socks;
> > +	more_reuse->num_closed_socks = reuse->num_closed_socks;
> >  	more_reuse->prog = reuse->prog;
> >  	more_reuse->reuseport_id = reuse->reuseport_id;
> >  	more_reuse->bind_inany = reuse->bind_inany;
> >  	more_reuse->has_conns = reuse->has_conns;
> > +	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
> >  
> >  	memcpy(more_reuse->socks, reuse->socks,
> >  	       reuse->num_socks * sizeof(struct sock *));
> > -	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
> > +	memcpy(more_reuse->socks +
> > +	       (more_reuse->max_socks - more_reuse->num_closed_socks),
> > +	       reuse->socks + reuse->num_socks,
> > +	       reuse->num_closed_socks * sizeof(struct sock *));
> >  
> > -	for (i = 0; i < reuse->num_socks; ++i)
> > +	for (i = 0; i < reuse->max_socks; ++i)
> >  		rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
> >  				   more_reuse);
> >  
> > @@ -129,6 +134,25 @@ static void reuseport_free_rcu(struct rcu_head *head)
> >  	kfree(reuse);
> >  }
> >  
> > +static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk,
> > +				bool closed)
> > +{
> > +	int left, right;
> > +
> > +	if (!closed) {
> > +		left = 0;
> > +		right = reuse->num_socks;
> > +	} else {
> > +		left = reuse->max_socks - reuse->num_closed_socks;
> > +		right = reuse->max_socks;
> > +	}
> > +
> > +	for (; left < right; left++)
> > +		if (reuse->socks[left] == sk)
> > +			return left;
> > +	return -1;
> > +}
> > +
> >  /**
> >   *  reuseport_add_sock - Add a socket to the reuseport group of another.
> >   *  @sk:  New socket to add to the group.
> > @@ -153,12 +177,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> >  					  lockdep_is_held(&reuseport_lock));
> >  	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
> >  					     lockdep_is_held(&reuseport_lock));
> > -	if (old_reuse && old_reuse->num_socks != 1) {
> > +
> > +	if (old_reuse == reuse) {
> > +		int i = reuseport_sock_index(reuse, sk, true);
> > +
> > +		if (i == -1) {
> When will this happen?

I understood the original code did nothing if the sk was not found in
socks[], so I rewrote it this way, but I also think `i` will never be -1.

If I rewrite, it will be like:

---8<---
for (; left < right; left++)
    if (reuse->socks[left] == sk)
        break;
return left;
---8<---


> I found the new logic in the closed sk shuffling within socks[] quite
> complicated to read.  I can see why the closed sk wants to keep its
> sk->sk_reuseport_cb.  However, does it need to stay
> in socks[]?

Currently, I do not use closed sockets in socks[], so the only thing I need
to do seems to be to count num_closed_socks to free struct sock_reuseport.
I will change the code only to keep sk_reuseport_cb and count
num_closed_socks.

(As a side note, I wrote the code while thinking of stack and heap to share
the same array, but I also feel a bit difficult to read.)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-05  1:42   ` Martin KaFai Lau
@ 2020-12-06  4:41     ` Kuniyuki Iwashima
       [not found]     ` <20201205160307.91179-1-kuniyu@amazon.co.jp>
  1 sibling, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-06  4:41 UTC (permalink / raw)
  To: kuniyu; +Cc: kuni1840, netdev, bpf, linux-kernel

I'm sending this mail just for logging because I failed to send mails only 
to LKML, netdev, and bpf yesterday.


From:   Martin KaFai Lau <kafai@fb.com>
Date:   Fri, 4 Dec 2020 17:42:41 -0800
> On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> [ ... ]
> > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > index fd133516ac0e..60d7c1f28809 100644
> > --- a/net/core/sock_reuseport.c
> > +++ b/net/core/sock_reuseport.c
> > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> >  }
> >  EXPORT_SYMBOL(reuseport_add_sock);
> >  
> > -void reuseport_detach_sock(struct sock *sk)
> > +struct sock *reuseport_detach_sock(struct sock *sk)
> >  {
> >  	struct sock_reuseport *reuse;
> > +	struct bpf_prog *prog;
> > +	struct sock *nsk = NULL;
> >  	int i;
> >  
> >  	spin_lock_bh(&reuseport_lock);
> > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> >  
> >  		reuse->num_socks--;
> >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > +		prog = rcu_dereference(reuse->prog);
> Is it under rcu_read_lock() here?

reuseport_lock is locked in this function, and we do not modify the prog,
but is rcu_dereference_protected() preferable?

---8<---
prog = rcu_dereference_protected(reuse->prog,
				 lockdep_is_held(&reuseport_lock));
---8<---


> >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > +			if (reuse->num_socks && !prog)
> > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > +
> >  			reuse->num_closed_socks++;
> >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> >  		} else {
> > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> >  out:
> >  	spin_unlock_bh(&reuseport_lock);
> > +
> > +	return nsk;
> >  }
> >  EXPORT_SYMBOL(reuseport_detach_sock);
> >  
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 1451aa9712b0..b27241ea96bd 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> >  }
> >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> >  
> > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > +{
> > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > +
> > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > +
> > +	spin_lock(&old_accept_queue->rskq_lock);
> > +	spin_lock(&new_accept_queue->rskq_lock);
> I am also not very thrilled on this double spin_lock.
> Can this be done in (or like) inet_csk_listen_stop() instead?

It will be possible to migrate sockets in inet_csk_listen_stop(), but I
think it is better to do it just after reuseport_detach_sock() becuase we
can select a different listener (almost) every time at a lower cost by
selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
easily.

sk_hash of the listener is 0, so we would have to generate a random number
in inet_csk_listen_stop().

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
  2020-12-05  1:50   ` Martin KaFai Lau
@ 2020-12-06  4:43     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-06  4:43 UTC (permalink / raw)
  To: kuniyu; +Cc: kuni1840, netdev, bpf, linux-kernel

I'm sending this mail just for logging because I failed to send mails only 
to LKML, netdev, and bpf yesterday.


From:   Martin KaFai Lau <kafai@fb.com>
Date:   Fri, 4 Dec 2020 17:50:00 -0800
> On Tue, Dec 01, 2020 at 11:44:18PM +0900, Kuniyuki Iwashima wrote:
> > This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > 
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  .../bpf/prog_tests/migrate_reuseport.c        | 164 ++++++++++++++++++
> >  .../bpf/progs/test_migrate_reuseport_kern.c   |  54 ++++++
> >  2 files changed, 218 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
> > 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> > new file mode 100644
> > index 000000000000..87c72d9ccadd
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> > @@ -0,0 +1,164 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Check if we can migrate child sockets.
> > + *
> > + *   1. call listen() for 5 server sockets.
> > + *   2. update a map to migrate all child socket
> > + *        to the last server socket (migrate_map[cookie] = 4)
> > + *   3. call connect() for 25 client sockets.
> > + *   4. call close() for first 4 server sockets.
> > + *   5. call accept() for the last server socket.
> > + *
> > + * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > + */
> > +
> > +#include <stdlib.h>
> > +#include <unistd.h>
> > +#include <fcntl.h>
> > +#include <netinet/in.h>
> > +#include <arpa/inet.h>
> > +#include <linux/bpf.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +#include <bpf/bpf.h>
> > +#include <bpf/libbpf.h>
> > +
> > +#define NUM_SOCKS 5
> > +#define LOCALHOST "127.0.0.1"
> > +#define err_exit(condition, message)			      \
> > +	do {						      \
> > +		if (condition) {			      \
> > +			perror("ERROR: " message " ");	      \
> > +			exit(1);			      \
> > +		}					      \
> > +	} while (0)
> > +
> > +__u64 server_fds[NUM_SOCKS];
> > +int prog_fd, reuseport_map_fd, migrate_map_fd;
> > +
> > +
> > +void setup_bpf(void)
> > +{
> > +	struct bpf_object *obj;
> > +	struct bpf_program *prog;
> > +	struct bpf_map *reuseport_map, *migrate_map;
> > +	int err;
> > +
> > +	obj = bpf_object__open("test_migrate_reuseport_kern.o");
> > +	err_exit(libbpf_get_error(obj), "opening BPF object file failed");
> > +
> > +	err = bpf_object__load(obj);
> > +	err_exit(err, "loading BPF object failed");
> > +
> > +	prog = bpf_program__next(NULL, obj);
> > +	err_exit(!prog, "loading BPF program failed");
> > +
> > +	reuseport_map = bpf_object__find_map_by_name(obj, "reuseport_map");
> > +	err_exit(!reuseport_map, "loading BPF reuseport_map failed");
> > +
> > +	migrate_map = bpf_object__find_map_by_name(obj, "migrate_map");
> > +	err_exit(!migrate_map, "loading BPF migrate_map failed");
> > +
> > +	prog_fd = bpf_program__fd(prog);
> > +	reuseport_map_fd = bpf_map__fd(reuseport_map);
> > +	migrate_map_fd = bpf_map__fd(migrate_map);
> > +}
> > +
> > +void test_listen(void)
> > +{
> > +	struct sockaddr_in addr;
> > +	socklen_t addr_len = sizeof(addr);
> > +	int i, err, optval = 1, migrated_to = NUM_SOCKS - 1;
> > +	__u64 value;
> > +
> > +	addr.sin_family = AF_INET;
> > +	addr.sin_port = htons(80);
> > +	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
> > +
> > +	for (i = 0; i < NUM_SOCKS; i++) {
> > +		server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> > +		err_exit(server_fds[i] == -1, "socket() for listener sockets failed");
> > +
> > +		err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT,
> > +				 &optval, sizeof(optval));
> > +		err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed");
> > +
> > +		if (i == 0) {
> > +			err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
> > +					 &prog_fd, sizeof(prog_fd));
> > +			err_exit(err == -1, "setsockopt() for SO_ATTACH_REUSEPORT_EBPF failed");
> > +		}
> > +
> > +		err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len);
> > +		err_exit(err == -1, "bind() failed");
> > +
> > +		err = listen(server_fds[i], 32);
> > +		err_exit(err == -1, "listen() failed");
> > +
> > +		err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST);
> > +		err_exit(err == -1, "updating BPF reuseport_map failed");
> > +
> > +		err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value);
> > +		err_exit(err == -1, "looking up BPF reuseport_map failed");
> > +
> > +		printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, migrated_to);
> > +		err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST);
> > +		err_exit(err == -1, "updating BPF migrate_map failed");
> > +	}
> > +}
> > +
> > +void test_connect(void)
> > +{
> > +	struct sockaddr_in addr;
> > +	socklen_t addr_len = sizeof(addr);
> > +	int i, err, client_fd;
> > +
> > +	addr.sin_family = AF_INET;
> > +	addr.sin_port = htons(80);
> > +	inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
> > +
> > +	for (i = 0; i < NUM_SOCKS * 5; i++) {
> > +		client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> > +		err_exit(client_fd == -1, "socket() for listener sockets failed");
> > +
> > +		err = connect(client_fd, (struct sockaddr *)&addr, addr_len);
> > +		err_exit(err == -1, "connect() failed");
> > +
> > +		close(client_fd);
> > +	}
> > +}
> > +
> > +void test_close(void)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < NUM_SOCKS - 1; i++)
> > +		close(server_fds[i]);
> > +}
> > +
> > +void test_accept(void)
> > +{
> > +	struct sockaddr_in addr;
> > +	socklen_t addr_len = sizeof(addr);
> > +	int cnt, client_fd;
> > +
> > +	fcntl(server_fds[NUM_SOCKS - 1], F_SETFL, O_NONBLOCK);
> > +
> > +	for (cnt = 0; cnt < NUM_SOCKS * 5; cnt++) {
> > +		client_fd = accept(server_fds[NUM_SOCKS - 1], (struct sockaddr *)&addr, &addr_len);
> > +		err_exit(client_fd == -1, "accept() failed");
> > +	}
> > +
> > +	printf("%d accepted, %d is expected\n", cnt, NUM_SOCKS * 5);
> > +}
> > +
> > +int main(void)
> I am pretty sure "make -C tools/testing/selftests/bpf"
> will not compile here because of double main() with
> the test_progs.c.
> 
> Please take a look at how other tests are written in
> tools/testing/selftests/bpf/prog_tests/. e.g.
> the test function in tcp_hdr_options.c is
> test_tcp_hdr_options().
> 
> Also, instead of bpf_object__open(), please use skeleton
> like most of the tests do.

I'm sorry... I will check other tests and rewrite this patch along them.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
       [not found]     ` <20201205160307.91179-1-kuniyu@amazon.co.jp>
@ 2020-12-07 20:14       ` Martin KaFai Lau
  2020-12-08  6:27         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-07 20:14 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > [ ... ]
> > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > index fd133516ac0e..60d7c1f28809 100644
> > > --- a/net/core/sock_reuseport.c
> > > +++ b/net/core/sock_reuseport.c
> > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > >  }
> > >  EXPORT_SYMBOL(reuseport_add_sock);
> > >  
> > > -void reuseport_detach_sock(struct sock *sk)
> > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > >  {
> > >  	struct sock_reuseport *reuse;
> > > +	struct bpf_prog *prog;
> > > +	struct sock *nsk = NULL;
> > >  	int i;
> > >  
> > >  	spin_lock_bh(&reuseport_lock);
> > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > >  
> > >  		reuse->num_socks--;
> > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > +		prog = rcu_dereference(reuse->prog);
> > Is it under rcu_read_lock() here?
> 
> reuseport_lock is locked in this function, and we do not modify the prog,
> but is rcu_dereference_protected() preferable?
> 
> ---8<---
> prog = rcu_dereference_protected(reuse->prog,
> 				 lockdep_is_held(&reuseport_lock));
> ---8<---
It is not only reuse->prog.  Other things also require rcu_read_lock(),
e.g. please take a look at __htab_map_lookup_elem().

The TCP_LISTEN sk (selected by bpf to be the target of the migration)
is also protected by rcu.

I am surprised there is no WARNING in the test.
Do you have the needed DEBUG_LOCK* config enabled?

> > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > +			if (reuse->num_socks && !prog)
> > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > +
> > >  			reuse->num_closed_socks++;
> > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > >  		} else {
> > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > >  out:
> > >  	spin_unlock_bh(&reuseport_lock);
> > > +
> > > +	return nsk;
> > >  }
> > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > >  
> > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > index 1451aa9712b0..b27241ea96bd 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >  }
> > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > >  
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > +{
> > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > +
> > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > +
> > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > +	spin_lock(&new_accept_queue->rskq_lock);
> > I am also not very thrilled on this double spin_lock.
> > Can this be done in (or like) inet_csk_listen_stop() instead?
> 
> It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> think it is better to do it just after reuseport_detach_sock() becuase we
> can select a different listener (almost) every time at a lower cost by
> selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
> easily.
I don't see the "lower cost" point.  Please elaborate.

> 
> sk_hash of the listener is 0, so we would have to generate a random number
> in inet_csk_listen_stop().
If I read it correctly, it is also passing 0 as the sk_hash to
bpf_run_sk_reuseport() from reuseport_detach_sock().

Also, how is the sk_hash expected to be used?  I don't see
it in the test.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-03 14:14     ` Kuniyuki Iwashima
  2020-12-03 14:31       ` Eric Dumazet
@ 2020-12-07 20:33       ` Martin KaFai Lau
  2020-12-08  6:31         ` Kuniyuki Iwashima
  1 sibling, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-07 20:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: eric.dumazet, ast, benh, bpf, daniel, davem, edumazet, kuba,
	kuni1840, linux-kernel, netdev

On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> From:   Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > which is used only by inet_unhash(). If it is not NULL,
> > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > sockets from the closing listener to the selected one.
> > > 
> > > Listening sockets hold incoming connections as a linked list of struct
> > > request_sock in the accept queue, and each request has reference to a full
> > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > the requests from the closing listener's queue and relink them to the head
> > > of the new listener's queue. We do not process each request and its
> > > reference to the listener, so the migration completes in O(1) time
> > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > care in the next commit.
> > > 
> > > By default, the kernel selects a new listener randomly. In order to pick
> > > out a different socket every time, we select the last element of socks[] as
> > > the new listener. This behaviour is based on how the kernel moves sockets
> > > in socks[]. (See also [1])
> > > 
> > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > program called in the later commit, but as the side effect of such default
> > > selection, the kernel can redistribute old requests evenly to new listeners
> > > for a specific case where the application replaces listeners by
> > > generations.
> > > 
> > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > first two by turns. The sockets move in socks[] like below.
> > > 
> > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > >   socks[2] : C   |      socks[2] : C --'
> > >   socks[3] : D --'
> > > 
> > > Then, if C and D have newer settings than A and B, and each socket has a
> > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > requests evenly to new listeners.
> > > 
> > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > >   socks[3] : D (d) --'
> > > 
> > > Here, (A, D) or (B, C) can have different application settings, but they
> > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > error may happen. For instance, if only the new listeners have
> > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > face inconsistency and cause an error.
> > > 
> > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > program described in later commits.
> > > 
> > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > ---
> > >  include/net/inet_connection_sock.h |  1 +
> > >  include/net/sock_reuseport.h       |  2 +-
> > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >  				      struct request_sock *req,
> > >  				      struct sock *child);
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > >  				   unsigned long timeout);
> > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > --- a/include/net/sock_reuseport.h
> > > +++ b/include/net/sock_reuseport.h
> > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > >  			      bool bind_inany);
> > > -extern void reuseport_detach_sock(struct sock *sk);
> > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > >  					  u32 hash,
> > >  					  struct sk_buff *skb,
> > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > index fd133516ac0e..60d7c1f28809 100644
> > > --- a/net/core/sock_reuseport.c
> > > +++ b/net/core/sock_reuseport.c
> > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > >  }
> > >  EXPORT_SYMBOL(reuseport_add_sock);
> > >  
> > > -void reuseport_detach_sock(struct sock *sk)
> > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > >  {
> > >  	struct sock_reuseport *reuse;
> > > +	struct bpf_prog *prog;
> > > +	struct sock *nsk = NULL;
> > >  	int i;
> > >  
> > >  	spin_lock_bh(&reuseport_lock);
> > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > >  
> > >  		reuse->num_socks--;
> > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > +		prog = rcu_dereference(reuse->prog);
> > >  
> > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > +			if (reuse->num_socks && !prog)
> > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > +
> > >  			reuse->num_closed_socks++;
> > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > >  		} else {
> > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > >  out:
> > >  	spin_unlock_bh(&reuseport_lock);
> > > +
> > > +	return nsk;
> > >  }
> > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > >  
> > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > index 1451aa9712b0..b27241ea96bd 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >  }
> > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > >  
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > +{
> > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > +
> > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > +
> > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > +	spin_lock(&new_accept_queue->rskq_lock);
> > 
> > Are you sure lockdep is happy with this ?
> > 
> > I would guess it should complain, because :
> > 
> > lock(A);
> > lock(B);
> > ...
> > unlock(B);
> > unlock(A);
> > 
> > will fail when the opposite action happens eventually
> > 
> > lock(B);
> > lock(A);
> > ...
> > unlock(A);
> > unlock(B);
> 
> I enabled lockdep and did not see warnings of lockdep.
> 
> Also, the inversion deadlock does not happen in this case.
> In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> from the eBPF map, so the old listener will not be selected as the new
> listener.
> 
> 
> > > +
> > > +	if (old_accept_queue->rskq_accept_head) {
> > > +		if (new_accept_queue->rskq_accept_head)
> > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > +				new_accept_queue->rskq_accept_head;
> > > +		else
> > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > +
> > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > +		old_accept_queue->rskq_accept_head = NULL;
> > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > +
> > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > +	}
> > > +
> > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > +}
> > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > 
> > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> 
> I will squash the two or reorganize them into definition part and migration
> part.
> 
> 
> > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > this is how we designed things (each request socket has a reference taken on the listener)
> > 
> > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > 
> > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
I also have similar concern on the inconsistency in req->rsk_listener.

The fix-up in req->rsk_listener for the TFO req in patch 4
makes it clear that req->rsk_listener should be updated during
the migration instead of asking a much later code path
to accommodate this inconsistent req->rsk_listener pointer.

The current inet_csk_listen_stop() is already iterating
the icsk_accept_queue and fastopenq.  The extra cost
in updating rsk_listener may be just noise?

> > 
> > I feel the order of your patches is not correct.
> 
> I understand this series is against the design.
> But once the requests sockets are added in the queue, they are accessed
> from the accept queue, and then we have the correct listener and can
> rewirte rsk_listener. Otherwise, their full socket are accessed instead.
> 
> Also, as far as I know, such BUG_ON was only in inet_child_forget().

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-07 20:14       ` Martin KaFai Lau
@ 2020-12-08  6:27         ` Kuniyuki Iwashima
  2020-12-08  8:13           ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-08  6:27 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Mon, 7 Dec 2020 12:14:38 -0800
> On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > > [ ... ]
> > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > index fd133516ac0e..60d7c1f28809 100644
> > > > --- a/net/core/sock_reuseport.c
> > > > +++ b/net/core/sock_reuseport.c
> > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > >  }
> > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > >  
> > > > -void reuseport_detach_sock(struct sock *sk)
> > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > >  {
> > > >  	struct sock_reuseport *reuse;
> > > > +	struct bpf_prog *prog;
> > > > +	struct sock *nsk = NULL;
> > > >  	int i;
> > > >  
> > > >  	spin_lock_bh(&reuseport_lock);
> > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > >  
> > > >  		reuse->num_socks--;
> > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > +		prog = rcu_dereference(reuse->prog);
> > > Is it under rcu_read_lock() here?
> > 
> > reuseport_lock is locked in this function, and we do not modify the prog,
> > but is rcu_dereference_protected() preferable?
> > 
> > ---8<---
> > prog = rcu_dereference_protected(reuse->prog,
> > 				 lockdep_is_held(&reuseport_lock));
> > ---8<---
> It is not only reuse->prog.  Other things also require rcu_read_lock(),
> e.g. please take a look at __htab_map_lookup_elem().
> 
> The TCP_LISTEN sk (selected by bpf to be the target of the migration)
> is also protected by rcu.

Thank you, I will use rcu_read_lock() and rcu_dereference() in v3 patchset.


> I am surprised there is no WARNING in the test.
> Do you have the needed DEBUG_LOCK* config enabled?

Yes, DEBUG_LOCK* was 'y', but rcu_dereference() without rcu_read_lock()
does not show warnings...


> > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > +			if (reuse->num_socks && !prog)
> > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > +
> > > >  			reuse->num_closed_socks++;
> > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > >  		} else {
> > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > >  out:
> > > >  	spin_unlock_bh(&reuseport_lock);
> > > > +
> > > > +	return nsk;
> > > >  }
> > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > >  
> > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > --- a/net/ipv4/inet_connection_sock.c
> > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > >  }
> > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > >  
> > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > +{
> > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > +
> > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > +
> > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > I am also not very thrilled on this double spin_lock.
> > > Can this be done in (or like) inet_csk_listen_stop() instead?
> > 
> > It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> > think it is better to do it just after reuseport_detach_sock() becuase we
> > can select a different listener (almost) every time at a lower cost by
> > selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
> > easily.
> I don't see the "lower cost" point.  Please elaborate.

In reuseport_select_sock(), we pass sk_hash of the request socket to
reciprocal_scale() and generate a random index for socks[] to select
a different listener every time.
On the other hand, we do not have request sockets in unhash path and
sk_hash of the listener is always 0, so we have to generate a random number
in another way. In reuseport_detach_sock(), we can use the index of the
moved socket, but we do not have it in inet_csk_listen_stop(), so we have
to generate a random number in inet_csk_listen_stop().
I think it is at lower cost to use the index of the moved socket.


> > sk_hash of the listener is 0, so we would have to generate a random number
> > in inet_csk_listen_stop().
> If I read it correctly, it is also passing 0 as the sk_hash to
> bpf_run_sk_reuseport() from reuseport_detach_sock().
> 
> Also, how is the sk_hash expected to be used?  I don't see
> it in the test.

I expected it should not be used in unhash path.
We do not have the request socket in unhash path and cannot pass a proper
sk_hash to bpf_run_sk_reuseport(). So, if u8 migration is
BPF_SK_REUSEPORT_MIGRATE_QUEUE, we cannot use sk_hash.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-07 20:33       ` Martin KaFai Lau
@ 2020-12-08  6:31         ` Kuniyuki Iwashima
  2020-12-08  7:34           ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-08  6:31 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Mon, 7 Dec 2020 12:33:15 -0800
> On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > which is used only by inet_unhash(). If it is not NULL,
> > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > sockets from the closing listener to the selected one.
> > > > 
> > > > Listening sockets hold incoming connections as a linked list of struct
> > > > request_sock in the accept queue, and each request has reference to a full
> > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > the requests from the closing listener's queue and relink them to the head
> > > > of the new listener's queue. We do not process each request and its
> > > > reference to the listener, so the migration completes in O(1) time
> > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > care in the next commit.
> > > > 
> > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > out a different socket every time, we select the last element of socks[] as
> > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > in socks[]. (See also [1])
> > > > 
> > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > program called in the later commit, but as the side effect of such default
> > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > for a specific case where the application replaces listeners by
> > > > generations.
> > > > 
> > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > first two by turns. The sockets move in socks[] like below.
> > > > 
> > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > >   socks[2] : C   |      socks[2] : C --'
> > > >   socks[3] : D --'
> > > > 
> > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > requests evenly to new listeners.
> > > > 
> > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > >   socks[3] : D (d) --'
> > > > 
> > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > error may happen. For instance, if only the new listeners have
> > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > face inconsistency and cause an error.
> > > > 
> > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > program described in later commits.
> > > > 
> > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > ---
> > > >  include/net/inet_connection_sock.h |  1 +
> > > >  include/net/sock_reuseport.h       |  2 +-
> > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > --- a/include/net/inet_connection_sock.h
> > > > +++ b/include/net/inet_connection_sock.h
> > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > >  				      struct request_sock *req,
> > > >  				      struct sock *child);
> > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > >  				   unsigned long timeout);
> > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > --- a/include/net/sock_reuseport.h
> > > > +++ b/include/net/sock_reuseport.h
> > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > >  			      bool bind_inany);
> > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > >  					  u32 hash,
> > > >  					  struct sk_buff *skb,
> > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > index fd133516ac0e..60d7c1f28809 100644
> > > > --- a/net/core/sock_reuseport.c
> > > > +++ b/net/core/sock_reuseport.c
> > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > >  }
> > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > >  
> > > > -void reuseport_detach_sock(struct sock *sk)
> > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > >  {
> > > >  	struct sock_reuseport *reuse;
> > > > +	struct bpf_prog *prog;
> > > > +	struct sock *nsk = NULL;
> > > >  	int i;
> > > >  
> > > >  	spin_lock_bh(&reuseport_lock);
> > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > >  
> > > >  		reuse->num_socks--;
> > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > +		prog = rcu_dereference(reuse->prog);
> > > >  
> > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > +			if (reuse->num_socks && !prog)
> > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > +
> > > >  			reuse->num_closed_socks++;
> > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > >  		} else {
> > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > >  out:
> > > >  	spin_unlock_bh(&reuseport_lock);
> > > > +
> > > > +	return nsk;
> > > >  }
> > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > >  
> > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > --- a/net/ipv4/inet_connection_sock.c
> > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > >  }
> > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > >  
> > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > +{
> > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > +
> > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > +
> > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > 
> > > Are you sure lockdep is happy with this ?
> > > 
> > > I would guess it should complain, because :
> > > 
> > > lock(A);
> > > lock(B);
> > > ...
> > > unlock(B);
> > > unlock(A);
> > > 
> > > will fail when the opposite action happens eventually
> > > 
> > > lock(B);
> > > lock(A);
> > > ...
> > > unlock(A);
> > > unlock(B);
> > 
> > I enabled lockdep and did not see warnings of lockdep.
> > 
> > Also, the inversion deadlock does not happen in this case.
> > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > from the eBPF map, so the old listener will not be selected as the new
> > listener.
> > 
> > 
> > > > +
> > > > +	if (old_accept_queue->rskq_accept_head) {
> > > > +		if (new_accept_queue->rskq_accept_head)
> > > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > > +				new_accept_queue->rskq_accept_head;
> > > > +		else
> > > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > > +
> > > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > > +		old_accept_queue->rskq_accept_head = NULL;
> > > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > > +
> > > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > > +	}
> > > > +
> > > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > > +}
> > > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > > 
> > > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> > 
> > I will squash the two or reorganize them into definition part and migration
> > part.
> > 
> > 
> > > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > > this is how we designed things (each request socket has a reference taken on the listener)
> > > 
> > > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > > 
> > > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> I also have similar concern on the inconsistency in req->rsk_listener.
> 
> The fix-up in req->rsk_listener for the TFO req in patch 4
> makes it clear that req->rsk_listener should be updated during
> the migration instead of asking a much later code path
> to accommodate this inconsistent req->rsk_listener pointer.

When I started this patchset, I read this thread and misunderstood that I
had to migrate sockets in O(1) for scalability. So, I selected the fix-up
approach and checked rsk_listener is not used except for TFO.

---8<---
Whole point of BPF was to avoid iterate through all sockets [1],
and let user space use whatever selection logic it needs.

[1] This was okay with up to 16 sockets. But with 128 it does not scale.
---&<---
https://lore.kernel.org/netdev/1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com/


However, I've read it again, and this was about iterating over listeners
to select a new listener, not about iterating over requests...
In this patchset, we can select a listener in O(1) and it is enough.


> The current inet_csk_listen_stop() is already iterating
> the icsk_accept_queue and fastopenq.  The extra cost
> in updating rsk_listener may be just noise?

Exactly.
If we end up iterating requests, it is better to migrate than close. I will
update each rsk_listener in inet_csk_reqsk_queue_migrate() in v3 patchset.
Thank you!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
  2020-12-01 15:25   ` Eric Dumazet
  2020-12-05  1:42   ` Martin KaFai Lau
@ 2020-12-08  6:54   ` Martin KaFai Lau
  2020-12-08  7:42     ` Kuniyuki Iwashima
  2 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-08  6:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, netdev,
	linux-kernel

On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:

> @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
>  
>  		reuse->num_socks--;
>  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> +		prog = rcu_dereference(reuse->prog);
>  
>  		if (sk->sk_protocol == IPPROTO_TCP) {
> +			if (reuse->num_socks && !prog)
> +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
I asked in the earlier thread if the primary use case is to only
use the bpf prog to pick.  That thread did not come to
a solid answer but did conclude that the sysctl should not
control the behavior of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE prog.

From this change here, it seems it is still desired to only depend
on the kernel to random pick even when no bpf prog is attached.
If that is the case, a sysctl to guard here for not changing
the current behavior makes sense.
It should still only control the non-bpf-pick behavior:
when the sysctl is on, the kernel will still do a random pick
when there is no bpf prog attached to the reuseport group.
Thoughts?

> +
>  			reuse->num_closed_socks++;
>  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
>  		} else {
> @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
>  		call_rcu(&reuse->rcu, reuseport_free_rcu);
>  out:
>  	spin_unlock_bh(&reuseport_lock);
> +
> +	return nsk;
>  }
>  EXPORT_SYMBOL(reuseport_detach_sock);


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-08  6:31         ` Kuniyuki Iwashima
@ 2020-12-08  7:34           ` Martin KaFai Lau
  2020-12-08  8:17             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-08  7:34 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, linux-kernel, netdev

On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > sockets from the closing listener to the selected one.
> > > > > 
> > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > > the requests from the closing listener's queue and relink them to the head
> > > > > of the new listener's queue. We do not process each request and its
> > > > > reference to the listener, so the migration completes in O(1) time
> > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > > care in the next commit.
> > > > > 
> > > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > > out a different socket every time, we select the last element of socks[] as
> > > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > > in socks[]. (See also [1])
> > > > > 
> > > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > > program called in the later commit, but as the side effect of such default
> > > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > > for a specific case where the application replaces listeners by
> > > > > generations.
> > > > > 
> > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > first two by turns. The sockets move in socks[] like below.
> > > > > 
> > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > >   socks[2] : C   |      socks[2] : C --'
> > > > >   socks[3] : D --'
> > > > > 
> > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > requests evenly to new listeners.
> > > > > 
> > > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > > >   socks[3] : D (d) --'
> > > > > 
> > > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > > error may happen. For instance, if only the new listeners have
> > > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > > face inconsistency and cause an error.
> > > > > 
> > > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > > program described in later commits.
> > > > > 
> > > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > ---
> > > > >  include/net/inet_connection_sock.h |  1 +
> > > > >  include/net/sock_reuseport.h       |  2 +-
> > > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > > --- a/include/net/inet_connection_sock.h
> > > > > +++ b/include/net/inet_connection_sock.h
> > > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > >  				      struct request_sock *req,
> > > > >  				      struct sock *child);
> > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > > >  				   unsigned long timeout);
> > > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > > --- a/include/net/sock_reuseport.h
> > > > > +++ b/include/net/sock_reuseport.h
> > > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > > >  			      bool bind_inany);
> > > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > > >  					  u32 hash,
> > > > >  					  struct sk_buff *skb,
> > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > --- a/net/core/sock_reuseport.c
> > > > > +++ b/net/core/sock_reuseport.c
> > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > >  }
> > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > >  
> > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > >  {
> > > > >  	struct sock_reuseport *reuse;
> > > > > +	struct bpf_prog *prog;
> > > > > +	struct sock *nsk = NULL;
> > > > >  	int i;
> > > > >  
> > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > >  
> > > > >  		reuse->num_socks--;
> > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > +		prog = rcu_dereference(reuse->prog);
> > > > >  
> > > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > > +			if (reuse->num_socks && !prog)
> > > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > > +
> > > > >  			reuse->num_closed_socks++;
> > > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > > >  		} else {
> > > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > > >  out:
> > > > >  	spin_unlock_bh(&reuseport_lock);
> > > > > +
> > > > > +	return nsk;
> > > > >  }
> > > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > > >  
> > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > >  }
> > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > >  
> > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > +{
> > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > +
> > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > +
> > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > 
> > > > Are you sure lockdep is happy with this ?
> > > > 
> > > > I would guess it should complain, because :
> > > > 
> > > > lock(A);
> > > > lock(B);
> > > > ...
> > > > unlock(B);
> > > > unlock(A);
> > > > 
> > > > will fail when the opposite action happens eventually
> > > > 
> > > > lock(B);
> > > > lock(A);
> > > > ...
> > > > unlock(A);
> > > > unlock(B);
> > > 
> > > I enabled lockdep and did not see warnings of lockdep.
> > > 
> > > Also, the inversion deadlock does not happen in this case.
> > > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > > from the eBPF map, so the old listener will not be selected as the new
> > > listener.
> > > 
> > > 
> > > > > +
> > > > > +	if (old_accept_queue->rskq_accept_head) {
> > > > > +		if (new_accept_queue->rskq_accept_head)
> > > > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > > > +				new_accept_queue->rskq_accept_head;
> > > > > +		else
> > > > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > > > +
> > > > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > > > +		old_accept_queue->rskq_accept_head = NULL;
> > > > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > > > +
> > > > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > > > +	}
> > > > > +
> > > > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > > > +}
> > > > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > > > 
> > > > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> > > 
> > > I will squash the two or reorganize them into definition part and migration
> > > part.
> > > 
> > > 
> > > > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > > > this is how we designed things (each request socket has a reference taken on the listener)
> > > > 
> > > > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > > > 
> > > > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> > I also have similar concern on the inconsistency in req->rsk_listener.
> > 
> > The fix-up in req->rsk_listener for the TFO req in patch 4
> > makes it clear that req->rsk_listener should be updated during
> > the migration instead of asking a much later code path
> > to accommodate this inconsistent req->rsk_listener pointer.
> 
> When I started this patchset, I read this thread and misunderstood that I
> had to migrate sockets in O(1) for scalability. So, I selected the fix-up
> approach and checked rsk_listener is not used except for TFO.
> 
> ---8<---
> Whole point of BPF was to avoid iterate through all sockets [1],
> and let user space use whatever selection logic it needs.
> 
> [1] This was okay with up to 16 sockets. But with 128 it does not scale.
> ---&<---
> https://lore.kernel.org/netdev/1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com/
> 
> 
> However, I've read it again, and this was about iterating over listeners
> to select a new listener, not about iterating over requests...
> In this patchset, we can select a listener in O(1) and it is enough.
> 
> 
> > The current inet_csk_listen_stop() is already iterating
> > the icsk_accept_queue and fastopenq.  The extra cost
> > in updating rsk_listener may be just noise?
> 
> Exactly.
> If we end up iterating requests, it is better to migrate than close. I will
> update each rsk_listener in inet_csk_reqsk_queue_migrate() in v3 patchset.
To be clear, I meant to do migration in inet_csk_listen_stop() instead
of doing it in the new inet_csk_reqsk_queue_migrate() which reqires a
double lock and then need to re-bring in the whole spin_lock_bh_nested
patch in the patch 3 of v2.

e.g. in the first while loop in inet_csk_listen_stop(),
if there is a target to migrate to,  it can do
something similar to inet_csk_reqsk_queue_add(target_sk, ...)
instead of doing the current inet_child_forget().
It probably needs something different from
inet_csk_reqsk_queue_add(), e.g. also update rsk_listener,
but the idea should be similar.

Since the rsk_listener has to be updated one by one, there is
really no point to do the list splicing which requires
the double lock.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-08  6:54   ` Martin KaFai Lau
@ 2020-12-08  7:42     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-08  7:42 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev, osa-contribution-log

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Mon, 7 Dec 2020 22:54:18 -0800
> On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> 
> > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> >  
> >  		reuse->num_socks--;
> >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > +		prog = rcu_dereference(reuse->prog);
> >  
> >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > +			if (reuse->num_socks && !prog)
> > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> I asked in the earlier thread if the primary use case is to only
> use the bpf prog to pick.  That thread did not come to
> a solid answer but did conclude that the sysctl should not
> control the behavior of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE prog.
> 
> From this change here, it seems it is still desired to only depend
> on the kernel to random pick even when no bpf prog is attached.

I wrote this way only to split patches into tcp and bpf parts.
So, in the 10th patch, eBPF prog is run if the type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
https://lore.kernel.org/netdev/20201201144418.35045-11-kuniyu@amazon.co.jp/

But, it makes a breakage, so I will move
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE validation into 10th patch so that the
type is only available after 10th patch.

---8<---
	case BPF_PROG_TYPE_SK_REUSEPORT:
		switch (expected_attach_type) {
		case BPF_SK_REUSEPORT_SELECT:
		case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE: <- move to 10th.
			return 0;
		default:
			return -EINVAL;
		}
---8<---


> If that is the case, a sysctl to guard here for not changing
> the current behavior makes sense.
> It should still only control the non-bpf-pick behavior:
> when the sysctl is on, the kernel will still do a random pick
> when there is no bpf prog attached to the reuseport group.
> Thoughts?

If different applications listen on the same port without eBPF prog, I
think sysctl is necessary. But honestly, I am not sure there is really such
a case and sysctl is necessary.

If patcheset with sysctl is more acceptable, I will add it back in the next
spin.


> > +
> >  			reuse->num_closed_socks++;
> >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> >  		} else {
> > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> >  out:
> >  	spin_unlock_bh(&reuseport_lock);
> > +
> > +	return nsk;
> >  }
> >  EXPORT_SYMBOL(reuseport_detach_sock);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-08  6:27         ` Kuniyuki Iwashima
@ 2020-12-08  8:13           ` Martin KaFai Lau
  2020-12-08  9:02             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-08  8:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Tue, Dec 08, 2020 at 03:27:14PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Mon, 7 Dec 2020 12:14:38 -0800
> > On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau <kafai@fb.com>
> > > Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > > > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > > > [ ... ]
> > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > --- a/net/core/sock_reuseport.c
> > > > > +++ b/net/core/sock_reuseport.c
> > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > >  }
> > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > >  
> > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > >  {
> > > > >  	struct sock_reuseport *reuse;
> > > > > +	struct bpf_prog *prog;
> > > > > +	struct sock *nsk = NULL;
> > > > >  	int i;
> > > > >  
> > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > >  
> > > > >  		reuse->num_socks--;
> > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > +		prog = rcu_dereference(reuse->prog);
> > > > Is it under rcu_read_lock() here?
> > > 
> > > reuseport_lock is locked in this function, and we do not modify the prog,
> > > but is rcu_dereference_protected() preferable?
> > > 
> > > ---8<---
> > > prog = rcu_dereference_protected(reuse->prog,
> > > 				 lockdep_is_held(&reuseport_lock));
> > > ---8<---
> > It is not only reuse->prog.  Other things also require rcu_read_lock(),
> > e.g. please take a look at __htab_map_lookup_elem().
> > 
> > The TCP_LISTEN sk (selected by bpf to be the target of the migration)
> > is also protected by rcu.
> 
> Thank you, I will use rcu_read_lock() and rcu_dereference() in v3 patchset.
> 
> 
> > I am surprised there is no WARNING in the test.
> > Do you have the needed DEBUG_LOCK* config enabled?
> 
> Yes, DEBUG_LOCK* was 'y', but rcu_dereference() without rcu_read_lock()
> does not show warnings...
I would at least expect the "WARN_ON_ONCE(!rcu_read_lock_held() ...)"
from __htab_map_lookup_elem() should fire in your test
example in the last patch.

It is better to check the config before sending v3.

[ ... ]

> > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > >  }
> > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > >  
> > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > +{
> > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > +
> > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > +
> > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > I am also not very thrilled on this double spin_lock.
> > > > Can this be done in (or like) inet_csk_listen_stop() instead?
> > > 
> > > It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> > > think it is better to do it just after reuseport_detach_sock() becuase we
> > > can select a different listener (almost) every time at a lower cost by
> > > selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
> > > easily.
> > I don't see the "lower cost" point.  Please elaborate.
> 
> In reuseport_select_sock(), we pass sk_hash of the request socket to
> reciprocal_scale() and generate a random index for socks[] to select
> a different listener every time.
> On the other hand, we do not have request sockets in unhash path and
> sk_hash of the listener is always 0, so we have to generate a random number
> in another way. In reuseport_detach_sock(), we can use the index of the
> moved socket, but we do not have it in inet_csk_listen_stop(), so we have
> to generate a random number in inet_csk_listen_stop().
> I think it is at lower cost to use the index of the moved socket.
Generate a random number is not a big deal for the migration code path.

Also, I really still failed to see a particular way that the kernel
pick will help in the migration case.  The kernel has no clue
on how to select the right process to migrate to without
a proper policy signal from the user.  They are all as bad as
a random pick.  I am not sure this migration feature is
even useful if there is no bpf prog attached to define the policy.
That said, if it is still desired to do a random pick by kernel when
there is no bpf prog, it probably makes sense to guard it in a sysctl as
suggested in another reply.  To keep it simple, I would also keep this
kernel-pick consistent instead of request socket is doing something
different from the unhash path.

> 
> 
> > > sk_hash of the listener is 0, so we would have to generate a random number
> > > in inet_csk_listen_stop().
> > If I read it correctly, it is also passing 0 as the sk_hash to
> > bpf_run_sk_reuseport() from reuseport_detach_sock().
> > 
> > Also, how is the sk_hash expected to be used?  I don't see
> > it in the test.
> 
> I expected it should not be used in unhash path.
> We do not have the request socket in unhash path and cannot pass a proper
> sk_hash to bpf_run_sk_reuseport(). So, if u8 migration is
> BPF_SK_REUSEPORT_MIGRATE_QUEUE, we cannot use sk_hash.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-08  7:34           ` Martin KaFai Lau
@ 2020-12-08  8:17             ` Kuniyuki Iwashima
  2020-12-09  3:09               ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-08  8:17 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Mon, 7 Dec 2020 23:34:41 -0800
> On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > sockets from the closing listener to the selected one.
> > > > > > 
> > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > > > the requests from the closing listener's queue and relink them to the head
> > > > > > of the new listener's queue. We do not process each request and its
> > > > > > reference to the listener, so the migration completes in O(1) time
> > > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > > > care in the next commit.
> > > > > > 
> > > > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > > > out a different socket every time, we select the last element of socks[] as
> > > > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > > > in socks[]. (See also [1])
> > > > > > 
> > > > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > > > program called in the later commit, but as the side effect of such default
> > > > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > > > for a specific case where the application replaces listeners by
> > > > > > generations.
> > > > > > 
> > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > first two by turns. The sockets move in socks[] like below.
> > > > > > 
> > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > >   socks[3] : D --'
> > > > > > 
> > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > requests evenly to new listeners.
> > > > > > 
> > > > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > > > >   socks[3] : D (d) --'
> > > > > > 
> > > > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > > > error may happen. For instance, if only the new listeners have
> > > > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > > > face inconsistency and cause an error.
> > > > > > 
> > > > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > > > program described in later commits.
> > > > > > 
> > > > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > > ---
> > > > > >  include/net/inet_connection_sock.h |  1 +
> > > > > >  include/net/sock_reuseport.h       |  2 +-
> > > > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > > > --- a/include/net/inet_connection_sock.h
> > > > > > +++ b/include/net/inet_connection_sock.h
> > > > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > >  				      struct request_sock *req,
> > > > > >  				      struct sock *child);
> > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > > > >  				   unsigned long timeout);
> > > > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > > > --- a/include/net/sock_reuseport.h
> > > > > > +++ b/include/net/sock_reuseport.h
> > > > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > > > >  			      bool bind_inany);
> > > > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > > > >  					  u32 hash,
> > > > > >  					  struct sk_buff *skb,
> > > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > > --- a/net/core/sock_reuseport.c
> > > > > > +++ b/net/core/sock_reuseport.c
> > > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > > >  
> > > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > > >  {
> > > > > >  	struct sock_reuseport *reuse;
> > > > > > +	struct bpf_prog *prog;
> > > > > > +	struct sock *nsk = NULL;
> > > > > >  	int i;
> > > > > >  
> > > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > >  
> > > > > >  		reuse->num_socks--;
> > > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > > +		prog = rcu_dereference(reuse->prog);
> > > > > >  
> > > > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > > > +			if (reuse->num_socks && !prog)
> > > > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > > > +
> > > > > >  			reuse->num_closed_socks++;
> > > > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > > > >  		} else {
> > > > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > > > >  out:
> > > > > >  	spin_unlock_bh(&reuseport_lock);
> > > > > > +
> > > > > > +	return nsk;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > > > >  
> > > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > > >  
> > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > > +{
> > > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > > +
> > > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > > +
> > > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > > 
> > > > > Are you sure lockdep is happy with this ?
> > > > > 
> > > > > I would guess it should complain, because :
> > > > > 
> > > > > lock(A);
> > > > > lock(B);
> > > > > ...
> > > > > unlock(B);
> > > > > unlock(A);
> > > > > 
> > > > > will fail when the opposite action happens eventually
> > > > > 
> > > > > lock(B);
> > > > > lock(A);
> > > > > ...
> > > > > unlock(A);
> > > > > unlock(B);
> > > > 
> > > > I enabled lockdep and did not see warnings of lockdep.
> > > > 
> > > > Also, the inversion deadlock does not happen in this case.
> > > > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > > > from the eBPF map, so the old listener will not be selected as the new
> > > > listener.
> > > > 
> > > > 
> > > > > > +
> > > > > > +	if (old_accept_queue->rskq_accept_head) {
> > > > > > +		if (new_accept_queue->rskq_accept_head)
> > > > > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > > > > +				new_accept_queue->rskq_accept_head;
> > > > > > +		else
> > > > > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > > > > +
> > > > > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > > > > +		old_accept_queue->rskq_accept_head = NULL;
> > > > > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > > > > +
> > > > > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > > > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > > > > +	}
> > > > > > +
> > > > > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > > > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > > > > 
> > > > > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> > > > 
> > > > I will squash the two or reorganize them into definition part and migration
> > > > part.
> > > > 
> > > > 
> > > > > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > > > > this is how we designed things (each request socket has a reference taken on the listener)
> > > > > 
> > > > > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > > > > 
> > > > > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> > > I also have similar concern on the inconsistency in req->rsk_listener.
> > > 
> > > The fix-up in req->rsk_listener for the TFO req in patch 4
> > > makes it clear that req->rsk_listener should be updated during
> > > the migration instead of asking a much later code path
> > > to accommodate this inconsistent req->rsk_listener pointer.
> > 
> > When I started this patchset, I read this thread and misunderstood that I
> > had to migrate sockets in O(1) for scalability. So, I selected the fix-up
> > approach and checked rsk_listener is not used except for TFO.
> > 
> > ---8<---
> > Whole point of BPF was to avoid iterate through all sockets [1],
> > and let user space use whatever selection logic it needs.
> > 
> > [1] This was okay with up to 16 sockets. But with 128 it does not scale.
> > ---&<---
> > https://lore.kernel.org/netdev/1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com/
> > 
> > 
> > However, I've read it again, and this was about iterating over listeners
> > to select a new listener, not about iterating over requests...
> > In this patchset, we can select a listener in O(1) and it is enough.
> > 
> > 
> > > The current inet_csk_listen_stop() is already iterating
> > > the icsk_accept_queue and fastopenq.  The extra cost
> > > in updating rsk_listener may be just noise?
> > 
> > Exactly.
> > If we end up iterating requests, it is better to migrate than close. I will
> > update each rsk_listener in inet_csk_reqsk_queue_migrate() in v3 patchset.
> To be clear, I meant to do migration in inet_csk_listen_stop() instead
> of doing it in the new inet_csk_reqsk_queue_migrate() which reqires a
> double lock and then need to re-bring in the whole spin_lock_bh_nested
> patch in the patch 3 of v2.
> 
> e.g. in the first while loop in inet_csk_listen_stop(),
> if there is a target to migrate to,  it can do
> something similar to inet_csk_reqsk_queue_add(target_sk, ...)
> instead of doing the current inet_child_forget().
> It probably needs something different from
> inet_csk_reqsk_queue_add(), e.g. also update rsk_listener,
> but the idea should be similar.
> 
> Since the rsk_listener has to be updated one by one, there is
> really no point to do the list splicing which requires
> the double lock.

I think it is a bit complex to pass the new listener from
reuseport_detach_sock() to inet_csk_listen_stop().

__tcp_close/tcp_disconnect/tcp_abort
 |-tcp_set_state
 |  |-unhash
 |     |-reuseport_detach_sock (return nsk)
 |-inet_csk_listen_stop

If we splice requests like this, we do not need double lock?

  1. lock the accept queue of the old listener
  2. unlink all requests and decrement refcount
  3. unlock
  4. update all requests with new listener
  5. lock the accept queue of the new listener
  6. splice requests and increment refcount
  7. unlock

Also, I think splicing is better to keep the order of requests. Adding one
by one reverses it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-08  8:13           ` Martin KaFai Lau
@ 2020-12-08  9:02             ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-08  9:02 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Tue, 8 Dec 2020 00:13:28 -0800
> On Tue, Dec 08, 2020 at 03:27:14PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Mon, 7 Dec 2020 12:14:38 -0800
> > > On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> > > > From:   Martin KaFai Lau <kafai@fb.com>
> > > > Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > > > > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > > > > [ ... ]
> > > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > > --- a/net/core/sock_reuseport.c
> > > > > > +++ b/net/core/sock_reuseport.c
> > > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > > >  
> > > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > > >  {
> > > > > >  	struct sock_reuseport *reuse;
> > > > > > +	struct bpf_prog *prog;
> > > > > > +	struct sock *nsk = NULL;
> > > > > >  	int i;
> > > > > >  
> > > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > >  
> > > > > >  		reuse->num_socks--;
> > > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > > +		prog = rcu_dereference(reuse->prog);
> > > > > Is it under rcu_read_lock() here?
> > > > 
> > > > reuseport_lock is locked in this function, and we do not modify the prog,
> > > > but is rcu_dereference_protected() preferable?
> > > > 
> > > > ---8<---
> > > > prog = rcu_dereference_protected(reuse->prog,
> > > > 				 lockdep_is_held(&reuseport_lock));
> > > > ---8<---
> > > It is not only reuse->prog.  Other things also require rcu_read_lock(),
> > > e.g. please take a look at __htab_map_lookup_elem().
> > > 
> > > The TCP_LISTEN sk (selected by bpf to be the target of the migration)
> > > is also protected by rcu.
> > 
> > Thank you, I will use rcu_read_lock() and rcu_dereference() in v3 patchset.
> > 
> > 
> > > I am surprised there is no WARNING in the test.
> > > Do you have the needed DEBUG_LOCK* config enabled?
> > 
> > Yes, DEBUG_LOCK* was 'y', but rcu_dereference() without rcu_read_lock()
> > does not show warnings...
> I would at least expect the "WARN_ON_ONCE(!rcu_read_lock_held() ...)"
> from __htab_map_lookup_elem() should fire in your test
> example in the last patch.
> 
> It is better to check the config before sending v3.

It seems ok, but I will check it again.

---8<---
[ec2-user@ip-10-0-0-124 bpf-next]$ cat .config | grep DEBUG_LOCK
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
---8<---


> > > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > > >  
> > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > > +{
> > > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > > +
> > > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > > +
> > > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > > I am also not very thrilled on this double spin_lock.
> > > > > Can this be done in (or like) inet_csk_listen_stop() instead?
> > > > 
> > > > It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> > > > think it is better to do it just after reuseport_detach_sock() becuase we
> > > > can select a different listener (almost) every time at a lower cost by
> > > > selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
> > > > easily.
> > > I don't see the "lower cost" point.  Please elaborate.
> > 
> > In reuseport_select_sock(), we pass sk_hash of the request socket to
> > reciprocal_scale() and generate a random index for socks[] to select
> > a different listener every time.
> > On the other hand, we do not have request sockets in unhash path and
> > sk_hash of the listener is always 0, so we have to generate a random number
> > in another way. In reuseport_detach_sock(), we can use the index of the
> > moved socket, but we do not have it in inet_csk_listen_stop(), so we have
> > to generate a random number in inet_csk_listen_stop().
> > I think it is at lower cost to use the index of the moved socket.
> Generate a random number is not a big deal for the migration code path.
> 
> Also, I really still failed to see a particular way that the kernel
> pick will help in the migration case.  The kernel has no clue
> on how to select the right process to migrate to without
> a proper policy signal from the user.  They are all as bad as
> a random pick.  I am not sure this migration feature is
> even useful if there is no bpf prog attached to define the policy.

I think most applications start new listeners before closing listeners, in
this case, selecting the moved socket as the new listener works well.


> That said, if it is still desired to do a random pick by kernel when
> there is no bpf prog, it probably makes sense to guard it in a sysctl as
> suggested in another reply.  To keep it simple, I would also keep this
> kernel-pick consistent instead of request socket is doing something
> different from the unhash path.

Then, is this way better to keep kernel-pick consistent?

  1. call reuseport_select_migrated_sock() without sk_hash from any path
  2. generate a random number in reuseport_select_migrated_sock()
  3. pass it to __reuseport_select_sock() only for select-by-hash
  (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
  5. do migration per queue in inet_csk_listen_stop() or per request in
     receive path.

I understand it is beautiful to keep consistensy, but also think
the kernel-pick with heuristic performs better than random-pick.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-08  8:17             ` Kuniyuki Iwashima
@ 2020-12-09  3:09               ` Martin KaFai Lau
  2020-12-09  8:05                 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-09  3:09 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, linux-kernel, netdev

On Tue, Dec 08, 2020 at 05:17:48PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Mon, 7 Dec 2020 23:34:41 -0800
> > On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau <kafai@fb.com>
> > > Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > > > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > > > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > > > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > > sockets from the closing listener to the selected one.
> > > > > > > 
> > > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > > > > the requests from the closing listener's queue and relink them to the head
> > > > > > > of the new listener's queue. We do not process each request and its
> > > > > > > reference to the listener, so the migration completes in O(1) time
> > > > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > > > > care in the next commit.
> > > > > > > 
> > > > > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > > > > out a different socket every time, we select the last element of socks[] as
> > > > > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > > > > in socks[]. (See also [1])
> > > > > > > 
> > > > > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > > > > program called in the later commit, but as the side effect of such default
> > > > > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > > > > for a specific case where the application replaces listeners by
> > > > > > > generations.
> > > > > > > 
> > > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > > first two by turns. The sockets move in socks[] like below.
> > > > > > > 
> > > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > > >   socks[3] : D --'
> > > > > > > 
> > > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > > requests evenly to new listeners.
> > > > > > > 
> > > > > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > > > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > > > > >   socks[3] : D (d) --'
> > > > > > > 
> > > > > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > > > > error may happen. For instance, if only the new listeners have
> > > > > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > > > > face inconsistency and cause an error.
> > > > > > > 
> > > > > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > > > > program described in later commits.
> > > > > > > 
> > > > > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > > > ---
> > > > > > >  include/net/inet_connection_sock.h |  1 +
> > > > > > >  include/net/sock_reuseport.h       |  2 +-
> > > > > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > > > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > > > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > > > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > > > > --- a/include/net/inet_connection_sock.h
> > > > > > > +++ b/include/net/inet_connection_sock.h
> > > > > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > > > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > > >  				      struct request_sock *req,
> > > > > > >  				      struct sock *child);
> > > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > > > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > > > > >  				   unsigned long timeout);
> > > > > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > > > > --- a/include/net/sock_reuseport.h
> > > > > > > +++ b/include/net/sock_reuseport.h
> > > > > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > > > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > > > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > > > > >  			      bool bind_inany);
> > > > > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > > > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > > > > >  					  u32 hash,
> > > > > > >  					  struct sk_buff *skb,
> > > > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > > > --- a/net/core/sock_reuseport.c
> > > > > > > +++ b/net/core/sock_reuseport.c
> > > > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > > > >  
> > > > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > > > >  {
> > > > > > >  	struct sock_reuseport *reuse;
> > > > > > > +	struct bpf_prog *prog;
> > > > > > > +	struct sock *nsk = NULL;
> > > > > > >  	int i;
> > > > > > >  
> > > > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > > >  
> > > > > > >  		reuse->num_socks--;
> > > > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > > > +		prog = rcu_dereference(reuse->prog);
> > > > > > >  
> > > > > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > > > > +			if (reuse->num_socks && !prog)
> > > > > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > > > > +
> > > > > > >  			reuse->num_closed_socks++;
> > > > > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > > > > >  		} else {
> > > > > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > > > > >  out:
> > > > > > >  	spin_unlock_bh(&reuseport_lock);
> > > > > > > +
> > > > > > > +	return nsk;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > > > > >  
> > > > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > > > >  
> > > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > > > +{
> > > > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > > > +
> > > > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > > > +
> > > > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > > > 
> > > > > > Are you sure lockdep is happy with this ?
> > > > > > 
> > > > > > I would guess it should complain, because :
> > > > > > 
> > > > > > lock(A);
> > > > > > lock(B);
> > > > > > ...
> > > > > > unlock(B);
> > > > > > unlock(A);
> > > > > > 
> > > > > > will fail when the opposite action happens eventually
> > > > > > 
> > > > > > lock(B);
> > > > > > lock(A);
> > > > > > ...
> > > > > > unlock(A);
> > > > > > unlock(B);
> > > > > 
> > > > > I enabled lockdep and did not see warnings of lockdep.
> > > > > 
> > > > > Also, the inversion deadlock does not happen in this case.
> > > > > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > > > > from the eBPF map, so the old listener will not be selected as the new
> > > > > listener.
> > > > > 
> > > > > 
> > > > > > > +
> > > > > > > +	if (old_accept_queue->rskq_accept_head) {
> > > > > > > +		if (new_accept_queue->rskq_accept_head)
> > > > > > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > > > > > +				new_accept_queue->rskq_accept_head;
> > > > > > > +		else
> > > > > > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > > > > > +
> > > > > > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > > > > > +		old_accept_queue->rskq_accept_head = NULL;
> > > > > > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > > > > > +
> > > > > > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > > > > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > > > > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > > > > > 
> > > > > > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> > > > > 
> > > > > I will squash the two or reorganize them into definition part and migration
> > > > > part.
> > > > > 
> > > > > 
> > > > > > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > > > > > this is how we designed things (each request socket has a reference taken on the listener)
> > > > > > 
> > > > > > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > > > > > 
> > > > > > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> > > > I also have similar concern on the inconsistency in req->rsk_listener.
> > > > 
> > > > The fix-up in req->rsk_listener for the TFO req in patch 4
> > > > makes it clear that req->rsk_listener should be updated during
> > > > the migration instead of asking a much later code path
> > > > to accommodate this inconsistent req->rsk_listener pointer.
> > > 
> > > When I started this patchset, I read this thread and misunderstood that I
> > > had to migrate sockets in O(1) for scalability. So, I selected the fix-up
> > > approach and checked rsk_listener is not used except for TFO.
> > > 
> > > ---8<---
> > > Whole point of BPF was to avoid iterate through all sockets [1],
> > > and let user space use whatever selection logic it needs.
> > > 
> > > [1] This was okay with up to 16 sockets. But with 128 it does not scale.
> > > ---&<---
> > > https://lore.kernel.org/netdev/1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com/
> > > 
> > > 
> > > However, I've read it again, and this was about iterating over listeners
> > > to select a new listener, not about iterating over requests...
> > > In this patchset, we can select a listener in O(1) and it is enough.
> > > 
> > > 
> > > > The current inet_csk_listen_stop() is already iterating
> > > > the icsk_accept_queue and fastopenq.  The extra cost
> > > > in updating rsk_listener may be just noise?
> > > 
> > > Exactly.
> > > If we end up iterating requests, it is better to migrate than close. I will
> > > update each rsk_listener in inet_csk_reqsk_queue_migrate() in v3 patchset.
> > To be clear, I meant to do migration in inet_csk_listen_stop() instead
> > of doing it in the new inet_csk_reqsk_queue_migrate() which reqires a
> > double lock and then need to re-bring in the whole spin_lock_bh_nested
> > patch in the patch 3 of v2.
> > 
> > e.g. in the first while loop in inet_csk_listen_stop(),
> > if there is a target to migrate to,  it can do
> > something similar to inet_csk_reqsk_queue_add(target_sk, ...)
> > instead of doing the current inet_child_forget().
> > It probably needs something different from
> > inet_csk_reqsk_queue_add(), e.g. also update rsk_listener,
> > but the idea should be similar.
> > 
> > Since the rsk_listener has to be updated one by one, there is
> > really no point to do the list splicing which requires
> > the double lock.
> 
> I think it is a bit complex to pass the new listener from
> reuseport_detach_sock() to inet_csk_listen_stop().
> 
> __tcp_close/tcp_disconnect/tcp_abort
>  |-tcp_set_state
>  |  |-unhash
>  |     |-reuseport_detach_sock (return nsk)
>  |-inet_csk_listen_stop
Picking the new listener does not have to be done in
reuseport_detach_sock().

IIUC, it is done there only because it prefers to pick
the last sk from socks[] when bpf prog is not attached.
This seems to get into the way of exploring other potential
implementation options.

Merging the discussion on the last socks[] pick from another thread:
>
> I think most applications start new listeners before closing listeners, in
> this case, selecting the moved socket as the new listener works well.
>
>
> > That said, if it is still desired to do a random pick by kernel when
> > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > suggested in another reply.  To keep it simple, I would also keep this
> > kernel-pick consistent instead of request socket is doing something
> > different from the unhash path.
>
> Then, is this way better to keep kernel-pick consistent?
>
>   1. call reuseport_select_migrated_sock() without sk_hash from any path
>   2. generate a random number in reuseport_select_migrated_sock()
>   3. pass it to __reuseport_select_sock() only for select-by-hash
>   (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
>   5. do migration per queue in inet_csk_listen_stop() or per request in
>      receive path.
>
> I understand it is beautiful to keep consistensy, but also think
> the kernel-pick with heuristic performs better than random-pick.
I think discussing the best kernel pick without explicit user input
is going to be a dead end. There is always a case that
makes this heuristic (or guess) fail.  e.g. what if multiple
sk(s) being closed are always the last one in the socks[]?
all their child sk(s) will then be piled up at one listen sk
because the last socks[] is always picked?

Lets assume the last socks[] is indeed the best for all cases.  Then why
the in-progress req don't pick it this way?  I feel the implementation
is doing what is convenient at that point.  And that is fine, I think
for kernel-pick, it should just go for simplicity and stay with
the random(/hash) pick instead of pretending the kernel knows the
application must operate in a certain way.  It is fine
that the pick was wrong, the kernel will eventually move the
childs/reqs to the survived listen sk.
[ I still think the kernel should not even pick if
  there is no bpf prog to instruct how to pick
  but I am fine as long as there is a sysctl to
  guard this. ]

I would rather focus on ensuring the bpf prog getting what it
needs to make the migration pick.  A few things
I would like to discuss and explore:

> 
> If we splice requests like this, we do not need double lock?
> 
>   1. lock the accept queue of the old listener
>   2. unlink all requests and decrement refcount
>   3. unlock
>   4. update all requests with new listener
I guess updating rsk_listener can be done without acquiring
the lock in (5) below is because it is done under the
listening_hash's bucket lock (and also the global reuseport_lock) so
that the new listener will stay in TCP_LISTEN state?

I am not sure iterating the queue under these
locks is a very good thing to do though.  The queue may not be
very long in usual setup but still let see
if that can be avoided.

Do you think the iteration can be done without holding
bucket lock and the global reuseport_lock?  inet_csk_reqsk_queue_add()
is taking the rskq_lock and then check for TCP_LISTEN.  May be
something similar can be done also?

While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
the bpf prog can pick per req and have the sk_hash.
However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
the bpf prog currently does not have a chance to
pick individually for each req/child on the queue.
Since it is iterating the queue anyway, does it make
sense to also call the bpf to pick for each req/child
in the queue?  It then can pass sk_hash (from child->sk_hash?)
to the bpf prog also instead of current 0.  The cost of calling
bpf prog is not really that much / signficant at the
migration code path.  If the queue is somehow
unusally long, there is already an existing
cond_resched() in inet_csk_listen_stop().

Then, instead of adding sk_reuseport_md->migration,
it can then add sk_reuseport_md->migrate_sk.
"migrate_sk = req" for in-progress req and "migrate_sk = child"
for iterating acceptq.  The bpf_prog can then tell what sk (req or child)
it is migrating by reading migrate_sk->state.  It can then also
learn the 4 tuples src/dst ip/port while skb is missing.
The sk_reuseport_md->sk can still point to the closed sk
such that the bpf prog can learn the cookie.

I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
by doing the above.  It also gives a more consistent
interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.

>   5. lock the accept queue of the new listener
>   6. splice requests and increment refcount
>   7. unlock
> 
> Also, I think splicing is better to keep the order of requests. Adding one
> by one reverses it.
It can keep the order but I think it is orthogonal here.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-09  3:09               ` Martin KaFai Lau
@ 2020-12-09  8:05                 ` Kuniyuki Iwashima
  2020-12-09 16:57                   ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-09  8:05 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Tue, 8 Dec 2020 19:09:03 -0800
> On Tue, Dec 08, 2020 at 05:17:48PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Mon, 7 Dec 2020 23:34:41 -0800
> > > On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> > > > From:   Martin KaFai Lau <kafai@fb.com>
> > > > Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > > > > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > > > > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > > > sockets from the closing listener to the selected one.
> > > > > > > > 
> > > > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > > > > > the requests from the closing listener's queue and relink them to the head
> > > > > > > > of the new listener's queue. We do not process each request and its
> > > > > > > > reference to the listener, so the migration completes in O(1) time
> > > > > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > > > > > care in the next commit.
> > > > > > > > 
> > > > > > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > > > > > out a different socket every time, we select the last element of socks[] as
> > > > > > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > > > > > in socks[]. (See also [1])
> > > > > > > > 
> > > > > > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > > > > > program called in the later commit, but as the side effect of such default
> > > > > > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > > > > > for a specific case where the application replaces listeners by
> > > > > > > > generations.
> > > > > > > > 
> > > > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > > > first two by turns. The sockets move in socks[] like below.
> > > > > > > > 
> > > > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > > > >   socks[3] : D --'
> > > > > > > > 
> > > > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > > > requests evenly to new listeners.
> > > > > > > > 
> > > > > > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > > > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > > > > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > > > > > >   socks[3] : D (d) --'
> > > > > > > > 
> > > > > > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > > > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > > > > > error may happen. For instance, if only the new listeners have
> > > > > > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > > > > > face inconsistency and cause an error.
> > > > > > > > 
> > > > > > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > > > > > program described in later commits.
> > > > > > > > 
> > > > > > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > > > > ---
> > > > > > > >  include/net/inet_connection_sock.h |  1 +
> > > > > > > >  include/net/sock_reuseport.h       |  2 +-
> > > > > > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > > > > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > > > > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > > > > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > > > > > --- a/include/net/inet_connection_sock.h
> > > > > > > > +++ b/include/net/inet_connection_sock.h
> > > > > > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > > > > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > > > >  				      struct request_sock *req,
> > > > > > > >  				      struct sock *child);
> > > > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > > > > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > > > > > >  				   unsigned long timeout);
> > > > > > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > > > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > > > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > > > > > --- a/include/net/sock_reuseport.h
> > > > > > > > +++ b/include/net/sock_reuseport.h
> > > > > > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > > > > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > > > > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > > > > > >  			      bool bind_inany);
> > > > > > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > > > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > > > > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > > > > > >  					  u32 hash,
> > > > > > > >  					  struct sk_buff *skb,
> > > > > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > > > > --- a/net/core/sock_reuseport.c
> > > > > > > > +++ b/net/core/sock_reuseport.c
> > > > > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > > > > >  
> > > > > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > > > > >  {
> > > > > > > >  	struct sock_reuseport *reuse;
> > > > > > > > +	struct bpf_prog *prog;
> > > > > > > > +	struct sock *nsk = NULL;
> > > > > > > >  	int i;
> > > > > > > >  
> > > > > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > > > >  
> > > > > > > >  		reuse->num_socks--;
> > > > > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > > > > +		prog = rcu_dereference(reuse->prog);
> > > > > > > >  
> > > > > > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > > > > > +			if (reuse->num_socks && !prog)
> > > > > > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > > > > > +
> > > > > > > >  			reuse->num_closed_socks++;
> > > > > > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > > > > > >  		} else {
> > > > > > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > > > > > >  out:
> > > > > > > >  	spin_unlock_bh(&reuseport_lock);
> > > > > > > > +
> > > > > > > > +	return nsk;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > > > > > >  
> > > > > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > > > > >  
> > > > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > > > > +{
> > > > > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > > > > +
> > > > > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > > > > +
> > > > > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > > > > 
> > > > > > > Are you sure lockdep is happy with this ?
> > > > > > > 
> > > > > > > I would guess it should complain, because :
> > > > > > > 
> > > > > > > lock(A);
> > > > > > > lock(B);
> > > > > > > ...
> > > > > > > unlock(B);
> > > > > > > unlock(A);
> > > > > > > 
> > > > > > > will fail when the opposite action happens eventually
> > > > > > > 
> > > > > > > lock(B);
> > > > > > > lock(A);
> > > > > > > ...
> > > > > > > unlock(A);
> > > > > > > unlock(B);
> > > > > > 
> > > > > > I enabled lockdep and did not see warnings of lockdep.
> > > > > > 
> > > > > > Also, the inversion deadlock does not happen in this case.
> > > > > > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > > > > > from the eBPF map, so the old listener will not be selected as the new
> > > > > > listener.
> > > > > > 
> > > > > > 
> > > > > > > > +
> > > > > > > > +	if (old_accept_queue->rskq_accept_head) {
> > > > > > > > +		if (new_accept_queue->rskq_accept_head)
> > > > > > > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > > > > > > +				new_accept_queue->rskq_accept_head;
> > > > > > > > +		else
> > > > > > > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > > > > > > +
> > > > > > > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > > > > > > +		old_accept_queue->rskq_accept_head = NULL;
> > > > > > > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > > > > > > +
> > > > > > > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > > > > > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > > > > > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > > > > > > 
> > > > > > > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> > > > > > 
> > > > > > I will squash the two or reorganize them into definition part and migration
> > > > > > part.
> > > > > > 
> > > > > > 
> > > > > > > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > > > > > > this is how we designed things (each request socket has a reference taken on the listener)
> > > > > > > 
> > > > > > > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > > > > > > 
> > > > > > > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> > > > > I also have similar concern on the inconsistency in req->rsk_listener.
> > > > > 
> > > > > The fix-up in req->rsk_listener for the TFO req in patch 4
> > > > > makes it clear that req->rsk_listener should be updated during
> > > > > the migration instead of asking a much later code path
> > > > > to accommodate this inconsistent req->rsk_listener pointer.
> > > > 
> > > > When I started this patchset, I read this thread and misunderstood that I
> > > > had to migrate sockets in O(1) for scalability. So, I selected the fix-up
> > > > approach and checked rsk_listener is not used except for TFO.
> > > > 
> > > > ---8<---
> > > > Whole point of BPF was to avoid iterate through all sockets [1],
> > > > and let user space use whatever selection logic it needs.
> > > > 
> > > > [1] This was okay with up to 16 sockets. But with 128 it does not scale.
> > > > ---&<---
> > > > https://lore.kernel.org/netdev/1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > 
> > > > 
> > > > However, I've read it again, and this was about iterating over listeners
> > > > to select a new listener, not about iterating over requests...
> > > > In this patchset, we can select a listener in O(1) and it is enough.
> > > > 
> > > > 
> > > > > The current inet_csk_listen_stop() is already iterating
> > > > > the icsk_accept_queue and fastopenq.  The extra cost
> > > > > in updating rsk_listener may be just noise?
> > > > 
> > > > Exactly.
> > > > If we end up iterating requests, it is better to migrate than close. I will
> > > > update each rsk_listener in inet_csk_reqsk_queue_migrate() in v3 patchset.
> > > To be clear, I meant to do migration in inet_csk_listen_stop() instead
> > > of doing it in the new inet_csk_reqsk_queue_migrate() which reqires a
> > > double lock and then need to re-bring in the whole spin_lock_bh_nested
> > > patch in the patch 3 of v2.
> > > 
> > > e.g. in the first while loop in inet_csk_listen_stop(),
> > > if there is a target to migrate to,  it can do
> > > something similar to inet_csk_reqsk_queue_add(target_sk, ...)
> > > instead of doing the current inet_child_forget().
> > > It probably needs something different from
> > > inet_csk_reqsk_queue_add(), e.g. also update rsk_listener,
> > > but the idea should be similar.
> > > 
> > > Since the rsk_listener has to be updated one by one, there is
> > > really no point to do the list splicing which requires
> > > the double lock.
> > 
> > I think it is a bit complex to pass the new listener from
> > reuseport_detach_sock() to inet_csk_listen_stop().
> > 
> > __tcp_close/tcp_disconnect/tcp_abort
> >  |-tcp_set_state
> >  |  |-unhash
> >  |     |-reuseport_detach_sock (return nsk)
> >  |-inet_csk_listen_stop
> Picking the new listener does not have to be done in
> reuseport_detach_sock().
> 
> IIUC, it is done there only because it prefers to pick
> the last sk from socks[] when bpf prog is not attached.
> This seems to get into the way of exploring other potential
> implementation options.

Yes.
This is just idea, but we can reserve the last index of socks[] to hold the
last 'moved' socket in reuseport_detach_sock() and use it in
inet_csk_listen_stop().


> Merging the discussion on the last socks[] pick from another thread:
> >
> > I think most applications start new listeners before closing listeners, in
> > this case, selecting the moved socket as the new listener works well.
> >
> >
> > > That said, if it is still desired to do a random pick by kernel when
> > > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > > suggested in another reply.  To keep it simple, I would also keep this
> > > kernel-pick consistent instead of request socket is doing something
> > > different from the unhash path.
> >
> > Then, is this way better to keep kernel-pick consistent?
> >
> >   1. call reuseport_select_migrated_sock() without sk_hash from any path
> >   2. generate a random number in reuseport_select_migrated_sock()
> >   3. pass it to __reuseport_select_sock() only for select-by-hash
> >   (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> >   5. do migration per queue in inet_csk_listen_stop() or per request in
> >      receive path.
> >
> > I understand it is beautiful to keep consistensy, but also think
> > the kernel-pick with heuristic performs better than random-pick.
> I think discussing the best kernel pick without explicit user input
> is going to be a dead end. There is always a case that
> makes this heuristic (or guess) fail.  e.g. what if multiple
> sk(s) being closed are always the last one in the socks[]?
> all their child sk(s) will then be piled up at one listen sk
> because the last socks[] is always picked?

There can be such a case, but it means the newly listened sockets are
closed earlier than old ones.


> Lets assume the last socks[] is indeed the best for all cases.  Then why
> the in-progress req don't pick it this way?  I feel the implementation
> is doing what is convenient at that point.  And that is fine, I think

In this patchset, I originally assumed four things:

  migration should be done
    (i)   from old to new
    (ii)  to redistribute requests evenly as possible
    (iii) to keep the order of requests in the queue
          (resulting in splicing queues)
    (iv)  in O(1) for scalability
          (resulting in fix-up rsk_listener approach)

I selected the last socket in unhash path to satisfy above four because the
last socket changes at every close() syscall if application closes from
older socket.

But in receiving ACK or retransmitting SYN+ACK, we cannot get the last
'moved' socket. Even if we reserve the last 'moved' socket in the last
index by the idea above, we cannot sure the last socket is changed after
close() for each req->listener. For example, we have listeners A, B, C, and
D, and then call close(A) and close(B), and receive the final ACKs for A
and B, then both of them are assigned to C. In this case, A for D and B for
C is desired. So, selecting the last socket in socks[] for incoming
requests cannnot realize (ii).

This is why I selected the last moved socket in unhash path and a random
listener in receive path.


> for kernel-pick, it should just go for simplicity and stay with
> the random(/hash) pick instead of pretending the kernel knows the
> application must operate in a certain way.  It is fine
> that the pick was wrong, the kernel will eventually move the
> childs/reqs to the survived listen sk.

Exactly. Also the heuristic way is not fair for every application.

After reading below idea (migrated_sk), I think random-pick is better
at simplicity and passing each sk.


> [ I still think the kernel should not even pick if
>   there is no bpf prog to instruct how to pick
>   but I am fine as long as there is a sysctl to
>   guard this. ]

Unless different applications listen on the same port, random-pick can save
connections which would be aborted. So, I would add a sysctl to do
migration when no eBPF prog is attached.


> I would rather focus on ensuring the bpf prog getting what it
> needs to make the migration pick.  A few things
> I would like to discuss and explore:
> > If we splice requests like this, we do not need double lock?
> > 
> >   1. lock the accept queue of the old listener
> >   2. unlink all requests and decrement refcount
> >   3. unlock
> >   4. update all requests with new listener
> I guess updating rsk_listener can be done without acquiring
> the lock in (5) below is because it is done under the
> listening_hash's bucket lock (and also the global reuseport_lock) so
> that the new listener will stay in TCP_LISTEN state?

If we do migration in inet_unhash(), the lock is held, but it is not held
in inet_csk_listen_stop().


> I am not sure iterating the queue under these
> locks is a very good thing to do though.  The queue may not be
> very long in usual setup but still let see
> if that can be avoided.

I agree, lock should not be held long.


> Do you think the iteration can be done without holding
> bucket lock and the global reuseport_lock?  inet_csk_reqsk_queue_add()
> is taking the rskq_lock and then check for TCP_LISTEN.  May be
> something similar can be done also?

I think either one is necessary at least, so if the sk_state of selected
listener is TCP_CLOSE (this is mostly by random-pick of kernel), then we
have to fall back to call inet_child_forget().


> While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
> the bpf prog can pick per req and have the sk_hash.
> However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
> the bpf prog currently does not have a chance to
> pick individually for each req/child on the queue.
> Since it is iterating the queue anyway, does it make
> sense to also call the bpf to pick for each req/child
> in the queue?  It then can pass sk_hash (from child->sk_hash?)
> to the bpf prog also instead of current 0.  The cost of calling
> bpf prog is not really that much / signficant at the
> migration code path.  If the queue is somehow
> unusally long, there is already an existing
> cond_resched() in inet_csk_listen_stop().
> 
> Then, instead of adding sk_reuseport_md->migration,
> it can then add sk_reuseport_md->migrate_sk.
> "migrate_sk = req" for in-progress req and "migrate_sk = child"
> for iterating acceptq.  The bpf_prog can then tell what sk (req or child)
> it is migrating by reading migrate_sk->state.  It can then also
> learn the 4 tuples src/dst ip/port while skb is missing.
> The sk_reuseport_md->sk can still point to the closed sk
> such that the bpf prog can learn the cookie.
> 
> I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
> and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
> by doing the above.  It also gives a more consistent
> interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.

I think this is really nice idea. Also, I tried to implement random-pick
one by one in inet_csk_listen_stop() yesterday, I found a concern about how
to handle requests in TFO queue.

The request can be already accepted, so passing it to eBPF prog is
confusing? But, redistributing randomly can affect all listeners
unnecessary. How should we handle such requests?


> >   5. lock the accept queue of the new listener
> >   6. splice requests and increment refcount
> >   7. unlock
> > 
> > Also, I think splicing is better to keep the order of requests. Adding one
> > by one reverses it.
> It can keep the order but I think it is orthogonal here.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-09  8:05                 ` Kuniyuki Iwashima
@ 2020-12-09 16:57                   ` Kuniyuki Iwashima
  2020-12-10  1:53                     ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-09 16:57 UTC (permalink / raw)
  To: kafai
  Cc: kuniyu, ast, benh, bpf, daniel, davem, edumazet, eric.dumazet,
	kuba, kuni1840, linux-kernel, netdev

From:   Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed, 9 Dec 2020 17:05:09 +0900
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Tue, 8 Dec 2020 19:09:03 -0800
> > On Tue, Dec 08, 2020 at 05:17:48PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau <kafai@fb.com>
> > > Date:   Mon, 7 Dec 2020 23:34:41 -0800
> > > > On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> > > > > From:   Martin KaFai Lau <kafai@fb.com>
> > > > > Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > > > > > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > > > > > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > > > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > > > > sockets from the closing listener to the selected one.
> > > > > > > > > 
> > > > > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > > > > > > > the requests from the closing listener's queue and relink them to the head
> > > > > > > > > of the new listener's queue. We do not process each request and its
> > > > > > > > > reference to the listener, so the migration completes in O(1) time
> > > > > > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > > > > > > > care in the next commit.
> > > > > > > > > 
> > > > > > > > > By default, the kernel selects a new listener randomly. In order to pick
> > > > > > > > > out a different socket every time, we select the last element of socks[] as
> > > > > > > > > the new listener. This behaviour is based on how the kernel moves sockets
> > > > > > > > > in socks[]. (See also [1])
> > > > > > > > > 
> > > > > > > > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > > > > > > > program called in the later commit, but as the side effect of such default
> > > > > > > > > selection, the kernel can redistribute old requests evenly to new listeners
> > > > > > > > > for a specific case where the application replaces listeners by
> > > > > > > > > generations.
> > > > > > > > > 
> > > > > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > > > > first two by turns. The sockets move in socks[] like below.
> > > > > > > > > 
> > > > > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > > > > >   socks[3] : D --'
> > > > > > > > > 
> > > > > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > > > > requests evenly to new listeners.
> > > > > > > > > 
> > > > > > > > >   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
> > > > > > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > > > > > > > >   socks[2] : C (c)   |      socks[2] : C (c) --'
> > > > > > > > >   socks[3] : D (d) --'
> > > > > > > > > 
> > > > > > > > > Here, (A, D) or (B, C) can have different application settings, but they
> > > > > > > > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > > > > > > > error may happen. For instance, if only the new listeners have
> > > > > > > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > > > > > > > face inconsistency and cause an error.
> > > > > > > > > 
> > > > > > > > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > > > > > > > program described in later commits.
> > > > > > > > > 
> > > > > > > > > Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
> > > > > > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > > > > > ---
> > > > > > > > >  include/net/inet_connection_sock.h |  1 +
> > > > > > > > >  include/net/sock_reuseport.h       |  2 +-
> > > > > > > > >  net/core/sock_reuseport.c          | 10 +++++++++-
> > > > > > > > >  net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
> > > > > > > > >  net/ipv4/inet_hashtables.c         |  9 +++++++--
> > > > > > > > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > > > > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > > > > > > > --- a/include/net/inet_connection_sock.h
> > > > > > > > > +++ b/include/net/inet_connection_sock.h
> > > > > > > > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
> > > > > > > > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > > > > >  				      struct request_sock *req,
> > > > > > > > >  				      struct sock *child);
> > > > > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > > > > > > > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > > > > > > > >  				   unsigned long timeout);
> > > > > > > > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > > > > > > > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > > > > > > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > > > > > > > --- a/include/net/sock_reuseport.h
> > > > > > > > > +++ b/include/net/sock_reuseport.h
> > > > > > > > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > > > > > > > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > > > > > > > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > > > > > > >  			      bool bind_inany);
> > > > > > > > > -extern void reuseport_detach_sock(struct sock *sk);
> > > > > > > > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > > > > > > > >  extern struct sock *reuseport_select_sock(struct sock *sk,
> > > > > > > > >  					  u32 hash,
> > > > > > > > >  					  struct sk_buff *skb,
> > > > > > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > > > > > --- a/net/core/sock_reuseport.c
> > > > > > > > > +++ b/net/core/sock_reuseport.c
> > > > > > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > > > > > >  }
> > > > > > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > > > > > >  
> > > > > > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > > > > > >  {
> > > > > > > > >  	struct sock_reuseport *reuse;
> > > > > > > > > +	struct bpf_prog *prog;
> > > > > > > > > +	struct sock *nsk = NULL;
> > > > > > > > >  	int i;
> > > > > > > > >  
> > > > > > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > > > > >  
> > > > > > > > >  		reuse->num_socks--;
> > > > > > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > > > > > +		prog = rcu_dereference(reuse->prog);
> > > > > > > > >  
> > > > > > > > >  		if (sk->sk_protocol == IPPROTO_TCP) {
> > > > > > > > > +			if (reuse->num_socks && !prog)
> > > > > > > > > +				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
> > > > > > > > > +
> > > > > > > > >  			reuse->num_closed_socks++;
> > > > > > > > >  			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
> > > > > > > > >  		} else {
> > > > > > > > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > > > > >  		call_rcu(&reuse->rcu, reuseport_free_rcu);
> > > > > > > > >  out:
> > > > > > > > >  	spin_unlock_bh(&reuseport_lock);
> > > > > > > > > +
> > > > > > > > > +	return nsk;
> > > > > > > > >  }
> > > > > > > > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > > > > > > > >  
> > > > > > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > > > > >  }
> > > > > > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > > > > > >  
> > > > > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > > > > > +{
> > > > > > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > > > > > +
> > > > > > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > > > > > +
> > > > > > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > > > > > 
> > > > > > > > Are you sure lockdep is happy with this ?
> > > > > > > > 
> > > > > > > > I would guess it should complain, because :
> > > > > > > > 
> > > > > > > > lock(A);
> > > > > > > > lock(B);
> > > > > > > > ...
> > > > > > > > unlock(B);
> > > > > > > > unlock(A);
> > > > > > > > 
> > > > > > > > will fail when the opposite action happens eventually
> > > > > > > > 
> > > > > > > > lock(B);
> > > > > > > > lock(A);
> > > > > > > > ...
> > > > > > > > unlock(A);
> > > > > > > > unlock(B);
> > > > > > > 
> > > > > > > I enabled lockdep and did not see warnings of lockdep.
> > > > > > > 
> > > > > > > Also, the inversion deadlock does not happen in this case.
> > > > > > > In reuseport_detach_sock(), sk is moved backward in socks[] and poped out
> > > > > > > from the eBPF map, so the old listener will not be selected as the new
> > > > > > > listener.
> > > > > > > 
> > > > > > > 
> > > > > > > > > +
> > > > > > > > > +	if (old_accept_queue->rskq_accept_head) {
> > > > > > > > > +		if (new_accept_queue->rskq_accept_head)
> > > > > > > > > +			old_accept_queue->rskq_accept_tail->dl_next =
> > > > > > > > > +				new_accept_queue->rskq_accept_head;
> > > > > > > > > +		else
> > > > > > > > > +			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
> > > > > > > > > +
> > > > > > > > > +		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
> > > > > > > > > +		old_accept_queue->rskq_accept_head = NULL;
> > > > > > > > > +		old_accept_queue->rskq_accept_tail = NULL;
> > > > > > > > > +
> > > > > > > > > +		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
> > > > > > > > > +		WRITE_ONCE(sk->sk_ack_backlog, 0);
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	spin_unlock(&new_accept_queue->rskq_lock);
> > > > > > > > > +	spin_unlock(&old_accept_queue->rskq_lock);
> > > > > > > > > +}
> > > > > > > > > +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> > > > > > > > 
> > > > > > > > I fail to understand how the kernel can run fine right after this patch, before following patches are merged.
> > > > > > > 
> > > > > > > I will squash the two or reorganize them into definition part and migration
> > > > > > > part.
> > > > > > > 
> > > > > > > 
> > > > > > > > All request sockets in the socket accept queue MUST have their rsk_listener set to the listener,
> > > > > > > > this is how we designed things (each request socket has a reference taken on the listener)
> > > > > > > > 
> > > > > > > > We might even have some "BUG_ON(sk != req->rsk_listener);" in some places.
> > > > > > > > 
> > > > > > > > Since you splice list from old listener to the new one, without changing req->rsk_listener, bad things will happen.
> > > > > > I also have similar concern on the inconsistency in req->rsk_listener.
> > > > > > 
> > > > > > The fix-up in req->rsk_listener for the TFO req in patch 4
> > > > > > makes it clear that req->rsk_listener should be updated during
> > > > > > the migration instead of asking a much later code path
> > > > > > to accommodate this inconsistent req->rsk_listener pointer.
> > > > > 
> > > > > When I started this patchset, I read this thread and misunderstood that I
> > > > > had to migrate sockets in O(1) for scalability. So, I selected the fix-up
> > > > > approach and checked rsk_listener is not used except for TFO.
> > > > > 
> > > > > ---8<---
> > > > > Whole point of BPF was to avoid iterate through all sockets [1],
> > > > > and let user space use whatever selection logic it needs.
> > > > > 
> > > > > [1] This was okay with up to 16 sockets. But with 128 it does not scale.
> > > > > ---&<---
> > > > > https://lore.kernel.org/netdev/1458837191.12033.4.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > 
> > > > > 
> > > > > However, I've read it again, and this was about iterating over listeners
> > > > > to select a new listener, not about iterating over requests...
> > > > > In this patchset, we can select a listener in O(1) and it is enough.
> > > > > 
> > > > > 
> > > > > > The current inet_csk_listen_stop() is already iterating
> > > > > > the icsk_accept_queue and fastopenq.  The extra cost
> > > > > > in updating rsk_listener may be just noise?
> > > > > 
> > > > > Exactly.
> > > > > If we end up iterating requests, it is better to migrate than close. I will
> > > > > update each rsk_listener in inet_csk_reqsk_queue_migrate() in v3 patchset.
> > > > To be clear, I meant to do migration in inet_csk_listen_stop() instead
> > > > of doing it in the new inet_csk_reqsk_queue_migrate() which reqires a
> > > > double lock and then need to re-bring in the whole spin_lock_bh_nested
> > > > patch in the patch 3 of v2.
> > > > 
> > > > e.g. in the first while loop in inet_csk_listen_stop(),
> > > > if there is a target to migrate to,  it can do
> > > > something similar to inet_csk_reqsk_queue_add(target_sk, ...)
> > > > instead of doing the current inet_child_forget().
> > > > It probably needs something different from
> > > > inet_csk_reqsk_queue_add(), e.g. also update rsk_listener,
> > > > but the idea should be similar.
> > > > 
> > > > Since the rsk_listener has to be updated one by one, there is
> > > > really no point to do the list splicing which requires
> > > > the double lock.
> > > 
> > > I think it is a bit complex to pass the new listener from
> > > reuseport_detach_sock() to inet_csk_listen_stop().
> > > 
> > > __tcp_close/tcp_disconnect/tcp_abort
> > >  |-tcp_set_state
> > >  |  |-unhash
> > >  |     |-reuseport_detach_sock (return nsk)
> > >  |-inet_csk_listen_stop
> > Picking the new listener does not have to be done in
> > reuseport_detach_sock().
> > 
> > IIUC, it is done there only because it prefers to pick
> > the last sk from socks[] when bpf prog is not attached.
> > This seems to get into the way of exploring other potential
> > implementation options.
> 
> Yes.
> This is just idea, but we can reserve the last index of socks[] to hold the
> last 'moved' socket in reuseport_detach_sock() and use it in
> inet_csk_listen_stop().
> 
> 
> > Merging the discussion on the last socks[] pick from another thread:
> > >
> > > I think most applications start new listeners before closing listeners, in
> > > this case, selecting the moved socket as the new listener works well.
> > >
> > >
> > > > That said, if it is still desired to do a random pick by kernel when
> > > > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > > > suggested in another reply.  To keep it simple, I would also keep this
> > > > kernel-pick consistent instead of request socket is doing something
> > > > different from the unhash path.
> > >
> > > Then, is this way better to keep kernel-pick consistent?
> > >
> > >   1. call reuseport_select_migrated_sock() without sk_hash from any path
> > >   2. generate a random number in reuseport_select_migrated_sock()
> > >   3. pass it to __reuseport_select_sock() only for select-by-hash
> > >   (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> > >   5. do migration per queue in inet_csk_listen_stop() or per request in
> > >      receive path.
> > >
> > > I understand it is beautiful to keep consistensy, but also think
> > > the kernel-pick with heuristic performs better than random-pick.
> > I think discussing the best kernel pick without explicit user input
> > is going to be a dead end. There is always a case that
> > makes this heuristic (or guess) fail.  e.g. what if multiple
> > sk(s) being closed are always the last one in the socks[]?
> > all their child sk(s) will then be piled up at one listen sk
> > because the last socks[] is always picked?
> 
> There can be such a case, but it means the newly listened sockets are
> closed earlier than old ones.
> 
> 
> > Lets assume the last socks[] is indeed the best for all cases.  Then why
> > the in-progress req don't pick it this way?  I feel the implementation
> > is doing what is convenient at that point.  And that is fine, I think
> 
> In this patchset, I originally assumed four things:
> 
>   migration should be done
>     (i)   from old to new
>     (ii)  to redistribute requests evenly as possible
>     (iii) to keep the order of requests in the queue
>           (resulting in splicing queues)
>     (iv)  in O(1) for scalability
>           (resulting in fix-up rsk_listener approach)
> 
> I selected the last socket in unhash path to satisfy above four because the
> last socket changes at every close() syscall if application closes from
> older socket.
> 
> But in receiving ACK or retransmitting SYN+ACK, we cannot get the last
> 'moved' socket. Even if we reserve the last 'moved' socket in the last
> index by the idea above, we cannot sure the last socket is changed after
> close() for each req->listener. For example, we have listeners A, B, C, and
> D, and then call close(A) and close(B), and receive the final ACKs for A
> and B, then both of them are assigned to C. In this case, A for D and B for
> C is desired. So, selecting the last socket in socks[] for incoming
> requests cannnot realize (ii).
> 
> This is why I selected the last moved socket in unhash path and a random
> listener in receive path.
> 
> 
> > for kernel-pick, it should just go for simplicity and stay with
> > the random(/hash) pick instead of pretending the kernel knows the
> > application must operate in a certain way.  It is fine
> > that the pick was wrong, the kernel will eventually move the
> > childs/reqs to the survived listen sk.
> 
> Exactly. Also the heuristic way is not fair for every application.
> 
> After reading below idea (migrated_sk), I think random-pick is better
> at simplicity and passing each sk.
> 
> 
> > [ I still think the kernel should not even pick if
> >   there is no bpf prog to instruct how to pick
> >   but I am fine as long as there is a sysctl to
> >   guard this. ]
> 
> Unless different applications listen on the same port, random-pick can save
> connections which would be aborted. So, I would add a sysctl to do
> migration when no eBPF prog is attached.
> 
> 
> > I would rather focus on ensuring the bpf prog getting what it
> > needs to make the migration pick.  A few things
> > I would like to discuss and explore:
> > > If we splice requests like this, we do not need double lock?
> > > 
> > >   1. lock the accept queue of the old listener
> > >   2. unlink all requests and decrement refcount
> > >   3. unlock
> > >   4. update all requests with new listener
> > I guess updating rsk_listener can be done without acquiring
> > the lock in (5) below is because it is done under the
> > listening_hash's bucket lock (and also the global reuseport_lock) so
> > that the new listener will stay in TCP_LISTEN state?
> 
> If we do migration in inet_unhash(), the lock is held, but it is not held
> in inet_csk_listen_stop().
> 
> 
> > I am not sure iterating the queue under these
> > locks is a very good thing to do though.  The queue may not be
> > very long in usual setup but still let see
> > if that can be avoided.
> 
> I agree, lock should not be held long.
> 
> 
> > Do you think the iteration can be done without holding
> > bucket lock and the global reuseport_lock?  inet_csk_reqsk_queue_add()
> > is taking the rskq_lock and then check for TCP_LISTEN.  May be
> > something similar can be done also?
> 
> I think either one is necessary at least, so if the sk_state of selected
> listener is TCP_CLOSE (this is mostly by random-pick of kernel), then we
> have to fall back to call inet_child_forget().
> 
> 
> > While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
> > the bpf prog can pick per req and have the sk_hash.
> > However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
> > the bpf prog currently does not have a chance to
> > pick individually for each req/child on the queue.
> > Since it is iterating the queue anyway, does it make
> > sense to also call the bpf to pick for each req/child
> > in the queue?  It then can pass sk_hash (from child->sk_hash?)
> > to the bpf prog also instead of current 0.  The cost of calling
> > bpf prog is not really that much / signficant at the
> > migration code path.  If the queue is somehow
> > unusally long, there is already an existing
> > cond_resched() in inet_csk_listen_stop().
> > 
> > Then, instead of adding sk_reuseport_md->migration,
> > it can then add sk_reuseport_md->migrate_sk.
> > "migrate_sk = req" for in-progress req and "migrate_sk = child"
> > for iterating acceptq.  The bpf_prog can then tell what sk (req or child)
> > it is migrating by reading migrate_sk->state.  It can then also
> > learn the 4 tuples src/dst ip/port while skb is missing.
> > The sk_reuseport_md->sk can still point to the closed sk
> > such that the bpf prog can learn the cookie.
> > 
> > I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
> > by doing the above.  It also gives a more consistent
> > interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.
> 
> I think this is really nice idea. Also, I tried to implement random-pick
> one by one in inet_csk_listen_stop() yesterday, I found a concern about how
> to handle requests in TFO queue.
> 
> The request can be already accepted, so passing it to eBPF prog is
> confusing? But, redistributing randomly can affect all listeners
> unnecessary. How should we handle such requests?

I've implemented one-by-one migration only for the accept queue for now.
In addition to the concern about TFO queue, I want to discuss which should
we pass NULL or request_sock to eBPF program as migrate_sk when selecting a
listener for SYN ?

---8<---
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index a82fd4c912be..d0ddd3cb988b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 
+static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
+{
+       struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
+       bool migrated = false;
+
+       spin_lock(&queue->rskq_lock);
+       if (likely(nsk->sk_state == TCP_LISTEN)) {
+               migrated = true;
+
+               req->dl_next = NULL;
+               if (queue->rskq_accept_head == NULL)
+                       WRITE_ONCE(queue->rskq_accept_head, req);
+               else
+                       queue->rskq_accept_tail->dl_next = req;
+               queue->rskq_accept_tail = req;
+               sk_acceptq_added(nsk);
+               inet_csk_reqsk_queue_migrated(sk, nsk, req);
+       }
+       spin_unlock(&queue->rskq_lock);
+
+       return migrated;
+}
+
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
                                         struct request_sock *req, bool own_req)
 {
@@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
  */
 void inet_csk_listen_stop(struct sock *sk)
 {
+       struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct request_sock_queue *queue = &icsk->icsk_accept_queue;
        struct request_sock *next, *req;
+       struct sock *nsk;
 
        /* Following specs, it would be better either to send FIN
         * (and enter FIN-WAIT-1, it is normal close)
@@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
                WARN_ON(sock_owned_by_user(child));
                sock_hold(child);
 
+               if (reuseport_cb) {
+                       nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, NULL);
+                       if (nsk) {
+                               if (inet_csk_reqsk_queue_migrate(sk, nsk, req))
+                                       goto unlock_sock;
+                               else
+                                       sock_put(nsk);
+                       }
+               }
+
                inet_child_forget(sk, req, child);
                reqsk_put(req);
+unlock_sock:
                bh_unlock_sock(child);
                local_bh_enable();
                sock_put(child);
---8<---


> > >   5. lock the accept queue of the new listener
> > >   6. splice requests and increment refcount
> > >   7. unlock
> > > 
> > > Also, I think splicing is better to keep the order of requests. Adding one
> > > by one reverses it.
> > It can keep the order but I think it is orthogonal here.

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-01 14:44 ` [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
  2020-12-01 15:13   ` Eric Dumazet
@ 2020-12-10  0:07   ` Martin KaFai Lau
  2020-12-10  5:15     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-10  0:07 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Eric Dumazet,
	Alexei Starovoitov, Daniel Borkmann, Benjamin Herrenschmidt,
	Kuniyuki Iwashima, osa-contribution-log, bpf, netdev,
	linux-kernel

On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> adds two wrapper function of it to pass the migration type defined in the
> previous commit.
> 
>   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
>   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> 
> As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> patch also changes the code to call reuseport_select_migrated_sock() even
> if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> from the reuseport group, we rewrite request_sock.rsk_listener and resume
> processing the request.
> 
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/net/inet_connection_sock.h | 12 +++++++++++
>  include/net/request_sock.h         | 13 ++++++++++++
>  include/net/sock_reuseport.h       |  8 +++----
>  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
>  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
>  net/ipv4/tcp_ipv4.c                |  9 ++++++--
>  net/ipv6/tcp_ipv6.c                |  9 ++++++--
>  7 files changed, 81 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 2ea2d743f8fc..1e0958f5eb21 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
>  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
>  }
>  
> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +						 struct sock *nsk,
> +						 struct request_sock *req)
> +{
> +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +			     &inet_csk(nsk)->icsk_accept_queue,
> +			     req);
> +	sock_put(sk);
not sure if it is safe to do here.
IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
to req->rsk_listener such that sock_hold(req->rsk_listener) is
safe because its sk_refcnt is not zero.

> +	sock_hold(nsk);
> +	req->rsk_listener = nsk;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 361efe55b1ad..e71653c6eae2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
>  	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
>  	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
>  
> -	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> -		goto drop;
> +	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
> +		sk_listener = reuseport_select_migrated_sock(sk_listener,
> +							     req_to_sk(req)->sk_hash, NULL);
> +		if (!sk_listener) {
> +			sk_listener = req->rsk_listener;
> +			goto drop;
> +		}
> +		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
> +		icsk = inet_csk(sk_listener);
> +		queue = &icsk->icsk_accept_queue;
> +	}
>  
>  	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
>  	/* Normally all the openreqs are young and become mature
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index e4b31e70bd30..9a9aa27c6069 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  			goto csum_error;
>  		}
>  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> -			goto lookup;
> +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> +			if (!nsk) {
> +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> +				goto lookup;
> +			}
> +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> +			sk = nsk;
>  		}
>  		/* We own a reference on the listener, increase it again
>  		 * as we might lose it too soon.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 992cbf3eb9e3..ff11f3c0cb96 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
>  			goto csum_error;
>  		}
>  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> -			goto lookup;
> +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> +			if (!nsk) {
> +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> +				goto lookup;
> +			}
> +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> +			sk = nsk;
>  		}
>  		sock_hold(sk);
For example, this sock_hold(sk).  sk here is req->rsk_listener.

>  		refcounted = true;
> -- 
> 2.17.2 (Apple Git-113)
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-09 16:57                   ` Kuniyuki Iwashima
@ 2020-12-10  1:53                     ` Martin KaFai Lau
  2020-12-10  5:58                       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-10  1:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, linux-kernel, netdev

On Thu, Dec 10, 2020 at 01:57:19AM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> > > > I think it is a bit complex to pass the new listener from
> > > > reuseport_detach_sock() to inet_csk_listen_stop().
> > > > 
> > > > __tcp_close/tcp_disconnect/tcp_abort
> > > >  |-tcp_set_state
> > > >  |  |-unhash
> > > >  |     |-reuseport_detach_sock (return nsk)
> > > >  |-inet_csk_listen_stop
> > > Picking the new listener does not have to be done in
> > > reuseport_detach_sock().
> > > 
> > > IIUC, it is done there only because it prefers to pick
> > > the last sk from socks[] when bpf prog is not attached.
> > > This seems to get into the way of exploring other potential
> > > implementation options.
> > 
> > Yes.
> > This is just idea, but we can reserve the last index of socks[] to hold the
> > last 'moved' socket in reuseport_detach_sock() and use it in
> > inet_csk_listen_stop().
> > 
> > 
> > > Merging the discussion on the last socks[] pick from another thread:
> > > >
> > > > I think most applications start new listeners before closing listeners, in
> > > > this case, selecting the moved socket as the new listener works well.
> > > >
> > > >
> > > > > That said, if it is still desired to do a random pick by kernel when
> > > > > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > > > > suggested in another reply.  To keep it simple, I would also keep this
> > > > > kernel-pick consistent instead of request socket is doing something
> > > > > different from the unhash path.
> > > >
> > > > Then, is this way better to keep kernel-pick consistent?
> > > >
> > > >   1. call reuseport_select_migrated_sock() without sk_hash from any path
> > > >   2. generate a random number in reuseport_select_migrated_sock()
> > > >   3. pass it to __reuseport_select_sock() only for select-by-hash
> > > >   (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> > > >   5. do migration per queue in inet_csk_listen_stop() or per request in
> > > >      receive path.
> > > >
> > > > I understand it is beautiful to keep consistensy, but also think
> > > > the kernel-pick with heuristic performs better than random-pick.
> > > I think discussing the best kernel pick without explicit user input
> > > is going to be a dead end. There is always a case that
> > > makes this heuristic (or guess) fail.  e.g. what if multiple
> > > sk(s) being closed are always the last one in the socks[]?
> > > all their child sk(s) will then be piled up at one listen sk
> > > because the last socks[] is always picked?
> > 
> > There can be such a case, but it means the newly listened sockets are
> > closed earlier than old ones.
> > 
> > 
> > > Lets assume the last socks[] is indeed the best for all cases.  Then why
> > > the in-progress req don't pick it this way?  I feel the implementation
> > > is doing what is convenient at that point.  And that is fine, I think
> > 
> > In this patchset, I originally assumed four things:
> > 
> >   migration should be done
> >     (i)   from old to new
> >     (ii)  to redistribute requests evenly as possible
> >     (iii) to keep the order of requests in the queue
> >           (resulting in splicing queues)
> >     (iv)  in O(1) for scalability
> >           (resulting in fix-up rsk_listener approach)
> > 
> > I selected the last socket in unhash path to satisfy above four because the
> > last socket changes at every close() syscall if application closes from
> > older socket.
> > 
> > But in receiving ACK or retransmitting SYN+ACK, we cannot get the last
> > 'moved' socket. Even if we reserve the last 'moved' socket in the last
> > index by the idea above, we cannot sure the last socket is changed after
> > close() for each req->listener. For example, we have listeners A, B, C, and
> > D, and then call close(A) and close(B), and receive the final ACKs for A
> > and B, then both of them are assigned to C. In this case, A for D and B for
> > C is desired. So, selecting the last socket in socks[] for incoming
> > requests cannnot realize (ii).
> > 
> > This is why I selected the last moved socket in unhash path and a random
> > listener in receive path.
> > 
> > 
> > > for kernel-pick, it should just go for simplicity and stay with
> > > the random(/hash) pick instead of pretending the kernel knows the
> > > application must operate in a certain way.  It is fine
> > > that the pick was wrong, the kernel will eventually move the
> > > childs/reqs to the survived listen sk.
> > 
> > Exactly. Also the heuristic way is not fair for every application.
> > 
> > After reading below idea (migrated_sk), I think random-pick is better
> > at simplicity and passing each sk.
> > 
> > 
> > > [ I still think the kernel should not even pick if
> > >   there is no bpf prog to instruct how to pick
> > >   but I am fine as long as there is a sysctl to
> > >   guard this. ]
> > 
> > Unless different applications listen on the same port, random-pick can save
> > connections which would be aborted. So, I would add a sysctl to do
> > migration when no eBPF prog is attached.
> > 
> > 
> > > I would rather focus on ensuring the bpf prog getting what it
> > > needs to make the migration pick.  A few things
> > > I would like to discuss and explore:
> > > > If we splice requests like this, we do not need double lock?
> > > > 
> > > >   1. lock the accept queue of the old listener
> > > >   2. unlink all requests and decrement refcount
> > > >   3. unlock
> > > >   4. update all requests with new listener
> > > I guess updating rsk_listener can be done without acquiring
> > > the lock in (5) below is because it is done under the
> > > listening_hash's bucket lock (and also the global reuseport_lock) so
> > > that the new listener will stay in TCP_LISTEN state?
> > 
> > If we do migration in inet_unhash(), the lock is held, but it is not held
> > in inet_csk_listen_stop().
> > 
> > 
> > > I am not sure iterating the queue under these
> > > locks is a very good thing to do though.  The queue may not be
> > > very long in usual setup but still let see
> > > if that can be avoided.
> > 
> > I agree, lock should not be held long.
> > 
> > 
> > > Do you think the iteration can be done without holding
> > > bucket lock and the global reuseport_lock?  inet_csk_reqsk_queue_add()
> > > is taking the rskq_lock and then check for TCP_LISTEN.  May be
> > > something similar can be done also?
> > 
> > I think either one is necessary at least, so if the sk_state of selected
> > listener is TCP_CLOSE (this is mostly by random-pick of kernel), then we
> > have to fall back to call inet_child_forget().
> > 
> > 
> > > While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
> > > the bpf prog can pick per req and have the sk_hash.
> > > However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
> > > the bpf prog currently does not have a chance to
> > > pick individually for each req/child on the queue.
> > > Since it is iterating the queue anyway, does it make
> > > sense to also call the bpf to pick for each req/child
> > > in the queue?  It then can pass sk_hash (from child->sk_hash?)
> > > to the bpf prog also instead of current 0.  The cost of calling
> > > bpf prog is not really that much / signficant at the
> > > migration code path.  If the queue is somehow
> > > unusally long, there is already an existing
> > > cond_resched() in inet_csk_listen_stop().
> > > 
> > > Then, instead of adding sk_reuseport_md->migration,
> > > it can then add sk_reuseport_md->migrate_sk.
> > > "migrate_sk = req" for in-progress req and "migrate_sk = child"
> > > for iterating acceptq.  The bpf_prog can then tell what sk (req or child)
> > > it is migrating by reading migrate_sk->state.  It can then also
> > > learn the 4 tuples src/dst ip/port while skb is missing.
> > > The sk_reuseport_md->sk can still point to the closed sk
> > > such that the bpf prog can learn the cookie.
> > > 
> > > I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
> > > by doing the above.  It also gives a more consistent
> > > interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.
> > 
> > I think this is really nice idea. Also, I tried to implement random-pick
> > one by one in inet_csk_listen_stop() yesterday, I found a concern about how
> > to handle requests in TFO queue.
> > 
> > The request can be already accepted, so passing it to eBPF prog is
> > confusing? But, redistributing randomly can affect all listeners
> > unnecessary. How should we handle such requests?
> 
> I've implemented one-by-one migration only for the accept queue for now.
> In addition to the concern about TFO queue,
You meant this queue:  queue->fastopenq.rskq_rst_head?
Can "req" be passed?
I did not look up the lock/race in details for that though.

> I want to discuss which should
> we pass NULL or request_sock to eBPF program as migrate_sk when selecting a
> listener for SYN ?
hmmm... not sure I understand your question.

You meant the existing lookup listener case from inet_lhash2_lookup()?
There is nothing to migrate at that point, so NULL makes sense to me.
migrate_sk's type should be PTR_TO_SOCK_COMMON_OR_NULL.

> 
> ---8<---
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index a82fd4c912be..d0ddd3cb988b 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  }
>  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>  
> +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
> +{
> +       struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
> +       bool migrated = false;
> +
> +       spin_lock(&queue->rskq_lock);
> +       if (likely(nsk->sk_state == TCP_LISTEN)) {
> +               migrated = true;
> +
> +               req->dl_next = NULL;
> +               if (queue->rskq_accept_head == NULL)
> +                       WRITE_ONCE(queue->rskq_accept_head, req);
> +               else
> +                       queue->rskq_accept_tail->dl_next = req;
> +               queue->rskq_accept_tail = req;
> +               sk_acceptq_added(nsk);
> +               inet_csk_reqsk_queue_migrated(sk, nsk, req);
need to first resolve the question raised in patch 5 regarding
to the update on req->rsk_listener though.

> +       }
> +       spin_unlock(&queue->rskq_lock);
> +
> +       return migrated;
> +}
> +
>  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
>                                          struct request_sock *req, bool own_req)
>  {
> @@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
>   */
>  void inet_csk_listen_stop(struct sock *sk)
>  {
> +       struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct request_sock_queue *queue = &icsk->icsk_accept_queue;
>         struct request_sock *next, *req;
> +       struct sock *nsk;
>  
>         /* Following specs, it would be better either to send FIN
>          * (and enter FIN-WAIT-1, it is normal close)
> @@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
>                 WARN_ON(sock_owned_by_user(child));
>                 sock_hold(child);
>  
> +               if (reuseport_cb) {
> +                       nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, NULL);
> +                       if (nsk) {
> +                               if (inet_csk_reqsk_queue_migrate(sk, nsk, req))
> +                                       goto unlock_sock;
> +                               else
> +                                       sock_put(nsk);
> +                       }
> +               }
> +
>                 inet_child_forget(sk, req, child);
>                 reqsk_put(req);
> +unlock_sock:
>                 bh_unlock_sock(child);
>                 local_bh_enable();
>                 sock_put(child);
> ---8<---
> 
> 
> > > >   5. lock the accept queue of the new listener
> > > >   6. splice requests and increment refcount
> > > >   7. unlock
> > > > 
> > > > Also, I think splicing is better to keep the order of requests. Adding one
> > > > by one reverses it.
> > > It can keep the order but I think it is orthogonal here.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-10  0:07   ` Martin KaFai Lau
@ 2020-12-10  5:15     ` Kuniyuki Iwashima
  2020-12-10 18:49       ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-10  5:15 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 9 Dec 2020 16:07:07 -0800
> On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > adds two wrapper function of it to pass the migration type defined in the
> > previous commit.
> > 
> >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > 
> > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > patch also changes the code to call reuseport_select_migrated_sock() even
> > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > processing the request.
> > 
> > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > ---
> >  include/net/inet_connection_sock.h | 12 +++++++++++
> >  include/net/request_sock.h         | 13 ++++++++++++
> >  include/net/sock_reuseport.h       |  8 +++----
> >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> >  7 files changed, 81 insertions(+), 17 deletions(-)
> > 
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index 2ea2d743f8fc..1e0958f5eb21 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> >  }
> >  
> > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > +						 struct sock *nsk,
> > +						 struct request_sock *req)
> > +{
> > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > +			     &inet_csk(nsk)->icsk_accept_queue,
> > +			     req);
> > +	sock_put(sk);
> not sure if it is safe to do here.
> IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> to req->rsk_listener such that sock_hold(req->rsk_listener) is
> safe because its sk_refcnt is not zero.

I think it is safe to call sock_put() for the old listener here.

Without this patchset, at receiving the final ACK or retransmitting
SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put(). And
then, we do `goto lookup;` and overwrite the sk.

In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
reuseport_select_migrated_sock(), so we have to call sock_put() for the old
listener instead to free it properly.

---8<---
+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+					    struct sk_buff *skb)
+{
+	struct sock *nsk;
+
+	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
+		return nsk;
+
+	return NULL;
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
---8<---
https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/


> > +	sock_hold(nsk);
> > +	req->rsk_listener = nsk;
> > +}
> > +
> 
> [ ... ]
> 
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 361efe55b1ad..e71653c6eae2 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
> >  	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> >  	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
> >  
> > -	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> > -		goto drop;
> > +	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
> > +		sk_listener = reuseport_select_migrated_sock(sk_listener,
> > +							     req_to_sk(req)->sk_hash, NULL);
> > +		if (!sk_listener) {
> > +			sk_listener = req->rsk_listener;
> > +			goto drop;
> > +		}
> > +		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
> > +		icsk = inet_csk(sk_listener);
> > +		queue = &icsk->icsk_accept_queue;
> > +	}
> >  
> >  	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
> >  	/* Normally all the openreqs are young and become mature
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index e4b31e70bd30..9a9aa27c6069 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >  			goto csum_error;
> >  		}
> >  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> > -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> > -			goto lookup;
> > +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> > +			if (!nsk) {
> > +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> > +				goto lookup;
> > +			}
> > +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > +			sk = nsk;
> >  		}
> >  		/* We own a reference on the listener, increase it again
> >  		 * as we might lose it too soon.
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 992cbf3eb9e3..ff11f3c0cb96 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
> >  			goto csum_error;
> >  		}
> >  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> > -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> > -			goto lookup;
> > +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> > +			if (!nsk) {
> > +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> > +				goto lookup;
> > +			}
> > +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > +			sk = nsk;
> >  		}
> >  		sock_hold(sk);
> For example, this sock_hold(sk).  sk here is req->rsk_listener.

After migration, this is for the new listener and it is safe because
refcount_inc_not_zero() for the new listener is called in
reuseport_select_migerate_sock().


> >  		refcounted = true;
> > -- 
> > 2.17.2 (Apple Git-113)


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-10  1:53                     ` Martin KaFai Lau
@ 2020-12-10  5:58                       ` Kuniyuki Iwashima
  2020-12-10 19:33                         ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-10  5:58 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 9 Dec 2020 17:53:19 -0800
> On Thu, Dec 10, 2020 at 01:57:19AM +0900, Kuniyuki Iwashima wrOAote:
> [ ... ]
> 
> > > > > I think it is a bit complex to pass the new listener from
> > > > > reuseport_detach_sock() to inet_csk_listen_stop().
> > > > > 
> > > > > __tcp_close/tcp_disconnect/tcp_abort
> > > > >  |-tcp_set_state
> > > > >  |  |-unhash
> > > > >  |     |-reuseport_detach_sock (return nsk)
> > > > >  |-inet_csk_listen_stop
> > > > Picking the new listener does not have to be done in
> > > > reuseport_detach_sock().
> > > > 
> > > > IIUC, it is done there only because it prefers to pick
> > > > the last sk from socks[] when bpf prog is not attached.
> > > > This seems to get into the way of exploring other potential
> > > > implementation options.
> > > 
> > > Yes.
> > > This is just idea, but we can reserve the last index of socks[] to hold the
> > > last 'moved' socket in reuseport_detach_sock() and use it in
> > > inet_csk_listen_stop().
> > > 
> > > 
> > > > Merging the discussion on the last socks[] pick from another thread:
> > > > >
> > > > > I think most applications start new listeners before closing listeners, in
> > > > > this case, selecting the moved socket as the new listener works well.
> > > > >
> > > > >
> > > > > > That said, if it is still desired to do a random pick by kernel when
> > > > > > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > > > > > suggested in another reply.  To keep it simple, I would also keep this
> > > > > > kernel-pick consistent instead of request socket is doing something
> > > > > > different from the unhash path.
> > > > >
> > > > > Then, is this way better to keep kernel-pick consistent?
> > > > >
> > > > >   1. call reuseport_select_migrated_sock() without sk_hash from any path
> > > > >   2. generate a random number in reuseport_select_migrated_sock()
> > > > >   3. pass it to __reuseport_select_sock() only for select-by-hash
> > > > >   (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> > > > >   5. do migration per queue in inet_csk_listen_stop() or per request in
> > > > >      receive path.
> > > > >
> > > > > I understand it is beautiful to keep consistensy, but also think
> > > > > the kernel-pick with heuristic performs better than random-pick.
> > > > I think discussing the best kernel pick without explicit user input
> > > > is going to be a dead end. There is always a case that
> > > > makes this heuristic (or guess) fail.  e.g. what if multiple
> > > > sk(s) being closed are always the last one in the socks[]?
> > > > all their child sk(s) will then be piled up at one listen sk
> > > > because the last socks[] is always picked?
> > > 
> > > There can be such a case, but it means the newly listened sockets are
> > > closed earlier than old ones.
> > > 
> > > 
> > > > Lets assume the last socks[] is indeed the best for all cases.  Then why
> > > > the in-progress req don't pick it this way?  I feel the implementation
> > > > is doing what is convenient at that point.  And that is fine, I think
> > > 
> > > In this patchset, I originally assumed four things:
> > > 
> > >   migration should be done
> > >     (i)   from old to new
> > >     (ii)  to redistribute requests evenly as possible
> > >     (iii) to keep the order of requests in the queue
> > >           (resulting in splicing queues)
> > >     (iv)  in O(1) for scalability
> > >           (resulting in fix-up rsk_listener approach)
> > > 
> > > I selected the last socket in unhash path to satisfy above four because the
> > > last socket changes at every close() syscall if application closes from
> > > older socket.
> > > 
> > > But in receiving ACK or retransmitting SYN+ACK, we cannot get the last
> > > 'moved' socket. Even if we reserve the last 'moved' socket in the last
> > > index by the idea above, we cannot sure the last socket is changed after
> > > close() for each req->listener. For example, we have listeners A, B, C, and
> > > D, and then call close(A) and close(B), and receive the final ACKs for A
> > > and B, then both of them are assigned to C. In this case, A for D and B for
> > > C is desired. So, selecting the last socket in socks[] for incoming
> > > requests cannnot realize (ii).
> > > 
> > > This is why I selected the last moved socket in unhash path and a random
> > > listener in receive path.
> > > 
> > > 
> > > > for kernel-pick, it should just go for simplicity and stay with
> > > > the random(/hash) pick instead of pretending the kernel knows the
> > > > application must operate in a certain way.  It is fine
> > > > that the pick was wrong, the kernel will eventually move the
> > > > childs/reqs to the survived listen sk.
> > > 
> > > Exactly. Also the heuristic way is not fair for every application.
> > > 
> > > After reading below idea (migrated_sk), I think random-pick is better
> > > at simplicity and passing each sk.
> > > 
> > > 
> > > > [ I still think the kernel should not even pick if
> > > >   there is no bpf prog to instruct how to pick
> > > >   but I am fine as long as there is a sysctl to
> > > >   guard this. ]
> > > 
> > > Unless different applications listen on the same port, random-pick can save
> > > connections which would be aborted. So, I would add a sysctl to do
> > > migration when no eBPF prog is attached.
> > > 
> > > 
> > > > I would rather focus on ensuring the bpf prog getting what it
> > > > needs to make the migration pick.  A few things
> > > > I would like to discuss and explore:
> > > > > If we splice requests like this, we do not need double lock?
> > > > > 
> > > > >   1. lock the accept queue of the old listener
> > > > >   2. unlink all requests and decrement refcount
> > > > >   3. unlock
> > > > >   4. update all requests with new listener
> > > > I guess updating rsk_listener can be done without acquiring
> > > > the lock in (5) below is because it is done under the
> > > > listening_hash's bucket lock (and also the global reuseport_lock) so
> > > > that the new listener will stay in TCP_LISTEN state?
> > > 
> > > If we do migration in inet_unhash(), the lock is held, but it is not held
> > > in inet_csk_listen_stop().
> > > 
> > > 
> > > > I am not sure iterating the queue under these
> > > > locks is a very good thing to do though.  The queue may not be
> > > > very long in usual setup but still let see
> > > > if that can be avoided.
> > > 
> > > I agree, lock should not be held long.
> > > 
> > > 
> > > > Do you think the iteration can be done without holding
> > > > bucket lock and the global reuseport_lock?  inet_csk_reqsk_queue_add()
> > > > is taking the rskq_lock and then check for TCP_LISTEN.  May be
> > > > something similar can be done also?
> > > 
> > > I think either one is necessary at least, so if the sk_state of selected
> > > listener is TCP_CLOSE (this is mostly by random-pick of kernel), then we
> > > have to fall back to call inet_child_forget().
> > > 
> > > 
> > > > While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
> > > > the bpf prog can pick per req and have the sk_hash.
> > > > However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
> > > > the bpf prog currently does not have a chance to
> > > > pick individually for each req/child on the queue.
> > > > Since it is iterating the queue anyway, does it make
> > > > sense to also call the bpf to pick for each req/child
> > > > in the queue?  It then can pass sk_hash (from child->sk_hash?)
> > > > to the bpf prog also instead of current 0.  The cost of calling
> > > > bpf prog is not really that much / signficant at the
> > > > migration code path.  If the queue is somehow
> > > > unusally long, there is already an existing
> > > > cond_resched() in inet_csk_listen_stop().
> > > > 
> > > > Then, instead of adding sk_reuseport_md->migration,
> > > > it can then add sk_reuseport_md->migrate_sk.
> > > > "migrate_sk = req" for in-progress req and "migrate_sk = child"
> > > > for iterating acceptq.  The bpf_prog can then tell what sk (req or child)
> > > > it is migrating by reading migrate_sk->state.  It can then also
> > > > learn the 4 tuples src/dst ip/port while skb is missing.
> > > > The sk_reuseport_md->sk can still point to the closed sk
> > > > such that the bpf prog can learn the cookie.
> > > > 
> > > > I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > > and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
> > > > by doing the above.  It also gives a more consistent
> > > > interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.
> > > 
> > > I think this is really nice idea. Also, I tried to implement random-pick
> > > one by one in inet_csk_listen_stop() yesterday, I found a concern about how
> > > to handle requests in TFO queue.
> > > 
> > > The request can be already accepted, so passing it to eBPF prog is
> > > confusing? But, redistributing randomly can affect all listeners
> > > unnecessary. How should we handle such requests?
> > 
> > I've implemented one-by-one migration only for the accept queue for now.
> > In addition to the concern about TFO queue,
> You meant this queue:  queue->fastopenq.rskq_rst_head?

Yes.


> Can "req" be passed?
> I did not look up the lock/race in details for that though.

I think if we rewrite freeing TFO requests part like one of accept queue
using reqsk_queue_remove(), we can also migrate them.

In this patchset, selecting a listener for accept queue, the TFO queue of
the same listener is also migrated to another listener in order to prevent
TFO spoofing attack.

If the request in the accept queue is migrated one by one, I am wondering
which should the request in TFO queue be migrated to prevent attack or
freed.

I think user need not know about keeping such requests in kernel to prevent
attacks, so passing them to eBPF prog is confusing. But, redistributing
them randomly without user's intention can make some irrelevant listeners
unnecessarily drop new TFO requests, so this is also bad. Moreover, freeing
such requests seems not so good in the point of security.


> > I want to discuss which should
> > we pass NULL or request_sock to eBPF program as migrate_sk when selecting a
> > listener for SYN ?
> hmmm... not sure I understand your question.
> 
> You meant the existing lookup listener case from inet_lhash2_lookup()?

Yes.


> There is nothing to migrate at that point, so NULL makes sense to me.
> migrate_sk's type should be PTR_TO_SOCK_COMMON_OR_NULL.

Thank you, I will set PTR_TO_SOCK_COMMON_OR_NULL and pass NULL in
inet_lhash2_lookup().


> > ---8<---
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index a82fd4c912be..d0ddd3cb988b 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> >  }
> >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> >  
> > +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
> > +{
> > +       struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
> > +       bool migrated = false;
> > +
> > +       spin_lock(&queue->rskq_lock);
> > +       if (likely(nsk->sk_state == TCP_LISTEN)) {
> > +               migrated = true;
> > +
> > +               req->dl_next = NULL;
> > +               if (queue->rskq_accept_head == NULL)
> > +                       WRITE_ONCE(queue->rskq_accept_head, req);
> > +               else
> > +                       queue->rskq_accept_tail->dl_next = req;
> > +               queue->rskq_accept_tail = req;
> > +               sk_acceptq_added(nsk);
> > +               inet_csk_reqsk_queue_migrated(sk, nsk, req);
> need to first resolve the question raised in patch 5 regarding
> to the update on req->rsk_listener though.

In the unhash path, it is also safe to call sock_put() for the old listner.

In inet_csk_listen_stop(), the sk_refcnt of the listener >= 1. If the
listener does not have immature requests, sk_refcnt is 1 and freed in
__tcp_close().

  sock_hold(sk) in __tcp_close()
  sock_put(sk) in inet_csk_destroy_sock()
  sock_put(sk) in __tcp_clsoe()


> > +       }
> > +       spin_unlock(&queue->rskq_lock);
> > +
> > +       return migrated;
> > +}
> > +
> >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> >                                          struct request_sock *req, bool own_req)
> >  {
> > @@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
> >   */
> >  void inet_csk_listen_stop(struct sock *sk)
> >  {
> > +       struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
> >         struct inet_connection_sock *icsk = inet_csk(sk);
> >         struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> >         struct request_sock *next, *req;
> > +       struct sock *nsk;
> >  
> >         /* Following specs, it would be better either to send FIN
> >          * (and enter FIN-WAIT-1, it is normal close)
> > @@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
> >                 WARN_ON(sock_owned_by_user(child));
> >                 sock_hold(child);
> >  
> > +               if (reuseport_cb) {
> > +                       nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, NULL);
> > +                       if (nsk) {
> > +                               if (inet_csk_reqsk_queue_migrate(sk, nsk, req))
> > +                                       goto unlock_sock;
> > +                               else
> > +                                       sock_put(nsk);
> > +                       }
> > +               }
> > +
> >                 inet_child_forget(sk, req, child);
> >                 reqsk_put(req);
> > +unlock_sock:
> >                 bh_unlock_sock(child);
> >                 local_bh_enable();
> >                 sock_put(child);
> > ---8<---
> > 
> > 
> > > > >   5. lock the accept queue of the new listener
> > > > >   6. splice requests and increment refcount
> > > > >   7. unlock
> > > > > 
> > > > > Also, I think splicing is better to keep the order of requests. Adding one
> > > > > by one reverses it.
> > > > It can keep the order but I think it is orthogonal here.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-10  5:15     ` Kuniyuki Iwashima
@ 2020-12-10 18:49       ` Martin KaFai Lau
  2020-12-14 17:03         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-10 18:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > > adds two wrapper function of it to pass the migration type defined in the
> > > previous commit.
> > > 
> > >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > 
> > > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > > patch also changes the code to call reuseport_select_migrated_sock() even
> > > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > > processing the request.
> > > 
> > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > ---
> > >  include/net/inet_connection_sock.h | 12 +++++++++++
> > >  include/net/request_sock.h         | 13 ++++++++++++
> > >  include/net/sock_reuseport.h       |  8 +++----
> > >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> > >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> > >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> > >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> > >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > >  }
> > >  
> > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > +						 struct sock *nsk,
> > > +						 struct request_sock *req)
> > > +{
> > > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > +			     &inet_csk(nsk)->icsk_accept_queue,
> > > +			     req);
> > > +	sock_put(sk);
> > not sure if it is safe to do here.
> > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > safe because its sk_refcnt is not zero.
> 
> I think it is safe to call sock_put() for the old listener here.
> 
> Without this patchset, at receiving the final ACK or retransmitting
> SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
Note that in your example (final ACK), sock_put(req->rsk_listener) is
_only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
to reach zero.

Here in this patch, it sock_put(req->rsk_listener) without req->rsk_refcnt
reaching zero.

Let says there are two cores holding two refcnt to req (one cnt for each core)
by looking up the req from ehash.  One of the core do this migrate and
sock_put(req->rsk_listener).  Another core does sock_hold(req->rsk_listener).

	Core1					Core2
						sock_put(req->rsk_listener)

	sock_hold(req->rsk_listener)

> And then, we do `goto lookup;` and overwrite the sk.
> 
> In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> listener instead to free it properly.
> 
> ---8<---
> +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> +					    struct sk_buff *skb)
> +{
> +	struct sock *nsk;
> +
> +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
There is another potential issue here.  The TCP_LISTEN nsk is protected
by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
is not under rcu_read_lock().

The receive path may be ok as it is in rcu.  You may need to check for
others.

> +		return nsk;
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> ---8<---
> https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
> 
> 
> > > +	sock_hold(nsk);
> > > +	req->rsk_listener = nsk;
It looks like there is another race here.  What
if multiple cores try to update req->rsk_listener?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-10  5:58                       ` Kuniyuki Iwashima
@ 2020-12-10 19:33                         ` Martin KaFai Lau
  2020-12-14 17:16                           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-10 19:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, linux-kernel, netdev

On Thu, Dec 10, 2020 at 02:58:10PM +0900, Kuniyuki Iwashima wrote:

[ ... ]

> > > I've implemented one-by-one migration only for the accept queue for now.
> > > In addition to the concern about TFO queue,
> > You meant this queue:  queue->fastopenq.rskq_rst_head?
> 
> Yes.
> 
> 
> > Can "req" be passed?
> > I did not look up the lock/race in details for that though.
> 
> I think if we rewrite freeing TFO requests part like one of accept queue
> using reqsk_queue_remove(), we can also migrate them.
> 
> In this patchset, selecting a listener for accept queue, the TFO queue of
> the same listener is also migrated to another listener in order to prevent
> TFO spoofing attack.
> 
> If the request in the accept queue is migrated one by one, I am wondering
> which should the request in TFO queue be migrated to prevent attack or
> freed.
> 
> I think user need not know about keeping such requests in kernel to prevent
> attacks, so passing them to eBPF prog is confusing. But, redistributing
> them randomly without user's intention can make some irrelevant listeners
> unnecessarily drop new TFO requests, so this is also bad. Moreover, freeing
> such requests seems not so good in the point of security.
The current behavior (during process restart) is also not carrying this
security queue.  Not carrying them in this patch will make it
less secure than the current behavior during process restart?
Do you need it now or it is something that can be considered for later
without changing uapi bpf.h?

> > > ---8<---
> > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > index a82fd4c912be..d0ddd3cb988b 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >  }
> > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > >  
> > > +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
> > > +{
> > > +       struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
> > > +       bool migrated = false;
> > > +
> > > +       spin_lock(&queue->rskq_lock);
> > > +       if (likely(nsk->sk_state == TCP_LISTEN)) {
> > > +               migrated = true;
> > > +
> > > +               req->dl_next = NULL;
> > > +               if (queue->rskq_accept_head == NULL)
> > > +                       WRITE_ONCE(queue->rskq_accept_head, req);
> > > +               else
> > > +                       queue->rskq_accept_tail->dl_next = req;
> > > +               queue->rskq_accept_tail = req;
> > > +               sk_acceptq_added(nsk);
> > > +               inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > need to first resolve the question raised in patch 5 regarding
> > to the update on req->rsk_listener though.
> 
> In the unhash path, it is also safe to call sock_put() for the old listner.
> 
> In inet_csk_listen_stop(), the sk_refcnt of the listener >= 1. If the
> listener does not have immature requests, sk_refcnt is 1 and freed in
> __tcp_close().
> 
>   sock_hold(sk) in __tcp_close()
>   sock_put(sk) in inet_csk_destroy_sock()
>   sock_put(sk) in __tcp_clsoe()
I don't see how it is different here than in patch 5.
I could be missing something.

Lets contd the discussion on the other thread (patch 5) first.

> 
> 
> > > +       }
> > > +       spin_unlock(&queue->rskq_lock);
> > > +
> > > +       return migrated;
> > > +}
> > > +
> > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> > >                                          struct request_sock *req, bool own_req)
> > >  {
> > > @@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
> > >   */
> > >  void inet_csk_listen_stop(struct sock *sk)
> > >  {
> > > +       struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
> > >         struct inet_connection_sock *icsk = inet_csk(sk);
> > >         struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> > >         struct request_sock *next, *req;
> > > +       struct sock *nsk;
> > >  
> > >         /* Following specs, it would be better either to send FIN
> > >          * (and enter FIN-WAIT-1, it is normal close)
> > > @@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
> > >                 WARN_ON(sock_owned_by_user(child));
> > >                 sock_hold(child);
> > >  
> > > +               if (reuseport_cb) {
> > > +                       nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, NULL);
> > > +                       if (nsk) {
> > > +                               if (inet_csk_reqsk_queue_migrate(sk, nsk, req))
> > > +                                       goto unlock_sock;
> > > +                               else
> > > +                                       sock_put(nsk);
> > > +                       }
> > > +               }
> > > +
> > >                 inet_child_forget(sk, req, child);
> > >                 reqsk_put(req);
> > > +unlock_sock:
> > >                 bh_unlock_sock(child);
> > >                 local_bh_enable();
> > >                 sock_put(child);
> > > ---8<---
> > > 
> > > 
> > > > > >   5. lock the accept queue of the new listener
> > > > > >   6. splice requests and increment refcount
> > > > > >   7. unlock
> > > > > > 
> > > > > > Also, I think splicing is better to keep the order of requests. Adding one
> > > > > > by one reverses it.
> > > > > It can keep the order but I think it is orthogonal here.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-10 18:49       ` Martin KaFai Lau
@ 2020-12-14 17:03         ` Kuniyuki Iwashima
  2020-12-15  2:58           ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-14 17:03 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 10 Dec 2020 10:49:15 -0800
> On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > > > adds two wrapper function of it to pass the migration type defined in the
> > > > previous commit.
> > > > 
> > > >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> > > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > > 
> > > > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > > > patch also changes the code to call reuseport_select_migrated_sock() even
> > > > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > > > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > > > processing the request.
> > > > 
> > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > ---
> > > >  include/net/inet_connection_sock.h | 12 +++++++++++
> > > >  include/net/request_sock.h         | 13 ++++++++++++
> > > >  include/net/sock_reuseport.h       |  8 +++----
> > > >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> > > >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> > > >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> > > >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> > > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > > 
> > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > > --- a/include/net/inet_connection_sock.h
> > > > +++ b/include/net/inet_connection_sock.h
> > > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> > > >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > > >  }
> > > >  
> > > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > > +						 struct sock *nsk,
> > > > +						 struct request_sock *req)
> > > > +{
> > > > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > > +			     &inet_csk(nsk)->icsk_accept_queue,
> > > > +			     req);
> > > > +	sock_put(sk);
> > > not sure if it is safe to do here.
> > > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > > safe because its sk_refcnt is not zero.
> > 
> > I think it is safe to call sock_put() for the old listener here.
> > 
> > Without this patchset, at receiving the final ACK or retransmitting
> > SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> > by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
> Note that in your example (final ACK), sock_put(req->rsk_listener) is
> _only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
> to reach zero.
> 
> Here in this patch, it sock_put(req->rsk_listener) without req->rsk_refcnt
> reaching zero.
> 
> Let says there are two cores holding two refcnt to req (one cnt for each core)
> by looking up the req from ehash.  One of the core do this migrate and
> sock_put(req->rsk_listener).  Another core does sock_hold(req->rsk_listener).
> 
> 	Core1					Core2
> 						sock_put(req->rsk_listener)
> 
> 	sock_hold(req->rsk_listener)

I'm sorry for the late reply.

I missed this situation that different Cores get into NEW_SYN_RECV path,
but this does exist.
https://lore.kernel.org/netdev/1517977874.3715.153.camel@gmail.com/#t
https://lore.kernel.org/netdev/1518531252.3715.178.camel@gmail.com/


If close() is called for the listener and the request has the last refcount
for it, sock_put() by Core2 frees it, so Core1 cannot proceed with freed
listener. So, it is good to call refcount_inc_not_zero() instead of
sock_hold(). If refcount_inc_not_zero() fails, it means that the listener
is closed and the req->rsk_listener is changed in another place. Then, we
can continue processing the request by rewriting sk with rsk_listener and
calling sock_hold() for it.

Also, the migration by Core2 can be done after sock_hold() by Core1. Then
if Core1 win the race by removing the request from ehash,
in inet_csk_reqsk_queue_add(), instead of sk, req->rsk_listener should be
used as the proper listener to add the req into its queue. But if the
rsk_listener is also TCP_CLOSE, we have to call inet_child_forget().

Moreover, we have to check the listener is freed in the beginning of
reqsk_timer_handler() by refcount_inc_not_zero().


> > And then, we do `goto lookup;` and overwrite the sk.
> > 
> > In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> > reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> > listener instead to free it properly.
> > 
> > ---8<---
> > +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> > +					    struct sk_buff *skb)
> > +{
> > +	struct sock *nsk;
> > +
> > +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> > +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
> There is another potential issue here.  The TCP_LISTEN nsk is protected
> by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
> is not under rcu_read_lock().
> 
> The receive path may be ok as it is in rcu.  You may need to check for
> others.

IIUC, is this mean nsk can be NULL after grace period of RCU? If so, I will
move rcu_read_lock/unlock() from __reuseport_select_sock() to
reuseport_select_sock() and reuseport_select_migrated_sock().


> > +		return nsk;
> > +
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> > ---8<---
> > https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
> > 
> > 
> > > > +	sock_hold(nsk);
> > > > +	req->rsk_listener = nsk;
> It looks like there is another race here.  What
> if multiple cores try to update req->rsk_listener?

I think we have to add a lock in struct request_sock, acquire it, check
if the rsk_listener is changed or not, and then do migration. Also, if the
listener has been changed, we have to tell the caller to use it as the new
listener.

---8<---
       spin_lock(&lock)
       if (sk != req->rsk_listener) {
               nsk = req->rsk_listener;
               goto out;
       }

       // do migration
out:
       spin_unlock(&lock)
       return nsk;
---8<---


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  2020-12-10 19:33                         ` Martin KaFai Lau
@ 2020-12-14 17:16                           ` Kuniyuki Iwashima
  0 siblings, 0 replies; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-14 17:16 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, eric.dumazet, kuba,
	kuni1840, kuniyu, linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 10 Dec 2020 11:33:40 -0800
> On Thu, Dec 10, 2020 at 02:58:10PM +0900, Kuniyuki Iwashima wrote:
> 
> [ ... ]
> 
> > > > I've implemented one-by-one migration only for the accept queue for now.
> > > > In addition to the concern about TFO queue,
> > > You meant this queue:  queue->fastopenq.rskq_rst_head?
> > 
> > Yes.
> > 
> > 
> > > Can "req" be passed?
> > > I did not look up the lock/race in details for that though.
> > 
> > I think if we rewrite freeing TFO requests part like one of accept queue
> > using reqsk_queue_remove(), we can also migrate them.
> > 
> > In this patchset, selecting a listener for accept queue, the TFO queue of
> > the same listener is also migrated to another listener in order to prevent
> > TFO spoofing attack.
> > 
> > If the request in the accept queue is migrated one by one, I am wondering
> > which should the request in TFO queue be migrated to prevent attack or
> > freed.
> > 
> > I think user need not know about keeping such requests in kernel to prevent
> > attacks, so passing them to eBPF prog is confusing. But, redistributing
> > them randomly without user's intention can make some irrelevant listeners
> > unnecessarily drop new TFO requests, so this is also bad. Moreover, freeing
> > such requests seems not so good in the point of security.
> The current behavior (during process restart) is also not carrying this
> security queue.  Not carrying them in this patch will make it
> less secure than the current behavior during process restart?

No, I thought I could make it more secure.


> Do you need it now or it is something that can be considered for later
> without changing uapi bpf.h?

No, I do not need it for any other reason, so I will simply free the
requests in TFO queue.
Thank you.


> > > > ---8<---
> > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > index a82fd4c912be..d0ddd3cb988b 100644
> > > > --- a/net/ipv4/inet_connection_sock.c
> > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > >  }
> > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > >  
> > > > +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
> > > > +{
> > > > +       struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > +       bool migrated = false;
> > > > +
> > > > +       spin_lock(&queue->rskq_lock);
> > > > +       if (likely(nsk->sk_state == TCP_LISTEN)) {
> > > > +               migrated = true;
> > > > +
> > > > +               req->dl_next = NULL;
> > > > +               if (queue->rskq_accept_head == NULL)
> > > > +                       WRITE_ONCE(queue->rskq_accept_head, req);
> > > > +               else
> > > > +                       queue->rskq_accept_tail->dl_next = req;
> > > > +               queue->rskq_accept_tail = req;
> > > > +               sk_acceptq_added(nsk);
> > > > +               inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > > need to first resolve the question raised in patch 5 regarding
> > > to the update on req->rsk_listener though.
> > 
> > In the unhash path, it is also safe to call sock_put() for the old listner.
> > 
> > In inet_csk_listen_stop(), the sk_refcnt of the listener >= 1. If the
> > listener does not have immature requests, sk_refcnt is 1 and freed in
> > __tcp_close().
> > 
> >   sock_hold(sk) in __tcp_close()
> >   sock_put(sk) in inet_csk_destroy_sock()
> >   sock_put(sk) in __tcp_clsoe()
> I don't see how it is different here than in patch 5.
> I could be missing something.
> 
> Lets contd the discussion on the other thread (patch 5) first.

The listening socket has two kinds of refcounts for itself(1) and
requests(n). I think the listener has its own refcount at least in
inet_csk_listen_stop(), so sock_put() here never free the listener.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-14 17:03         ` Kuniyuki Iwashima
@ 2020-12-15  2:58           ` Martin KaFai Lau
  2020-12-16 16:41             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-15  2:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Tue, Dec 15, 2020 at 02:03:13AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Thu, 10 Dec 2020 10:49:15 -0800
> > On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau <kafai@fb.com>
> > > Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > > > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > > > > adds two wrapper function of it to pass the migration type defined in the
> > > > > previous commit.
> > > > > 
> > > > >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> > > > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > > > 
> > > > > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > > > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > > > > patch also changes the code to call reuseport_select_migrated_sock() even
> > > > > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > > > > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > > > > processing the request.
> > > > > 
> > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > ---
> > > > >  include/net/inet_connection_sock.h | 12 +++++++++++
> > > > >  include/net/request_sock.h         | 13 ++++++++++++
> > > > >  include/net/sock_reuseport.h       |  8 +++----
> > > > >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> > > > >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> > > > >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> > > > >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> > > > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > > > --- a/include/net/inet_connection_sock.h
> > > > > +++ b/include/net/inet_connection_sock.h
> > > > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> > > > >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > > > >  }
> > > > >  
> > > > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > > > +						 struct sock *nsk,
> > > > > +						 struct request_sock *req)
> > > > > +{
> > > > > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > > > +			     &inet_csk(nsk)->icsk_accept_queue,
> > > > > +			     req);
> > > > > +	sock_put(sk);
> > > > not sure if it is safe to do here.
> > > > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > > > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > > > safe because its sk_refcnt is not zero.
> > > 
> > > I think it is safe to call sock_put() for the old listener here.
> > > 
> > > Without this patchset, at receiving the final ACK or retransmitting
> > > SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> > > by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
> > Note that in your example (final ACK), sock_put(req->rsk_listener) is
> > _only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
> > to reach zero.
> > 
> > Here in this patch, it sock_put(req->rsk_listener) without req->rsk_refcnt
> > reaching zero.
> > 
> > Let says there are two cores holding two refcnt to req (one cnt for each core)
> > by looking up the req from ehash.  One of the core do this migrate and
> > sock_put(req->rsk_listener).  Another core does sock_hold(req->rsk_listener).
> > 
> > 	Core1					Core2
> > 						sock_put(req->rsk_listener)
> > 
> > 	sock_hold(req->rsk_listener)
> 
> I'm sorry for the late reply.
> 
> I missed this situation that different Cores get into NEW_SYN_RECV path,
> but this does exist.
> https://lore.kernel.org/netdev/1517977874.3715.153.camel@gmail.com/#t
> https://lore.kernel.org/netdev/1518531252.3715.178.camel@gmail.com/
> 
> 
> If close() is called for the listener and the request has the last refcount
> for it, sock_put() by Core2 frees it, so Core1 cannot proceed with freed
> listener. So, it is good to call refcount_inc_not_zero() instead of
> sock_hold(). If refcount_inc_not_zero() fails, it means that the listener
_inc_not_zero() usually means it requires rcu_read_lock().
That may have rippling effect on other req->rsk_listener readers.

There may also be places assuming that the req->rsk_listener will never
change once it is assigned.  not sure.  have not looked closely yet.

It probably needs some more thoughts here to get a simpler solution.

> is closed and the req->rsk_listener is changed in another place. Then, we
> can continue processing the request by rewriting sk with rsk_listener and
> calling sock_hold() for it.
> 
> Also, the migration by Core2 can be done after sock_hold() by Core1. Then
> if Core1 win the race by removing the request from ehash,
> in inet_csk_reqsk_queue_add(), instead of sk, req->rsk_listener should be
> used as the proper listener to add the req into its queue. But if the
> rsk_listener is also TCP_CLOSE, we have to call inet_child_forget().
> 
> Moreover, we have to check the listener is freed in the beginning of
> reqsk_timer_handler() by refcount_inc_not_zero().
> 
> 
> > > And then, we do `goto lookup;` and overwrite the sk.
> > > 
> > > In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> > > reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> > > listener instead to free it properly.
> > > 
> > > ---8<---
> > > +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> > > +					    struct sk_buff *skb)
> > > +{
> > > +	struct sock *nsk;
> > > +
> > > +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> > > +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
> > There is another potential issue here.  The TCP_LISTEN nsk is protected
> > by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
> > is not under rcu_read_lock().
> > 
> > The receive path may be ok as it is in rcu.  You may need to check for
> > others.
> 
> IIUC, is this mean nsk can be NULL after grace period of RCU? If so, I will
worse than NULL.  an invalid pointer.
 
> move rcu_read_lock/unlock() from __reuseport_select_sock() to
> reuseport_select_sock() and reuseport_select_migrated_sock().
ok.

> 
> 
> > > +		return nsk;
> > > +
> > > +	return NULL;
> > > +}
> > > +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> > > ---8<---
> > > https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
> > > 
> > > 
> > > > > +	sock_hold(nsk);
> > > > > +	req->rsk_listener = nsk;
> > It looks like there is another race here.  What
> > if multiple cores try to update req->rsk_listener?
> 
> I think we have to add a lock in struct request_sock, acquire it, check
> if the rsk_listener is changed or not, and then do migration. Also, if the
> listener has been changed, we have to tell the caller to use it as the new
> listener.
> 
> ---8<---
>        spin_lock(&lock)
>        if (sk != req->rsk_listener) {
>                nsk = req->rsk_listener;
>                goto out;
>        }
> 
>        // do migration
> out:
>        spin_unlock(&lock)
>        return nsk;
> ---8<---
cmpxchg may help here.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-15  2:58           ` Martin KaFai Lau
@ 2020-12-16 16:41             ` Kuniyuki Iwashima
  2020-12-16 22:24               ` Martin KaFai Lau
  0 siblings, 1 reply; 57+ messages in thread
From: Kuniyuki Iwashima @ 2020-12-16 16:41 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Mon, 14 Dec 2020 18:58:37 -0800
> On Tue, Dec 15, 2020 at 02:03:13AM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Thu, 10 Dec 2020 10:49:15 -0800
> > > On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> > > > From:   Martin KaFai Lau <kafai@fb.com>
> > > > Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > > > > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > > > > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > > > > > adds two wrapper function of it to pass the migration type defined in the
> > > > > > previous commit.
> > > > > > 
> > > > > >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> > > > > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > > > > 
> > > > > > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > > > > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > > > > > patch also changes the code to call reuseport_select_migrated_sock() even
> > > > > > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > > > > > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > > > > > processing the request.
> > > > > > 
> > > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > > ---
> > > > > >  include/net/inet_connection_sock.h | 12 +++++++++++
> > > > > >  include/net/request_sock.h         | 13 ++++++++++++
> > > > > >  include/net/sock_reuseport.h       |  8 +++----
> > > > > >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> > > > > >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> > > > > >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> > > > > >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> > > > > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > > > > --- a/include/net/inet_connection_sock.h
> > > > > > +++ b/include/net/inet_connection_sock.h
> > > > > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> > > > > >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > > > > >  }
> > > > > >  
> > > > > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > > > > +						 struct sock *nsk,
> > > > > > +						 struct request_sock *req)
> > > > > > +{
> > > > > > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > > > > +			     &inet_csk(nsk)->icsk_accept_queue,
> > > > > > +			     req);
> > > > > > +	sock_put(sk);
> > > > > not sure if it is safe to do here.
> > > > > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > > > > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > > > > safe because its sk_refcnt is not zero.
> > > > 
> > > > I think it is safe to call sock_put() for the old listener here.
> > > > 
> > > > Without this patchset, at receiving the final ACK or retransmitting
> > > > SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> > > > by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
> > > Note that in your example (final ACK), sock_put(req->rsk_listener) is
> > > _only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
> > > to reach zero.
> > > 
> > > Here in this patch, it sock_put(req->rsk_listener) without req->rsk_refcnt
> > > reaching zero.
> > > 
> > > Let says there are two cores holding two refcnt to req (one cnt for each core)
> > > by looking up the req from ehash.  One of the core do this migrate and
> > > sock_put(req->rsk_listener).  Another core does sock_hold(req->rsk_listener).
> > > 
> > > 	Core1					Core2
> > > 						sock_put(req->rsk_listener)
> > > 
> > > 	sock_hold(req->rsk_listener)
> > 
> > I'm sorry for the late reply.
> > 
> > I missed this situation that different Cores get into NEW_SYN_RECV path,
> > but this does exist.
> > https://lore.kernel.org/netdev/1517977874.3715.153.camel@gmail.com/#t
> > https://lore.kernel.org/netdev/1518531252.3715.178.camel@gmail.com/
> > 
> > 
> > If close() is called for the listener and the request has the last refcount
> > for it, sock_put() by Core2 frees it, so Core1 cannot proceed with freed
> > listener. So, it is good to call refcount_inc_not_zero() instead of
> > sock_hold(). If refcount_inc_not_zero() fails, it means that the listener
> _inc_not_zero() usually means it requires rcu_read_lock().
> That may have rippling effect on other req->rsk_listener readers.
> 
> There may also be places assuming that the req->rsk_listener will never
> change once it is assigned.  not sure.  have not looked closely yet.

I have checked this again. There are no functions that expect explicitly
req->rsk_listener never change except for BUG_ON in inet_child_forget().
No BUG_ON/WARN_ON does not mean they does not assume listener never
change, but such functions still work properly if rsk_listener is changed.


> It probably needs some more thoughts here to get a simpler solution.

Is it fine to move sock_hold() before assigning rsk_listener and defer
sock_put() to the end of tcp_v[46]_rcv() ?

Also, we have to rewrite rsk_listener first and then call sock_put() in
reqsk_timer_handler() so that rsk_listener always has refcount more than 1.

---8<---
	struct sock *nsk, *osk;
	bool migrated = false;
	...
	sock_hold(req->rsk_listener);  // (i)
	sk = req->rsk_listener;
	...
	if (sk->sk_state == TCP_CLOSE) {
		osk = sk;
		// do migration without sock_put()
		sock_hold(nsk);  // (ii) (as with (i))
		sk = nsk;
		migrated = true;
	}
	...
	if (migrated) {
		sock_put(sk);  // pair with (ii)
		sock_put(osk); // decrement old listener's refcount
		sk = osk;
	}
	sock_put(sk);  // pair with (i)
---8<---


> > is closed and the req->rsk_listener is changed in another place. Then, we
> > can continue processing the request by rewriting sk with rsk_listener and
> > calling sock_hold() for it.
> > 
> > Also, the migration by Core2 can be done after sock_hold() by Core1. Then
> > if Core1 win the race by removing the request from ehash,
> > in inet_csk_reqsk_queue_add(), instead of sk, req->rsk_listener should be
> > used as the proper listener to add the req into its queue. But if the
> > rsk_listener is also TCP_CLOSE, we have to call inet_child_forget().
> > 
> > Moreover, we have to check the listener is freed in the beginning of
> > reqsk_timer_handler() by refcount_inc_not_zero().
> > 
> > 
> > > > And then, we do `goto lookup;` and overwrite the sk.
> > > > 
> > > > In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> > > > reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> > > > listener instead to free it properly.
> > > > 
> > > > ---8<---
> > > > +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> > > > +					    struct sk_buff *skb)
> > > > +{
> > > > +	struct sock *nsk;
> > > > +
> > > > +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> > > > +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
> > > There is another potential issue here.  The TCP_LISTEN nsk is protected
> > > by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
> > > is not under rcu_read_lock().
> > > 
> > > The receive path may be ok as it is in rcu.  You may need to check for
> > > others.
> > 
> > IIUC, is this mean nsk can be NULL after grace period of RCU? If so, I will
> worse than NULL.  an invalid pointer.
>  
> > move rcu_read_lock/unlock() from __reuseport_select_sock() to
> > reuseport_select_sock() and reuseport_select_migrated_sock().
> ok.
> 
> > 
> > 
> > > > +		return nsk;
> > > > +
> > > > +	return NULL;
> > > > +}
> > > > +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> > > > ---8<---
> > > > https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
> > > > 
> > > > 
> > > > > > +	sock_hold(nsk);
> > > > > > +	req->rsk_listener = nsk;
> > > It looks like there is another race here.  What
> > > if multiple cores try to update req->rsk_listener?
> > 
> > I think we have to add a lock in struct request_sock, acquire it, check
> > if the rsk_listener is changed or not, and then do migration. Also, if the
> > listener has been changed, we have to tell the caller to use it as the new
> > listener.
> > 
> > ---8<---
> >        spin_lock(&lock)
> >        if (sk != req->rsk_listener) {
> >                nsk = req->rsk_listener;
> >                goto out;
> >        }
> > 
> >        // do migration
> > out:
> >        spin_unlock(&lock)
> >        return nsk;
> > ---8<---
> cmpxchg may help here.

Thank you, I will use cmpxchg() to rewrite rsk_listener atomically and
check if req->rsk_listener is updated.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
  2020-12-16 16:41             ` Kuniyuki Iwashima
@ 2020-12-16 22:24               ` Martin KaFai Lau
  0 siblings, 0 replies; 57+ messages in thread
From: Martin KaFai Lau @ 2020-12-16 22:24 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840,
	linux-kernel, netdev

On Thu, Dec 17, 2020 at 01:41:58AM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> > There may also be places assuming that the req->rsk_listener will never
> > change once it is assigned.  not sure.  have not looked closely yet.
> 
> I have checked this again. There are no functions that expect explicitly
> req->rsk_listener never change except for BUG_ON in inet_child_forget().
> No BUG_ON/WARN_ON does not mean they does not assume listener never
> change, but such functions still work properly if rsk_listener is changed.
The migration not only changes the ptr value of req->rsk_listener, it also
means req is moved to another listener. (e.g. by updating the qlen of
the old sk and new sk)

Lets reuse the example about two cores at the TCP_NEW_SYN_RECV path
racing to finish up the 3WHS.

One core is already at inet_csk_complete_hashdance() doing
"reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req))".
What happen if another core migrates the req to another listener?
Would the "reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req))"
doing thing on the accept_queue that this req no longer belongs to?

Also, from a quick look at reqsk_timer_handler() on how
queue->young and req->num_timeout are updated, I am not sure
the reqsk_queue_migrated() will work also:

+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+					struct request_sock_queue *new_accept_queue,
+					const struct request_sock *req)
+{
+	atomic_dec(&old_accept_queue->qlen);
+	atomic_inc(&new_accept_queue->qlen);
+
+	if (req->num_timeout == 0) {
What if reqsk_timer_handler() is running in parallel
and updating req->num_timeout?

+		atomic_dec(&old_accept_queue->young);
+		atomic_inc(&new_accept_queue->young);
+	}
+}


It feels like some of the "own_req" related logic may be useful here.
not sure.  could be something worth to think about.

> 
> 
> > It probably needs some more thoughts here to get a simpler solution.
> 
> Is it fine to move sock_hold() before assigning rsk_listener and defer
> sock_put() to the end of tcp_v[46]_rcv() ?
I don't see how this ordering helps, considering the migration can happen
any time at another core.

> 
> Also, we have to rewrite rsk_listener first and then call sock_put() in
> reqsk_timer_handler() so that rsk_listener always has refcount more than 1.
> 
> ---8<---
> 	struct sock *nsk, *osk;
> 	bool migrated = false;
> 	...
> 	sock_hold(req->rsk_listener);  // (i)
> 	sk = req->rsk_listener;
> 	...
> 	if (sk->sk_state == TCP_CLOSE) {
> 		osk = sk;
> 		// do migration without sock_put()
> 		sock_hold(nsk);  // (ii) (as with (i))
> 		sk = nsk;
> 		migrated = true;
> 	}
> 	...
> 	if (migrated) {
> 		sock_put(sk);  // pair with (ii)
> 		sock_put(osk); // decrement old listener's refcount
> 		sk = osk;
> 	}
> 	sock_put(sk);  // pair with (i)
> ---8<---

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2020-12-16 22:25 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2020-12-05  1:31   ` Martin KaFai Lau
2020-12-06  4:38     ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 02/11] bpf: Define migration types for SO_REUSEPORT Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2020-12-01 15:25   ` Eric Dumazet
2020-12-03 14:14     ` Kuniyuki Iwashima
2020-12-03 14:31       ` Eric Dumazet
2020-12-03 15:41         ` Kuniyuki Iwashima
2020-12-07 20:33       ` Martin KaFai Lau
2020-12-08  6:31         ` Kuniyuki Iwashima
2020-12-08  7:34           ` Martin KaFai Lau
2020-12-08  8:17             ` Kuniyuki Iwashima
2020-12-09  3:09               ` Martin KaFai Lau
2020-12-09  8:05                 ` Kuniyuki Iwashima
2020-12-09 16:57                   ` Kuniyuki Iwashima
2020-12-10  1:53                     ` Martin KaFai Lau
2020-12-10  5:58                       ` Kuniyuki Iwashima
2020-12-10 19:33                         ` Martin KaFai Lau
2020-12-14 17:16                           ` Kuniyuki Iwashima
2020-12-05  1:42   ` Martin KaFai Lau
2020-12-06  4:41     ` Kuniyuki Iwashima
     [not found]     ` <20201205160307.91179-1-kuniyu@amazon.co.jp>
2020-12-07 20:14       ` Martin KaFai Lau
2020-12-08  6:27         ` Kuniyuki Iwashima
2020-12-08  8:13           ` Martin KaFai Lau
2020-12-08  9:02             ` Kuniyuki Iwashima
2020-12-08  6:54   ` Martin KaFai Lau
2020-12-08  7:42     ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
2020-12-01 15:30   ` Eric Dumazet
2020-12-01 14:44 ` [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
2020-12-01 15:13   ` Eric Dumazet
2020-12-03 14:12     ` Kuniyuki Iwashima
2020-12-10  0:07   ` Martin KaFai Lau
2020-12-10  5:15     ` Kuniyuki Iwashima
2020-12-10 18:49       ` Martin KaFai Lau
2020-12-14 17:03         ` Kuniyuki Iwashima
2020-12-15  2:58           ` Martin KaFai Lau
2020-12-16 16:41             ` Kuniyuki Iwashima
2020-12-16 22:24               ` Martin KaFai Lau
2020-12-01 14:44 ` [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2020-12-02  2:04   ` Andrii Nakryiko
2020-12-02 19:19     ` Martin KaFai Lau
2020-12-03  4:24       ` Martin KaFai Lau
2020-12-03 14:16         ` Kuniyuki Iwashima
2020-12-04  5:56           ` Martin KaFai Lau
2020-12-06  4:32             ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 07/11] libbpf: Set expected_attach_type " Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 08/11] bpf: Add migration to sk_reuseport_(kern|md) Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2020-12-04 19:58   ` Martin KaFai Lau
2020-12-06  4:36     ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 10/11] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
2020-12-05  1:50   ` Martin KaFai Lau
2020-12-06  4:43     ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).