All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
@ 2022-06-02 16:50 Joanne Koong
  2022-06-02 16:51 ` [PATCH net-next v2 1/2] net: " Joanne Koong
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Joanne Koong @ 2022-06-02 16:50 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, kuba, davem, pabeni, Joanne Koong

As syzbot noted [1], there is an inconsistency in the bhash2 table in the
case where a socket's rcv saddr changes after it is binded. (For more
details, please see the commit message of the first patch)

This patchset fixes that and adds a test that triggers the case where the
sk's rcv saddr changes. The subsequent listen() call should succeed.

[1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/

--
v1 -> v2:
v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
* Mark __inet_bhash2_update_saddr as static

Joanne Koong (2):
  net: Update bhash2 when socket's rcv saddr changes
  selftests/net: Add sk_bind_sendto_listen test

 include/net/inet_hashtables.h                 |  6 +-
 include/net/ipv6.h                            |  2 +-
 net/dccp/ipv4.c                               | 10 ++-
 net/dccp/ipv6.c                               |  4 +-
 net/ipv4/af_inet.c                            |  7 +-
 net/ipv4/inet_hashtables.c                    | 70 ++++++++++++++--
 net/ipv4/tcp_ipv4.c                           |  8 +-
 net/ipv6/inet6_hashtables.c                   |  4 +-
 net/ipv6/tcp_ipv6.c                           |  4 +-
 tools/testing/selftests/net/.gitignore        |  1 +
 tools/testing/selftests/net/Makefile          |  1 +
 .../selftests/net/sk_bind_sendto_listen.c     | 82 +++++++++++++++++++
 12 files changed, 181 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/selftests/net/sk_bind_sendto_listen.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
  2022-06-02 16:50 [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Joanne Koong
@ 2022-06-02 16:51 ` Joanne Koong
  2022-06-07  8:33   ` Paolo Abeni
  2022-06-02 16:51 ` [PATCH net-next v2 2/2] selftests/net: Add sk_bind_sendto_listen test Joanne Koong
  2022-06-03 18:54 ` [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Mat Martineau
  2 siblings, 1 reply; 18+ messages in thread
From: Joanne Koong @ 2022-06-02 16:51 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kafai, kuba, davem, pabeni, Joanne Koong,
	syzbot+015d756bbd1f8b5c8f09

Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
address") added a second bind table, bhash2, that hashes by a socket's port
and rcv address.

However, there are two cases where the socket's rcv saddr can change
after it has been binded:

1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
a connect() call. The kernel will assign the socket an address when it
handles the connect()

2) In inet_sk_reselect_saddr(), which is called when rerouting fails
when rebuilding the sk header (invoked by inet_sk_rebuild_header)

In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and adding a new entry reflecting the updated
address.

Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 include/net/inet_hashtables.h |  6 ++-
 include/net/ipv6.h            |  2 +-
 net/dccp/ipv4.c               | 10 +++--
 net/dccp/ipv6.c               |  4 +-
 net/ipv4/af_inet.c            |  7 +++-
 net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c           |  8 +++-
 net/ipv6/inet6_hashtables.c   |  4 +-
 net/ipv6/tcp_ipv6.c           |  4 +-
 9 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a0887b70967b..2c331ce6ca73 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
 }
 
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
-			struct sock *sk, u64 port_offset,
+			struct sock *sk, u64 port_offset, bool prev_inaddr_any,
 			int (*check_established)(struct inet_timewait_death_row *,
 						 struct sock *, __u16,
 						 struct inet_timewait_sock **));
 
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
-		      struct sock *sk);
+		      struct sock *sk, bool prev_inaddr_any);
+
+int inet_bhash2_update_saddr(struct sock *sk);
 #endif /* _INET_HASHTABLES_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 5b38bf1a586b..6a50aca56d50 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
 		unsigned long arg);
 
 int inet6_hash_connect(struct inet_timewait_death_row *death_row,
-			      struct sock *sk);
+		       struct sock *sk, bool prev_inaddr_any);
 int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
 int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		  int flags);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index da6e3b20cd75..37a8bc3ee49e 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
+	struct ip_options_rcu *inet_opt;
 	__be16 orig_sport, orig_dport;
+	bool prev_inaddr_any = false;
 	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	int err;
-	struct ip_options_rcu *inet_opt;
 
 	dp->dccps_role = DCCP_ROLE_CLIENT;
 
@@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (inet_opt == NULL || !inet_opt->opt.srr)
 		daddr = fl4->daddr;
 
-	if (inet->inet_saddr == 0)
+	if (inet->inet_saddr == 0) {
 		inet->inet_saddr = fl4->saddr;
+		prev_inaddr_any = true;
+	}
+
 	sk_rcv_saddr_set(sk, inet->inet_saddr);
 	inet->inet_dport = usin->sin_port;
 	sk_daddr_set(sk, daddr);
@@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	 * complete initialization after this.
 	 */
 	dccp_set_state(sk, DCCP_REQUESTING);
-	err = inet_hash_connect(&dccp_death_row, sk);
+	err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
 	if (err != 0)
 		goto failure;
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index fd44638ec16b..03013522acab 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct in6_addr *saddr = NULL, *final_p, final;
+	bool prev_inaddr_any = false;
 	struct ipv6_txoptions *opt;
 	struct flowi6 fl6;
 	struct dst_entry *dst;
@@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (saddr == NULL) {
 		saddr = &fl6.saddr;
 		sk->sk_v6_rcv_saddr = *saddr;
+		prev_inaddr_any = true;
 	}
 
 	/* set the source address */
@@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	inet->inet_dport = usin->sin6_port;
 
 	dccp_set_state(sk, DCCP_REQUESTING);
-	err = inet6_hash_connect(&dccp_death_row, sk);
+	err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
 	if (err)
 		goto late_failure;
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 93da9f783bec..ad627a99ff9d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);
 	__be32 old_saddr = inet->inet_saddr;
 	__be32 daddr = inet->inet_daddr;
+	struct ip_options_rcu *inet_opt;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	__be32 new_saddr;
-	struct ip_options_rcu *inet_opt;
+	int err;
 
 	inet_opt = rcu_dereference_protected(inet->inet_opt,
 					     lockdep_sock_is_held(sk));
@@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 
 	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
 
+	err = inet_bhash2_update_saddr(sk);
+	if (err)
+		return err;
+
 	/*
 	 * XXX The only one ugly spot where we need to
 	 * XXX really change the sockets identity after
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e8de5e699b3f..592b70663a3b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
 	return bhash2;
 }
 
+/* the lock for the socket's corresponding bhash entry must be held */
+static int __inet_bhash2_update_saddr(struct sock *sk,
+				      struct inet_hashinfo *hinfo,
+				      struct net *net, int port, int l3mdev)
+{
+	struct inet_bind2_hashbucket *head2;
+	struct inet_bind2_bucket *tb2;
+
+	tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
+				     &head2);
+	if (!tb2) {
+		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
+					       net, head2, port, l3mdev, sk);
+		if (!tb2)
+			return -ENOMEM;
+	}
+
+	/* Remove the socket's old entry from bhash2 */
+	__sk_del_bind2_node(sk);
+
+	sk_add_bind2_node(sk, &tb2->owners);
+	inet_csk(sk)->icsk_bind2_hash = tb2;
+
+	return 0;
+}
+
+/* This should be called if/when a socket's rcv saddr changes after it has
+ * been binded.
+ */
+int inet_bhash2_update_saddr(struct sock *sk)
+{
+	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	int l3mdev = inet_sk_bound_l3mdev(sk);
+	struct inet_bind_hashbucket *head;
+	int port = inet_sk(sk)->inet_num;
+	struct net *net = sock_net(sk);
+	int err;
+
+	head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
+
+	spin_lock_bh(&head->lock);
+
+	err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
+
+	spin_unlock_bh(&head->lock);
+
+	return err;
+}
+
 /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
  * Note that we use 32bit integers (vs RFC 'short integers')
  * because 2^16 is not a multiple of num_ephemeral and this
@@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
 static u32 *table_perturb;
 
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
-		struct sock *sk, u64 port_offset,
+		struct sock *sk, u64 port_offset, bool prev_inaddr_any,
 		int (*check_established)(struct inet_timewait_death_row *,
 			struct sock *, __u16, struct inet_timewait_sock **))
 {
@@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	int l3mdev;
 	u32 index;
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
+
 	if (port) {
 		head = &hinfo->bhash[inet_bhashfn(net, port,
 						  hinfo->bhash_size)];
 		tb = inet_csk(sk)->icsk_bind_hash;
+
 		spin_lock_bh(&head->lock);
+
+		if (prev_inaddr_any) {
+			ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
+							 l3mdev);
+			if (ret) {
+				spin_unlock_bh(&head->lock);
+				return ret;
+			}
+		}
+
 		if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
 			inet_ehash_nolisten(sk, NULL, NULL);
 			spin_unlock_bh(&head->lock);
@@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		return ret;
 	}
 
-	l3mdev = inet_sk_bound_l3mdev(sk);
-
 	inet_get_local_port_range(net, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
@@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
  * Bind a port for a connect operation and hash it.
  */
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
-		      struct sock *sk)
+		      struct sock *sk, bool prev_inaddr_any)
 {
 	u64 port_offset = 0;
 
 	if (!inet_sk(sk)->inet_num)
 		port_offset = inet_sk_port_offset(sk);
-	return __inet_hash_connect(death_row, sk, port_offset,
+	return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
 				   __inet_check_established);
 }
 EXPORT_SYMBOL_GPL(inet_hash_connect);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index dac2650f3863..adf8d750933d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct inet_sock *inet = inet_sk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	__be16 orig_sport, orig_dport;
+	bool prev_inaddr_any = false;
 	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
@@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (!inet_opt || !inet_opt->opt.srr)
 		daddr = fl4->daddr;
 
-	if (!inet->inet_saddr)
+	if (!inet->inet_saddr) {
 		inet->inet_saddr = fl4->saddr;
+		prev_inaddr_any = true;
+	}
+
 	sk_rcv_saddr_set(sk, inet->inet_saddr);
 
 	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
@@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	 * complete initialization after this.
 	 */
 	tcp_set_state(sk, TCP_SYN_SENT);
-	err = inet_hash_connect(tcp_death_row, sk);
+	err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
 	if (err)
 		goto failure;
 
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 7d53d62783b1..c87c5933f3be 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
 }
 
 int inet6_hash_connect(struct inet_timewait_death_row *death_row,
-		       struct sock *sk)
+		       struct sock *sk, bool prev_inaddr_any)
 {
 	u64 port_offset = 0;
 
 	if (!inet_sk(sk)->inet_num)
 		port_offset = inet6_sk_port_offset(sk);
-	return __inet_hash_connect(death_row, sk, port_offset,
+	return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
 				   __inet6_check_established);
 }
 EXPORT_SYMBOL_GPL(inet6_hash_connect);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f37dd4aa91c6..81e3312c2a97 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	struct ipv6_pinfo *np = tcp_inet6_sk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct in6_addr *saddr = NULL, *final_p, final;
+	bool prev_inaddr_any = false;
 	struct ipv6_txoptions *opt;
 	struct flowi6 fl6;
 	struct dst_entry *dst;
@@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (!saddr) {
 		saddr = &fl6.saddr;
 		sk->sk_v6_rcv_saddr = *saddr;
+		prev_inaddr_any = true;
 	}
 
 	/* set the source address */
@@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 
 	tcp_set_state(sk, TCP_SYN_SENT);
 	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
-	err = inet6_hash_connect(tcp_death_row, sk);
+	err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
 	if (err)
 		goto late_failure;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 2/2] selftests/net: Add sk_bind_sendto_listen test
  2022-06-02 16:50 [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Joanne Koong
  2022-06-02 16:51 ` [PATCH net-next v2 1/2] net: " Joanne Koong
@ 2022-06-02 16:51 ` Joanne Koong
  2022-06-03 18:54 ` [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Mat Martineau
  2 siblings, 0 replies; 18+ messages in thread
From: Joanne Koong @ 2022-06-02 16:51 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, kuba, davem, pabeni, Joanne Koong

This patch adds a new test called sk_bind_sendto_listen.

This test exercises the path where a socket's rcv saddr changes after it
has been added to the binding tables, and then a listen() on the socket
is invoked. The listen() should succeed.

This test is copied over from one of syzbot's tests:
https://syzkaller.appspot.com/x/repro.c?x=1673a38df00000

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 tools/testing/selftests/net/.gitignore        |  1 +
 tools/testing/selftests/net/Makefile          |  1 +
 .../selftests/net/sk_bind_sendto_listen.c     | 82 +++++++++++++++++++
 3 files changed, 84 insertions(+)
 create mode 100644 tools/testing/selftests/net/sk_bind_sendto_listen.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index b984f8c8d523..69f1a2aafde4 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -38,3 +38,4 @@ ioam6_parser
 toeplitz
 cmsg_sender
 bind_bhash_test
+sk_bind_sendto_listen
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 464df13831f2..45f4f57bf1f4 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -60,6 +60,7 @@ TEST_GEN_FILES += cmsg_sender
 TEST_GEN_FILES += stress_reuseport_listen
 TEST_PROGS += test_vxlan_vnifiltering.sh
 TEST_GEN_FILES += bind_bhash_test
+TEST_GEN_FILES += sk_bind_sendto_listen
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/sk_bind_sendto_listen.c b/tools/testing/selftests/net/sk_bind_sendto_listen.c
new file mode 100644
index 000000000000..3e09c5631997
--- /dev/null
+++ b/tools/testing/selftests/net/sk_bind_sendto_listen.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <unistd.h>
+
+int main(void)
+{
+	int fd1, fd2, one = 1;
+	struct sockaddr_in6 bind_addr = {
+		.sin6_family = AF_INET6,
+		.sin6_port = htons(20000),
+		.sin6_flowinfo = htonl(0),
+		.sin6_addr = {},
+		.sin6_scope_id = 0,
+	};
+
+	inet_pton(AF_INET6, "::", &bind_addr.sin6_addr);
+
+	fd1 = socket(AF_INET6, SOCK_STREAM, IPPROTO_IP);
+	if (fd1 < 0) {
+		error(1, errno, "socket fd1");
+		return -1;
+	}
+
+	if (setsockopt(fd1, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one))) {
+		error(1, errno, "setsockopt(SO_REUSEADDR) fd1");
+		goto out_err1;
+	}
+
+	if (bind(fd1, (struct sockaddr *)&bind_addr, sizeof(bind_addr))) {
+		error(1, errno, "bind fd1");
+		goto out_err1;
+	}
+
+	if (sendto(fd1, NULL, 0, MSG_FASTOPEN, (struct sockaddr *)&bind_addr,
+		   sizeof(bind_addr))) {
+		error(1, errno, "sendto fd1");
+		goto out_err1;
+	}
+
+	fd2 = socket(AF_INET6, SOCK_STREAM, IPPROTO_IP);
+	if (fd2 < 0) {
+		error(1, errno, "socket fd2");
+		goto out_err1;
+	}
+
+	if (setsockopt(fd2, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one))) {
+		error(1, errno, "setsockopt(SO_REUSEADDR) fd2");
+		goto out_err2;
+	}
+
+	if (bind(fd2, (struct sockaddr *)&bind_addr, sizeof(bind_addr))) {
+		error(1, errno, "bind fd2");
+		goto out_err2;
+	}
+
+	if (sendto(fd2, NULL, 0, MSG_FASTOPEN, (struct sockaddr *)&bind_addr,
+		   sizeof(bind_addr)) != -1) {
+		error(1, errno, "sendto fd2");
+		goto out_err2;
+	}
+
+	if (listen(fd2, 0)) {
+		error(1, errno, "listen");
+		goto out_err2;
+	}
+
+	close(fd2);
+	close(fd1);
+	return 0;
+
+out_err2:
+	close(fd2);
+
+out_err1:
+	close(fd1);
+	return -1;
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-02 16:50 [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Joanne Koong
  2022-06-02 16:51 ` [PATCH net-next v2 1/2] net: " Joanne Koong
  2022-06-02 16:51 ` [PATCH net-next v2 2/2] selftests/net: Add sk_bind_sendto_listen test Joanne Koong
@ 2022-06-03 18:54 ` Mat Martineau
  2022-06-04  0:38   ` Joanne Koong
  2 siblings, 1 reply; 18+ messages in thread
From: Mat Martineau @ 2022-06-03 18:54 UTC (permalink / raw)
  To: Joanne Koong; +Cc: netdev, edumazet, kafai, kuba, davem, pabeni

On Thu, 2 Jun 2022, Joanne Koong wrote:

> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
> case where a socket's rcv saddr changes after it is binded. (For more
> details, please see the commit message of the first patch)
>
> This patchset fixes that and adds a test that triggers the case where the
> sk's rcv saddr changes. The subsequent listen() call should succeed.
>
> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
>
> --
> v1 -> v2:
> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
> * Mark __inet_bhash2_update_saddr as static
>
> Joanne Koong (2):
>  net: Update bhash2 when socket's rcv saddr changes
>  selftests/net: Add sk_bind_sendto_listen test
>

Hi Joanne -

I've been running my own syzkaller instance with v1 of this fix for a 
couple of days. Before this patch, syzkaller would trigger the 
inet_csk_get_port warning a couple of times per hour. After this patch it 
took two days to show the warning:

------------[ cut here ]------------

WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525 
inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
Modules linked in:
CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
  <TASK>
  inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
  inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
  mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
  __sys_listen+0x189/0x260 net/socket.c:1810
  __do_sys_listen net/socket.c:1819 [inline]
  __se_sys_listen net/socket.c:1817 [inline]
  __x64_sys_listen+0x54/0x80 net/socket.c:1817
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fada558f92d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
  </TASK>
irq event stamp: 2078
hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
---[ end trace 0000000000000000 ]---


In the full log file it does look like syskaller is doing something 
strange since it's calling bind, connect, and listen on the same socket:

r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
listen(r4, 0x0)

...but it is a fuzz tester after all.

I've uploaded the full syzkaller logs to this GitHub issue:

https://github.com/multipath-tcp/mptcp_net-next/issues/279


Not sure yet if this is MPTCP-related. I don't think MPTCP 
changes anything with the subflow TCP socket bind hashes.


--
Mat Martineau
Intel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-03 18:54 ` [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Mat Martineau
@ 2022-06-04  0:38   ` Joanne Koong
  2022-06-07  0:02     ` Joanne Koong
  0 siblings, 1 reply; 18+ messages in thread
From: Joanne Koong @ 2022-06-04  0:38 UTC (permalink / raw)
  To: Mat Martineau
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
<mathew.j.martineau@linux.intel.com> wrote:
>
> On Thu, 2 Jun 2022, Joanne Koong wrote:
>
> > As syzbot noted [1], there is an inconsistency in the bhash2 table in the
> > case where a socket's rcv saddr changes after it is binded. (For more
> > details, please see the commit message of the first patch)
> >
> > This patchset fixes that and adds a test that triggers the case where the
> > sk's rcv saddr changes. The subsequent listen() call should succeed.
> >
> > [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
> >
> > --
> > v1 -> v2:
> > v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
> > * Mark __inet_bhash2_update_saddr as static
> >
> > Joanne Koong (2):
> >  net: Update bhash2 when socket's rcv saddr changes
> >  selftests/net: Add sk_bind_sendto_listen test
> >
>
> Hi Joanne -
>
> I've been running my own syzkaller instance with v1 of this fix for a
> couple of days. Before this patch, syzkaller would trigger the
> inet_csk_get_port warning a couple of times per hour. After this patch it
> took two days to show the warning:
>
> ------------[ cut here ]------------
>
> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> Modules linked in:
> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
>   <TASK>
>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
>   __sys_listen+0x189/0x260 net/socket.c:1810
>   __do_sys_listen net/socket.c:1819 [inline]
>   __se_sys_listen net/socket.c:1817 [inline]
>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fada558f92d
> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
>   </TASK>
> irq event stamp: 2078
> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
> ---[ end trace 0000000000000000 ]---
>
>
> In the full log file it does look like syskaller is doing something
> strange since it's calling bind, connect, and listen on the same socket:
>
> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
> listen(r4, 0x0)
>
> ...but it is a fuzz tester after all.
>
> I've uploaded the full syzkaller logs to this GitHub issue:
>
> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>
>
> Not sure yet if this is MPTCP-related. I don't think MPTCP
> changes anything with the subflow TCP socket bind hashes.
>
Hi Mat,

Thanks for bringing this up and for uploading the logs. I will look into this.
>
> --
> Mat Martineau
> Intel
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-04  0:38   ` Joanne Koong
@ 2022-06-07  0:02     ` Joanne Koong
  2022-06-07 17:31       ` Mat Martineau
  0 siblings, 1 reply; 18+ messages in thread
From: Joanne Koong @ 2022-06-07  0:02 UTC (permalink / raw)
  To: Mat Martineau
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
> <mathew.j.martineau@linux.intel.com> wrote:
> >
> > On Thu, 2 Jun 2022, Joanne Koong wrote:
> >
> > > As syzbot noted [1], there is an inconsistency in the bhash2 table in the
> > > case where a socket's rcv saddr changes after it is binded. (For more
> > > details, please see the commit message of the first patch)
> > >
> > > This patchset fixes that and adds a test that triggers the case where the
> > > sk's rcv saddr changes. The subsequent listen() call should succeed.
> > >
> > > [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
> > >
> > > --
> > > v1 -> v2:
> > > v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
> > > * Mark __inet_bhash2_update_saddr as static
> > >
> > > Joanne Koong (2):
> > >  net: Update bhash2 when socket's rcv saddr changes
> > >  selftests/net: Add sk_bind_sendto_listen test
> > >
> >
> > Hi Joanne -
> >
> > I've been running my own syzkaller instance with v1 of this fix for a
> > couple of days. Before this patch, syzkaller would trigger the
> > inet_csk_get_port warning a couple of times per hour. After this patch it
> > took two days to show the warning:
> >
> > ------------[ cut here ]------------
> >
> > WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
> > inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> > Modules linked in:
> > CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> > Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
> > RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
> > RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
> > RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
> > RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
> > R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
> > R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
> > FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
> > PKRU: 55555554
> > Call Trace:
> >   <TASK>
> >   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
> >   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
> >   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
> >   __sys_listen+0x189/0x260 net/socket.c:1810
> >   __do_sys_listen net/socket.c:1819 [inline]
> >   __se_sys_listen net/socket.c:1817 [inline]
> >   __x64_sys_listen+0x54/0x80 net/socket.c:1817
> >   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
> >   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > RIP: 0033:0x7fada558f92d
> > Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
> > RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
> > RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
> >   </TASK>
> > irq event stamp: 2078
> > hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
> > hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
> > softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
> > softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
> > softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
> > softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
> > ---[ end trace 0000000000000000 ]---
> >
> >
> > In the full log file it does look like syskaller is doing something
> > strange since it's calling bind, connect, and listen on the same socket:
> >
> > r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
> > bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
> > connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
> > listen(r4, 0x0)
> >
> > ...but it is a fuzz tester after all.
> >
> > I've uploaded the full syzkaller logs to this GitHub issue:
> >
> > https://github.com/multipath-tcp/mptcp_net-next/issues/279
> >
> >
> > Not sure yet if this is MPTCP-related. I don't think MPTCP
> > changes anything with the subflow TCP socket bind hashes.
> >
> Hi Mat,
>
> Thanks for bringing this up and for uploading the logs. I will look into this.
> >
Hi Mat,

I am still trying to configure my local environment for mptcp to repro
+ test the fix to verify that it is correct. I think the fix is to add
"inet_bhash2_update_saddr(msk);" to the end of the
"mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
able to run an instance on your local syzkaller environment with this
line added to see if that fixes the warning?

Thanks,
Joanne
> > --
> > Mat Martineau
> > Intel
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
  2022-06-02 16:51 ` [PATCH net-next v2 1/2] net: " Joanne Koong
@ 2022-06-07  8:33   ` Paolo Abeni
  2022-06-07 17:10     ` Mat Martineau
  2022-06-07 20:24     ` Joanne Koong
  0 siblings, 2 replies; 18+ messages in thread
From: Paolo Abeni @ 2022-06-07  8:33 UTC (permalink / raw)
  To: Joanne Koong, netdev
  Cc: edumazet, kafai, kuba, davem, syzbot+015d756bbd1f8b5c8f09, Mat Martineau

Hello,

On Thu, 2022-06-02 at 09:51 -0700, Joanne Koong wrote:
> Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
> address") added a second bind table, bhash2, that hashes by a socket's port
> and rcv address.
> 
> However, there are two cases where the socket's rcv saddr can change
> after it has been binded:
> 
> 1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
> a connect() call. The kernel will assign the socket an address when it
> handles the connect()
> 
> 2) In inet_sk_reselect_saddr(), which is called when rerouting fails
> when rebuilding the sk header (invoked by inet_sk_rebuild_header)
> 
> In these two cases, we need to update the bhash2 table by removing the
> entry for the old address, and adding a new entry reflecting the updated
> address.
> 
> Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
> Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/inet_hashtables.h |  6 ++-
>  include/net/ipv6.h            |  2 +-
>  net/dccp/ipv4.c               | 10 +++--
>  net/dccp/ipv6.c               |  4 +-
>  net/ipv4/af_inet.c            |  7 +++-
>  net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
>  net/ipv4/tcp_ipv4.c           |  8 +++-
>  net/ipv6/inet6_hashtables.c   |  4 +-
>  net/ipv6/tcp_ipv6.c           |  4 +-
>  9 files changed, 97 insertions(+), 18 deletions(-)
> 
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index a0887b70967b..2c331ce6ca73 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
>  }
>  
>  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> -			struct sock *sk, u64 port_offset,
> +			struct sock *sk, u64 port_offset, bool prev_inaddr_any,
>  			int (*check_established)(struct inet_timewait_death_row *,
>  						 struct sock *, __u16,
>  						 struct inet_timewait_sock **));
>  
>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> -		      struct sock *sk);
> +		      struct sock *sk, bool prev_inaddr_any);
> +
> +int inet_bhash2_update_saddr(struct sock *sk);
>  #endif /* _INET_HASHTABLES_H */
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 5b38bf1a586b..6a50aca56d50 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
>  		unsigned long arg);
>  
>  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> -			      struct sock *sk);
> +		       struct sock *sk, bool prev_inaddr_any);
>  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
>  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>  		  int flags);
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index da6e3b20cd75..37a8bc3ee49e 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct dccp_sock *dp = dccp_sk(sk);
> +	struct ip_options_rcu *inet_opt;
>  	__be16 orig_sport, orig_dport;
> +	bool prev_inaddr_any = false;
>  	__be32 daddr, nexthop;
>  	struct flowi4 *fl4;
>  	struct rtable *rt;
>  	int err;
> -	struct ip_options_rcu *inet_opt;
>  
>  	dp->dccps_role = DCCP_ROLE_CLIENT;
>  
> @@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	if (inet_opt == NULL || !inet_opt->opt.srr)
>  		daddr = fl4->daddr;
>  
> -	if (inet->inet_saddr == 0)
> +	if (inet->inet_saddr == 0) {
>  		inet->inet_saddr = fl4->saddr;
> +		prev_inaddr_any = true;
> +	}
> +
>  	sk_rcv_saddr_set(sk, inet->inet_saddr);
>  	inet->inet_dport = usin->sin_port;
>  	sk_daddr_set(sk, daddr);
> @@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	 * complete initialization after this.
>  	 */
>  	dccp_set_state(sk, DCCP_REQUESTING);
> -	err = inet_hash_connect(&dccp_death_row, sk);
> +	err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
>  	if (err != 0)
>  		goto failure;
>  
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index fd44638ec16b..03013522acab 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  	struct ipv6_pinfo *np = inet6_sk(sk);
>  	struct dccp_sock *dp = dccp_sk(sk);
>  	struct in6_addr *saddr = NULL, *final_p, final;
> +	bool prev_inaddr_any = false;
>  	struct ipv6_txoptions *opt;
>  	struct flowi6 fl6;
>  	struct dst_entry *dst;
> @@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  	if (saddr == NULL) {
>  		saddr = &fl6.saddr;
>  		sk->sk_v6_rcv_saddr = *saddr;
> +		prev_inaddr_any = true;
>  	}
>  
>  	/* set the source address */
> @@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  	inet->inet_dport = usin->sin6_port;
>  
>  	dccp_set_state(sk, DCCP_REQUESTING);
> -	err = inet6_hash_connect(&dccp_death_row, sk);
> +	err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
>  	if (err)
>  		goto late_failure;
>  
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 93da9f783bec..ad627a99ff9d 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
>  	struct inet_sock *inet = inet_sk(sk);
>  	__be32 old_saddr = inet->inet_saddr;
>  	__be32 daddr = inet->inet_daddr;
> +	struct ip_options_rcu *inet_opt;
>  	struct flowi4 *fl4;
>  	struct rtable *rt;
>  	__be32 new_saddr;
> -	struct ip_options_rcu *inet_opt;
> +	int err;
>  
>  	inet_opt = rcu_dereference_protected(inet->inet_opt,
>  					     lockdep_sock_is_held(sk));
> @@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
>  
>  	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
>  
> +	err = inet_bhash2_update_saddr(sk);
> +	if (err)
> +		return err;
> +
>  	/*
>  	 * XXX The only one ugly spot where we need to
>  	 * XXX really change the sockets identity after
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e8de5e699b3f..592b70663a3b 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
>  	return bhash2;
>  }
>  
> +/* the lock for the socket's corresponding bhash entry must be held */
> +static int __inet_bhash2_update_saddr(struct sock *sk,
> +				      struct inet_hashinfo *hinfo,
> +				      struct net *net, int port, int l3mdev)
> +{
> +	struct inet_bind2_hashbucket *head2;
> +	struct inet_bind2_bucket *tb2;
> +
> +	tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
> +				     &head2);
> +	if (!tb2) {
> +		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
> +					       net, head2, port, l3mdev, sk);
> +		if (!tb2)
> +			return -ENOMEM;
> +	}
> +
> +	/* Remove the socket's old entry from bhash2 */
> +	__sk_del_bind2_node(sk);
> +
> +	sk_add_bind2_node(sk, &tb2->owners);
> +	inet_csk(sk)->icsk_bind2_hash = tb2;
> +
> +	return 0;
> +}
> +
> +/* This should be called if/when a socket's rcv saddr changes after it has
> + * been binded.
> + */
> +int inet_bhash2_update_saddr(struct sock *sk)
> +{
> +	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> +	int l3mdev = inet_sk_bound_l3mdev(sk);
> +	struct inet_bind_hashbucket *head;
> +	int port = inet_sk(sk)->inet_num;
> +	struct net *net = sock_net(sk);
> +	int err;
> +
> +	head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
> +
> +	spin_lock_bh(&head->lock);
> +
> +	err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
> +
> +	spin_unlock_bh(&head->lock);
> +
> +	return err;
> +}
> +
>  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
>   * Note that we use 32bit integers (vs RFC 'short integers')
>   * because 2^16 is not a multiple of num_ephemeral and this
> @@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
>  static u32 *table_perturb;
>  
>  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> -		struct sock *sk, u64 port_offset,
> +		struct sock *sk, u64 port_offset, bool prev_inaddr_any,
>  		int (*check_established)(struct inet_timewait_death_row *,
>  			struct sock *, __u16, struct inet_timewait_sock **))
>  {
> @@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  	int l3mdev;
>  	u32 index;
>  
> +	l3mdev = inet_sk_bound_l3mdev(sk);
> +
>  	if (port) {
>  		head = &hinfo->bhash[inet_bhashfn(net, port,
>  						  hinfo->bhash_size)];
>  		tb = inet_csk(sk)->icsk_bind_hash;
> +
>  		spin_lock_bh(&head->lock);
> +
> +		if (prev_inaddr_any) {
> +			ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
> +							 l3mdev);
> +			if (ret) {
> +				spin_unlock_bh(&head->lock);
> +				return ret;
> +			}
> +		}
> +
>  		if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
>  			inet_ehash_nolisten(sk, NULL, NULL);
>  			spin_unlock_bh(&head->lock);
> @@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  		return ret;
>  	}
>  
> -	l3mdev = inet_sk_bound_l3mdev(sk);
> -
>  	inet_get_local_port_range(net, &low, &high);
>  	high++; /* [32768, 60999] -> [32768, 61000[ */
>  	remaining = high - low;
> @@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>   * Bind a port for a connect operation and hash it.
>   */
>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> -		      struct sock *sk)
> +		      struct sock *sk, bool prev_inaddr_any)
>  {
>  	u64 port_offset = 0;
>  
>  	if (!inet_sk(sk)->inet_num)
>  		port_offset = inet_sk_port_offset(sk);
> -	return __inet_hash_connect(death_row, sk, port_offset,
> +	return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
>  				   __inet_check_established);
>  }
>  EXPORT_SYMBOL_GPL(inet_hash_connect);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index dac2650f3863..adf8d750933d 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	__be16 orig_sport, orig_dport;
> +	bool prev_inaddr_any = false;
>  	__be32 daddr, nexthop;
>  	struct flowi4 *fl4;
>  	struct rtable *rt;
> @@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	if (!inet_opt || !inet_opt->opt.srr)
>  		daddr = fl4->daddr;
>  
> -	if (!inet->inet_saddr)
> +	if (!inet->inet_saddr) {
>  		inet->inet_saddr = fl4->saddr;
> +		prev_inaddr_any = true;
> +	}
> +
>  	sk_rcv_saddr_set(sk, inet->inet_saddr);
>  
>  	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> @@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	 * complete initialization after this.
>  	 */
>  	tcp_set_state(sk, TCP_SYN_SENT);
> -	err = inet_hash_connect(tcp_death_row, sk);
> +	err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
>  	if (err)
>  		goto failure;
>  
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index 7d53d62783b1..c87c5933f3be 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
>  }
>  
>  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> -		       struct sock *sk)
> +		       struct sock *sk, bool prev_inaddr_any)
>  {
>  	u64 port_offset = 0;
>  
>  	if (!inet_sk(sk)->inet_num)
>  		port_offset = inet6_sk_port_offset(sk);
> -	return __inet_hash_connect(death_row, sk, port_offset,
> +	return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
>  				   __inet6_check_established);
>  }
>  EXPORT_SYMBOL_GPL(inet6_hash_connect);
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index f37dd4aa91c6..81e3312c2a97 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  	struct ipv6_pinfo *np = tcp_inet6_sk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	struct in6_addr *saddr = NULL, *final_p, final;
> +	bool prev_inaddr_any = false;
>  	struct ipv6_txoptions *opt;
>  	struct flowi6 fl6;
>  	struct dst_entry *dst;
> @@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  	if (!saddr) {
>  		saddr = &fl6.saddr;
>  		sk->sk_v6_rcv_saddr = *saddr;
> +		prev_inaddr_any = true;
>  	}
>  
>  	/* set the source address */
> @@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  
>  	tcp_set_state(sk, TCP_SYN_SENT);
>  	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
> -	err = inet6_hash_connect(tcp_death_row, sk);
> +	err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
>  	if (err)
>  		goto late_failure;
>  

I'm sorry for the late notice, but it looks like that the mptcp
syzkaller instance is still hitting the Warning in icsk_get_port on top
of the v1 of this series:

https://github.com/multipath-tcp/mptcp_net-next/issues/279

and the change in v2 should not address that. @Mat could you please
confirm the above?

Dumb question: I don't understand how the locking in bhash2 works.
Could you explain that? 

What happens when 2 different processes bind different sockets on
different ports (with different bhash buckets) using different
addresses so that they hit the same bhash2 bucket? AFAICS each process
will use a different lock and access/modification to bhash2 could
happen simultaneusly?

Thanks!

Paolo



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
  2022-06-07  8:33   ` Paolo Abeni
@ 2022-06-07 17:10     ` Mat Martineau
  2022-06-07 20:24     ` Joanne Koong
  1 sibling, 0 replies; 18+ messages in thread
From: Mat Martineau @ 2022-06-07 17:10 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Joanne Koong, netdev, edumazet, kafai, kuba, davem,
	syzbot+015d756bbd1f8b5c8f09

On Tue, 7 Jun 2022, Paolo Abeni wrote:

> Hello,
>
> On Thu, 2022-06-02 at 09:51 -0700, Joanne Koong wrote:
>> Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
>> address") added a second bind table, bhash2, that hashes by a socket's port
>> and rcv address.
>>
>> However, there are two cases where the socket's rcv saddr can change
>> after it has been binded:
>>
>> 1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
>> a connect() call. The kernel will assign the socket an address when it
>> handles the connect()
>>
>> 2) In inet_sk_reselect_saddr(), which is called when rerouting fails
>> when rebuilding the sk header (invoked by inet_sk_rebuild_header)
>>
>> In these two cases, we need to update the bhash2 table by removing the
>> entry for the old address, and adding a new entry reflecting the updated
>> address.
>>
>> Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
>> Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>> Reviewed-by: Eric Dumazet <edumazet@google.com>
>> ---
>>  include/net/inet_hashtables.h |  6 ++-
>>  include/net/ipv6.h            |  2 +-
>>  net/dccp/ipv4.c               | 10 +++--
>>  net/dccp/ipv6.c               |  4 +-
>>  net/ipv4/af_inet.c            |  7 +++-
>>  net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
>>  net/ipv4/tcp_ipv4.c           |  8 +++-
>>  net/ipv6/inet6_hashtables.c   |  4 +-
>>  net/ipv6/tcp_ipv6.c           |  4 +-
>>  9 files changed, 97 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
>> index a0887b70967b..2c331ce6ca73 100644
>> --- a/include/net/inet_hashtables.h
>> +++ b/include/net/inet_hashtables.h
>> @@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
>>  }
>>
>>  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>> -			struct sock *sk, u64 port_offset,
>> +			struct sock *sk, u64 port_offset, bool prev_inaddr_any,
>>  			int (*check_established)(struct inet_timewait_death_row *,
>>  						 struct sock *, __u16,
>>  						 struct inet_timewait_sock **));
>>
>>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
>> -		      struct sock *sk);
>> +		      struct sock *sk, bool prev_inaddr_any);
>> +
>> +int inet_bhash2_update_saddr(struct sock *sk);
>>  #endif /* _INET_HASHTABLES_H */
>> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
>> index 5b38bf1a586b..6a50aca56d50 100644
>> --- a/include/net/ipv6.h
>> +++ b/include/net/ipv6.h
>> @@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
>>  		unsigned long arg);
>>
>>  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
>> -			      struct sock *sk);
>> +		       struct sock *sk, bool prev_inaddr_any);
>>  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
>>  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>>  		  int flags);
>> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
>> index da6e3b20cd75..37a8bc3ee49e 100644
>> --- a/net/dccp/ipv4.c
>> +++ b/net/dccp/ipv4.c
>> @@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
>>  	struct inet_sock *inet = inet_sk(sk);
>>  	struct dccp_sock *dp = dccp_sk(sk);
>> +	struct ip_options_rcu *inet_opt;
>>  	__be16 orig_sport, orig_dport;
>> +	bool prev_inaddr_any = false;
>>  	__be32 daddr, nexthop;
>>  	struct flowi4 *fl4;
>>  	struct rtable *rt;
>>  	int err;
>> -	struct ip_options_rcu *inet_opt;
>>
>>  	dp->dccps_role = DCCP_ROLE_CLIENT;
>>
>> @@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  	if (inet_opt == NULL || !inet_opt->opt.srr)
>>  		daddr = fl4->daddr;
>>
>> -	if (inet->inet_saddr == 0)
>> +	if (inet->inet_saddr == 0) {
>>  		inet->inet_saddr = fl4->saddr;
>> +		prev_inaddr_any = true;
>> +	}
>> +
>>  	sk_rcv_saddr_set(sk, inet->inet_saddr);
>>  	inet->inet_dport = usin->sin_port;
>>  	sk_daddr_set(sk, daddr);
>> @@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  	 * complete initialization after this.
>>  	 */
>>  	dccp_set_state(sk, DCCP_REQUESTING);
>> -	err = inet_hash_connect(&dccp_death_row, sk);
>> +	err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
>>  	if (err != 0)
>>  		goto failure;
>>
>> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
>> index fd44638ec16b..03013522acab 100644
>> --- a/net/dccp/ipv6.c
>> +++ b/net/dccp/ipv6.c
>> @@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>>  	struct ipv6_pinfo *np = inet6_sk(sk);
>>  	struct dccp_sock *dp = dccp_sk(sk);
>>  	struct in6_addr *saddr = NULL, *final_p, final;
>> +	bool prev_inaddr_any = false;
>>  	struct ipv6_txoptions *opt;
>>  	struct flowi6 fl6;
>>  	struct dst_entry *dst;
>> @@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>>  	if (saddr == NULL) {
>>  		saddr = &fl6.saddr;
>>  		sk->sk_v6_rcv_saddr = *saddr;
>> +		prev_inaddr_any = true;
>>  	}
>>
>>  	/* set the source address */
>> @@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>>  	inet->inet_dport = usin->sin6_port;
>>
>>  	dccp_set_state(sk, DCCP_REQUESTING);
>> -	err = inet6_hash_connect(&dccp_death_row, sk);
>> +	err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
>>  	if (err)
>>  		goto late_failure;
>>
>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> index 93da9f783bec..ad627a99ff9d 100644
>> --- a/net/ipv4/af_inet.c
>> +++ b/net/ipv4/af_inet.c
>> @@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
>>  	struct inet_sock *inet = inet_sk(sk);
>>  	__be32 old_saddr = inet->inet_saddr;
>>  	__be32 daddr = inet->inet_daddr;
>> +	struct ip_options_rcu *inet_opt;
>>  	struct flowi4 *fl4;
>>  	struct rtable *rt;
>>  	__be32 new_saddr;
>> -	struct ip_options_rcu *inet_opt;
>> +	int err;
>>
>>  	inet_opt = rcu_dereference_protected(inet->inet_opt,
>>  					     lockdep_sock_is_held(sk));
>> @@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
>>
>>  	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
>>
>> +	err = inet_bhash2_update_saddr(sk);
>> +	if (err)
>> +		return err;
>> +
>>  	/*
>>  	 * XXX The only one ugly spot where we need to
>>  	 * XXX really change the sockets identity after
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index e8de5e699b3f..592b70663a3b 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
>>  	return bhash2;
>>  }
>>
>> +/* the lock for the socket's corresponding bhash entry must be held */
>> +static int __inet_bhash2_update_saddr(struct sock *sk,
>> +				      struct inet_hashinfo *hinfo,
>> +				      struct net *net, int port, int l3mdev)
>> +{
>> +	struct inet_bind2_hashbucket *head2;
>> +	struct inet_bind2_bucket *tb2;
>> +
>> +	tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
>> +				     &head2);
>> +	if (!tb2) {
>> +		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
>> +					       net, head2, port, l3mdev, sk);
>> +		if (!tb2)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	/* Remove the socket's old entry from bhash2 */
>> +	__sk_del_bind2_node(sk);
>> +
>> +	sk_add_bind2_node(sk, &tb2->owners);
>> +	inet_csk(sk)->icsk_bind2_hash = tb2;
>> +
>> +	return 0;
>> +}
>> +
>> +/* This should be called if/when a socket's rcv saddr changes after it has
>> + * been binded.
>> + */
>> +int inet_bhash2_update_saddr(struct sock *sk)
>> +{
>> +	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
>> +	int l3mdev = inet_sk_bound_l3mdev(sk);
>> +	struct inet_bind_hashbucket *head;
>> +	int port = inet_sk(sk)->inet_num;
>> +	struct net *net = sock_net(sk);
>> +	int err;
>> +
>> +	head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
>> +
>> +	spin_lock_bh(&head->lock);
>> +
>> +	err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
>> +
>> +	spin_unlock_bh(&head->lock);
>> +
>> +	return err;
>> +}
>> +
>>  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
>>   * Note that we use 32bit integers (vs RFC 'short integers')
>>   * because 2^16 is not a multiple of num_ephemeral and this
>> @@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
>>  static u32 *table_perturb;
>>
>>  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>> -		struct sock *sk, u64 port_offset,
>> +		struct sock *sk, u64 port_offset, bool prev_inaddr_any,
>>  		int (*check_established)(struct inet_timewait_death_row *,
>>  			struct sock *, __u16, struct inet_timewait_sock **))
>>  {
>> @@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>  	int l3mdev;
>>  	u32 index;
>>
>> +	l3mdev = inet_sk_bound_l3mdev(sk);
>> +
>>  	if (port) {
>>  		head = &hinfo->bhash[inet_bhashfn(net, port,
>>  						  hinfo->bhash_size)];
>>  		tb = inet_csk(sk)->icsk_bind_hash;
>> +
>>  		spin_lock_bh(&head->lock);
>> +
>> +		if (prev_inaddr_any) {
>> +			ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
>> +							 l3mdev);
>> +			if (ret) {
>> +				spin_unlock_bh(&head->lock);
>> +				return ret;
>> +			}
>> +		}
>> +
>>  		if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
>>  			inet_ehash_nolisten(sk, NULL, NULL);
>>  			spin_unlock_bh(&head->lock);
>> @@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>  		return ret;
>>  	}
>>
>> -	l3mdev = inet_sk_bound_l3mdev(sk);
>> -
>>  	inet_get_local_port_range(net, &low, &high);
>>  	high++; /* [32768, 60999] -> [32768, 61000[ */
>>  	remaining = high - low;
>> @@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>   * Bind a port for a connect operation and hash it.
>>   */
>>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
>> -		      struct sock *sk)
>> +		      struct sock *sk, bool prev_inaddr_any)
>>  {
>>  	u64 port_offset = 0;
>>
>>  	if (!inet_sk(sk)->inet_num)
>>  		port_offset = inet_sk_port_offset(sk);
>> -	return __inet_hash_connect(death_row, sk, port_offset,
>> +	return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
>>  				   __inet_check_established);
>>  }
>>  EXPORT_SYMBOL_GPL(inet_hash_connect);
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index dac2650f3863..adf8d750933d 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  	struct inet_sock *inet = inet_sk(sk);
>>  	struct tcp_sock *tp = tcp_sk(sk);
>>  	__be16 orig_sport, orig_dport;
>> +	bool prev_inaddr_any = false;
>>  	__be32 daddr, nexthop;
>>  	struct flowi4 *fl4;
>>  	struct rtable *rt;
>> @@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  	if (!inet_opt || !inet_opt->opt.srr)
>>  		daddr = fl4->daddr;
>>
>> -	if (!inet->inet_saddr)
>> +	if (!inet->inet_saddr) {
>>  		inet->inet_saddr = fl4->saddr;
>> +		prev_inaddr_any = true;
>> +	}
>> +
>>  	sk_rcv_saddr_set(sk, inet->inet_saddr);
>>
>>  	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
>> @@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  	 * complete initialization after this.
>>  	 */
>>  	tcp_set_state(sk, TCP_SYN_SENT);
>> -	err = inet_hash_connect(tcp_death_row, sk);
>> +	err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
>>  	if (err)
>>  		goto failure;
>>
>> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
>> index 7d53d62783b1..c87c5933f3be 100644
>> --- a/net/ipv6/inet6_hashtables.c
>> +++ b/net/ipv6/inet6_hashtables.c
>> @@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
>>  }
>>
>>  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
>> -		       struct sock *sk)
>> +		       struct sock *sk, bool prev_inaddr_any)
>>  {
>>  	u64 port_offset = 0;
>>
>>  	if (!inet_sk(sk)->inet_num)
>>  		port_offset = inet6_sk_port_offset(sk);
>> -	return __inet_hash_connect(death_row, sk, port_offset,
>> +	return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
>>  				   __inet6_check_established);
>>  }
>>  EXPORT_SYMBOL_GPL(inet6_hash_connect);
>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>> index f37dd4aa91c6..81e3312c2a97 100644
>> --- a/net/ipv6/tcp_ipv6.c
>> +++ b/net/ipv6/tcp_ipv6.c
>> @@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>>  	struct ipv6_pinfo *np = tcp_inet6_sk(sk);
>>  	struct tcp_sock *tp = tcp_sk(sk);
>>  	struct in6_addr *saddr = NULL, *final_p, final;
>> +	bool prev_inaddr_any = false;
>>  	struct ipv6_txoptions *opt;
>>  	struct flowi6 fl6;
>>  	struct dst_entry *dst;
>> @@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>>  	if (!saddr) {
>>  		saddr = &fl6.saddr;
>>  		sk->sk_v6_rcv_saddr = *saddr;
>> +		prev_inaddr_any = true;
>>  	}
>>
>>  	/* set the source address */
>> @@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>>
>>  	tcp_set_state(sk, TCP_SYN_SENT);
>>  	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
>> -	err = inet6_hash_connect(tcp_death_row, sk);
>> +	err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
>>  	if (err)
>>  		goto late_failure;
>>
>
> I'm sorry for the late notice, but it looks like that the mptcp
> syzkaller instance is still hitting the Warning in icsk_get_port on top
> of the v1 of this series:
>
> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>
> and the change in v2 should not address that. @Mat could you please
> confirm the above?

Yes, I did see the icsk_get_port warning one time between June 1 and today 
with the v1 patch applied. I'll restart syzkaller with latest net-next and 
v2 just to be sure.

>
> Dumb question: I don't understand how the locking in bhash2 works.
> Could you explain that?
>
> What happens when 2 different processes bind different sockets on
> different ports (with different bhash buckets) using different
> addresses so that they hit the same bhash2 bucket? AFAICS each process
> will use a different lock and access/modification to bhash2 could
> happen simultaneusly?
>
> Thanks!
>
> Paolo
>
>
>

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-07  0:02     ` Joanne Koong
@ 2022-06-07 17:31       ` Mat Martineau
  2022-06-08 20:27         ` Joanne Koong
  0 siblings, 1 reply; 18+ messages in thread
From: Mat Martineau @ 2022-06-07 17:31 UTC (permalink / raw)
  To: Joanne Koong
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Mon, 6 Jun 2022, Joanne Koong wrote:

> On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
>> <mathew.j.martineau@linux.intel.com> wrote:
>>>
>>> On Thu, 2 Jun 2022, Joanne Koong wrote:
>>>
>>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
>>>> case where a socket's rcv saddr changes after it is binded. (For more
>>>> details, please see the commit message of the first patch)
>>>>
>>>> This patchset fixes that and adds a test that triggers the case where the
>>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
>>>>
>>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
>>>>
>>>> --
>>>> v1 -> v2:
>>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
>>>> * Mark __inet_bhash2_update_saddr as static
>>>>
>>>> Joanne Koong (2):
>>>>  net: Update bhash2 when socket's rcv saddr changes
>>>>  selftests/net: Add sk_bind_sendto_listen test
>>>>
>>>
>>> Hi Joanne -
>>>
>>> I've been running my own syzkaller instance with v1 of this fix for a
>>> couple of days. Before this patch, syzkaller would trigger the
>>> inet_csk_get_port warning a couple of times per hour. After this patch it
>>> took two days to show the warning:
>>>
>>> ------------[ cut here ]------------
>>>
>>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
>>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>> Modules linked in:
>>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
>>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
>>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
>>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
>>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
>>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
>>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
>>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
>>> PKRU: 55555554
>>> Call Trace:
>>>   <TASK>
>>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
>>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
>>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
>>>   __sys_listen+0x189/0x260 net/socket.c:1810
>>>   __do_sys_listen net/socket.c:1819 [inline]
>>>   __se_sys_listen net/socket.c:1817 [inline]
>>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
>>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
>>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>> RIP: 0033:0x7fada558f92d
>>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
>>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
>>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
>>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
>>>   </TASK>
>>> irq event stamp: 2078
>>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
>>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
>>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
>>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
>>> ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> In the full log file it does look like syskaller is doing something
>>> strange since it's calling bind, connect, and listen on the same socket:
>>>
>>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
>>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
>>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
>>> listen(r4, 0x0)
>>>
>>> ...but it is a fuzz tester after all.
>>>
>>> I've uploaded the full syzkaller logs to this GitHub issue:
>>>
>>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>>>
>>>
>>> Not sure yet if this is MPTCP-related. I don't think MPTCP
>>> changes anything with the subflow TCP socket bind hashes.
>>>
>> Hi Mat,
>>
>> Thanks for bringing this up and for uploading the logs. I will look into this.
>>>
> Hi Mat,
>
> I am still trying to configure my local environment for mptcp to repro
> + test the fix to verify that it is correct. I think the fix is to add
> "inet_bhash2_update_saddr(msk);" to the end of the
> "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
> able to run an instance on your local syzkaller environment with this
> line added to see if that fixes the warning?

Hi Joanne -

I also investigated that function when trying to figure out why this 
warning might be happening in MPTCP.

In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or 
connect. The msk creates and manages TCP subflow sockets (ssk in 
mptcp_copy_inaddrs()), and passes through bind and connect calls to the 
subflows. The msk depends on the subflow to do the hash updates, since 
those subflow sockets are the ones interacting with the inet layer.

mptcp_copy_inaddrs() copies the already-hashed addresses and ports from 
the ssk to the msk, and we only want the ssk in the hash table.

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
  2022-06-07  8:33   ` Paolo Abeni
  2022-06-07 17:10     ` Mat Martineau
@ 2022-06-07 20:24     ` Joanne Koong
  2022-06-08  7:35       ` Paolo Abeni
  1 sibling, 1 reply; 18+ messages in thread
From: Joanne Koong @ 2022-06-07 20:24 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, syzbot, Mat Martineau

On Tue, Jun 7, 2022 at 1:33 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Thu, 2022-06-02 at 09:51 -0700, Joanne Koong wrote:
> > Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
> > address") added a second bind table, bhash2, that hashes by a socket's port
> > and rcv address.
> >
> > However, there are two cases where the socket's rcv saddr can change
> > after it has been binded:
> >
> > 1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
> > a connect() call. The kernel will assign the socket an address when it
> > handles the connect()
> >
> > 2) In inet_sk_reselect_saddr(), which is called when rerouting fails
> > when rebuilding the sk header (invoked by inet_sk_rebuild_header)
> >
> > In these two cases, we need to update the bhash2 table by removing the
> > entry for the old address, and adding a new entry reflecting the updated
> > address.
> >
> > Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
> > Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/net/inet_hashtables.h |  6 ++-
> >  include/net/ipv6.h            |  2 +-
> >  net/dccp/ipv4.c               | 10 +++--
> >  net/dccp/ipv6.c               |  4 +-
> >  net/ipv4/af_inet.c            |  7 +++-
> >  net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
> >  net/ipv4/tcp_ipv4.c           |  8 +++-
> >  net/ipv6/inet6_hashtables.c   |  4 +-
> >  net/ipv6/tcp_ipv6.c           |  4 +-
> >  9 files changed, 97 insertions(+), 18 deletions(-)
> >
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index a0887b70967b..2c331ce6ca73 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
> >  }
> >
> >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > -                     struct sock *sk, u64 port_offset,
> > +                     struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> >                       int (*check_established)(struct inet_timewait_death_row *,
> >                                                struct sock *, __u16,
> >                                                struct inet_timewait_sock **));
> >
> >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > -                   struct sock *sk);
> > +                   struct sock *sk, bool prev_inaddr_any);
> > +
> > +int inet_bhash2_update_saddr(struct sock *sk);
> >  #endif /* _INET_HASHTABLES_H */
> > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> > index 5b38bf1a586b..6a50aca56d50 100644
> > --- a/include/net/ipv6.h
> > +++ b/include/net/ipv6.h
> > @@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
> >               unsigned long arg);
> >
> >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > -                           struct sock *sk);
> > +                    struct sock *sk, bool prev_inaddr_any);
> >  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
> >  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> >                 int flags);
> > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > index da6e3b20cd75..37a8bc3ee49e 100644
> > --- a/net/dccp/ipv4.c
> > +++ b/net/dccp/ipv4.c
> > @@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >       const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> >       struct inet_sock *inet = inet_sk(sk);
> >       struct dccp_sock *dp = dccp_sk(sk);
> > +     struct ip_options_rcu *inet_opt;
> >       __be16 orig_sport, orig_dport;
> > +     bool prev_inaddr_any = false;
> >       __be32 daddr, nexthop;
> >       struct flowi4 *fl4;
> >       struct rtable *rt;
> >       int err;
> > -     struct ip_options_rcu *inet_opt;
> >
> >       dp->dccps_role = DCCP_ROLE_CLIENT;
> >
> > @@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >       if (inet_opt == NULL || !inet_opt->opt.srr)
> >               daddr = fl4->daddr;
> >
> > -     if (inet->inet_saddr == 0)
> > +     if (inet->inet_saddr == 0) {
> >               inet->inet_saddr = fl4->saddr;
> > +             prev_inaddr_any = true;
> > +     }
> > +
> >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> >       inet->inet_dport = usin->sin_port;
> >       sk_daddr_set(sk, daddr);
> > @@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >        * complete initialization after this.
> >        */
> >       dccp_set_state(sk, DCCP_REQUESTING);
> > -     err = inet_hash_connect(&dccp_death_row, sk);
> > +     err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> >       if (err != 0)
> >               goto failure;
> >
> > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > index fd44638ec16b..03013522acab 100644
> > --- a/net/dccp/ipv6.c
> > +++ b/net/dccp/ipv6.c
> > @@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >       struct ipv6_pinfo *np = inet6_sk(sk);
> >       struct dccp_sock *dp = dccp_sk(sk);
> >       struct in6_addr *saddr = NULL, *final_p, final;
> > +     bool prev_inaddr_any = false;
> >       struct ipv6_txoptions *opt;
> >       struct flowi6 fl6;
> >       struct dst_entry *dst;
> > @@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >       if (saddr == NULL) {
> >               saddr = &fl6.saddr;
> >               sk->sk_v6_rcv_saddr = *saddr;
> > +             prev_inaddr_any = true;
> >       }
> >
> >       /* set the source address */
> > @@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >       inet->inet_dport = usin->sin6_port;
> >
> >       dccp_set_state(sk, DCCP_REQUESTING);
> > -     err = inet6_hash_connect(&dccp_death_row, sk);
> > +     err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> >       if (err)
> >               goto late_failure;
> >
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 93da9f783bec..ad627a99ff9d 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> >       struct inet_sock *inet = inet_sk(sk);
> >       __be32 old_saddr = inet->inet_saddr;
> >       __be32 daddr = inet->inet_daddr;
> > +     struct ip_options_rcu *inet_opt;
> >       struct flowi4 *fl4;
> >       struct rtable *rt;
> >       __be32 new_saddr;
> > -     struct ip_options_rcu *inet_opt;
> > +     int err;
> >
> >       inet_opt = rcu_dereference_protected(inet->inet_opt,
> >                                            lockdep_sock_is_held(sk));
> > @@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> >
> >       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> >
> > +     err = inet_bhash2_update_saddr(sk);
> > +     if (err)
> > +             return err;
> > +
> >       /*
> >        * XXX The only one ugly spot where we need to
> >        * XXX really change the sockets identity after
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index e8de5e699b3f..592b70663a3b 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> >       return bhash2;
> >  }
> >
> > +/* the lock for the socket's corresponding bhash entry must be held */
> > +static int __inet_bhash2_update_saddr(struct sock *sk,
> > +                                   struct inet_hashinfo *hinfo,
> > +                                   struct net *net, int port, int l3mdev)
> > +{
> > +     struct inet_bind2_hashbucket *head2;
> > +     struct inet_bind2_bucket *tb2;
> > +
> > +     tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
> > +                                  &head2);
> > +     if (!tb2) {
> > +             tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
> > +                                            net, head2, port, l3mdev, sk);
> > +             if (!tb2)
> > +                     return -ENOMEM;
> > +     }
> > +
> > +     /* Remove the socket's old entry from bhash2 */
> > +     __sk_del_bind2_node(sk);
> > +
> > +     sk_add_bind2_node(sk, &tb2->owners);
> > +     inet_csk(sk)->icsk_bind2_hash = tb2;
> > +
> > +     return 0;
> > +}
> > +
> > +/* This should be called if/when a socket's rcv saddr changes after it has
> > + * been binded.
> > + */
> > +int inet_bhash2_update_saddr(struct sock *sk)
> > +{
> > +     struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> > +     int l3mdev = inet_sk_bound_l3mdev(sk);
> > +     struct inet_bind_hashbucket *head;
> > +     int port = inet_sk(sk)->inet_num;
> > +     struct net *net = sock_net(sk);
> > +     int err;
> > +
> > +     head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
> > +
> > +     spin_lock_bh(&head->lock);
> > +
> > +     err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
> > +
> > +     spin_unlock_bh(&head->lock);
> > +
> > +     return err;
> > +}
> > +
> >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> >   * Note that we use 32bit integers (vs RFC 'short integers')
> >   * because 2^16 is not a multiple of num_ephemeral and this
> > @@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> >  static u32 *table_perturb;
> >
> >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > -             struct sock *sk, u64 port_offset,
> > +             struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> >               int (*check_established)(struct inet_timewait_death_row *,
> >                       struct sock *, __u16, struct inet_timewait_sock **))
> >  {
> > @@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >       int l3mdev;
> >       u32 index;
> >
> > +     l3mdev = inet_sk_bound_l3mdev(sk);
> > +
> >       if (port) {
> >               head = &hinfo->bhash[inet_bhashfn(net, port,
> >                                                 hinfo->bhash_size)];
> >               tb = inet_csk(sk)->icsk_bind_hash;
> > +
> >               spin_lock_bh(&head->lock);
> > +
> > +             if (prev_inaddr_any) {
> > +                     ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
> > +                                                      l3mdev);
> > +                     if (ret) {
> > +                             spin_unlock_bh(&head->lock);
> > +                             return ret;
> > +                     }
> > +             }
> > +
> >               if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
> >                       inet_ehash_nolisten(sk, NULL, NULL);
> >                       spin_unlock_bh(&head->lock);
> > @@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >               return ret;
> >       }
> >
> > -     l3mdev = inet_sk_bound_l3mdev(sk);
> > -
> >       inet_get_local_port_range(net, &low, &high);
> >       high++; /* [32768, 60999] -> [32768, 61000[ */
> >       remaining = high - low;
> > @@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >   * Bind a port for a connect operation and hash it.
> >   */
> >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > -                   struct sock *sk)
> > +                   struct sock *sk, bool prev_inaddr_any)
> >  {
> >       u64 port_offset = 0;
> >
> >       if (!inet_sk(sk)->inet_num)
> >               port_offset = inet_sk_port_offset(sk);
> > -     return __inet_hash_connect(death_row, sk, port_offset,
> > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> >                                  __inet_check_established);
> >  }
> >  EXPORT_SYMBOL_GPL(inet_hash_connect);
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index dac2650f3863..adf8d750933d 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >       struct inet_sock *inet = inet_sk(sk);
> >       struct tcp_sock *tp = tcp_sk(sk);
> >       __be16 orig_sport, orig_dport;
> > +     bool prev_inaddr_any = false;
> >       __be32 daddr, nexthop;
> >       struct flowi4 *fl4;
> >       struct rtable *rt;
> > @@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >       if (!inet_opt || !inet_opt->opt.srr)
> >               daddr = fl4->daddr;
> >
> > -     if (!inet->inet_saddr)
> > +     if (!inet->inet_saddr) {
> >               inet->inet_saddr = fl4->saddr;
> > +             prev_inaddr_any = true;
> > +     }
> > +
> >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> >
> >       if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > @@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >        * complete initialization after this.
> >        */
> >       tcp_set_state(sk, TCP_SYN_SENT);
> > -     err = inet_hash_connect(tcp_death_row, sk);
> > +     err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> >       if (err)
> >               goto failure;
> >
> > diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> > index 7d53d62783b1..c87c5933f3be 100644
> > --- a/net/ipv6/inet6_hashtables.c
> > +++ b/net/ipv6/inet6_hashtables.c
> > @@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
> >  }
> >
> >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > -                    struct sock *sk)
> > +                    struct sock *sk, bool prev_inaddr_any)
> >  {
> >       u64 port_offset = 0;
> >
> >       if (!inet_sk(sk)->inet_num)
> >               port_offset = inet6_sk_port_offset(sk);
> > -     return __inet_hash_connect(death_row, sk, port_offset,
> > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> >                                  __inet6_check_established);
> >  }
> >  EXPORT_SYMBOL_GPL(inet6_hash_connect);
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index f37dd4aa91c6..81e3312c2a97 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >       struct ipv6_pinfo *np = tcp_inet6_sk(sk);
> >       struct tcp_sock *tp = tcp_sk(sk);
> >       struct in6_addr *saddr = NULL, *final_p, final;
> > +     bool prev_inaddr_any = false;
> >       struct ipv6_txoptions *opt;
> >       struct flowi6 fl6;
> >       struct dst_entry *dst;
> > @@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >       if (!saddr) {
> >               saddr = &fl6.saddr;
> >               sk->sk_v6_rcv_saddr = *saddr;
> > +             prev_inaddr_any = true;
> >       }
> >
> >       /* set the source address */
> > @@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >
> >       tcp_set_state(sk, TCP_SYN_SENT);
> >       tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
> > -     err = inet6_hash_connect(tcp_death_row, sk);
> > +     err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> >       if (err)
> >               goto late_failure;
> >
>
> I'm sorry for the late notice, but it looks like that the mptcp
> syzkaller instance is still hitting the Warning in icsk_get_port on top
> of the v1 of this series:
>
> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>
> and the change in v2 should not address that. @Mat could you please
> confirm the above?
>
> Dumb question: I don't understand how the locking in bhash2 works.
> Could you explain that?
>
> What happens when 2 different processes bind different sockets on
> different ports (with different bhash buckets) using different
> addresses so that they hit the same bhash2 bucket? AFAICS each process
> will use a different lock and access/modification to bhash2 could
> happen simultaneusly?
Hi Paolo. Yes, I think you are correct here that there could be a
scenario where this happens. Unfortunately, I think this means the
bhash2 table will need its own lock. I will submit a follow-up for
this.
>
> Thanks!
>
> Paolo
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
  2022-06-07 20:24     ` Joanne Koong
@ 2022-06-08  7:35       ` Paolo Abeni
  2022-06-08 17:47         ` Joanne Koong
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2022-06-08  7:35 UTC (permalink / raw)
  To: Joanne Koong, netdev, Eric Dumazet, Martin KaFai Lau,
	Jakub Kicinski, David Miller, syzbot, Mat Martineau

On Tue, 2022-06-07 at 13:24 -0700, Joanne Koong wrote:
> On Tue, Jun 7, 2022 at 1:33 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > Hello,
> > 
> > On Thu, 2022-06-02 at 09:51 -0700, Joanne Koong wrote:
> > > Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
> > > address") added a second bind table, bhash2, that hashes by a socket's port
> > > and rcv address.
> > > 
> > > However, there are two cases where the socket's rcv saddr can change
> > > after it has been binded:
> > > 
> > > 1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
> > > a connect() call. The kernel will assign the socket an address when it
> > > handles the connect()
> > > 
> > > 2) In inet_sk_reselect_saddr(), which is called when rerouting fails
> > > when rebuilding the sk header (invoked by inet_sk_rebuild_header)
> > > 
> > > In these two cases, we need to update the bhash2 table by removing the
> > > entry for the old address, and adding a new entry reflecting the updated
> > > address.
> > > 
> > > Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
> > > Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  include/net/inet_hashtables.h |  6 ++-
> > >  include/net/ipv6.h            |  2 +-
> > >  net/dccp/ipv4.c               | 10 +++--
> > >  net/dccp/ipv6.c               |  4 +-
> > >  net/ipv4/af_inet.c            |  7 +++-
> > >  net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
> > >  net/ipv4/tcp_ipv4.c           |  8 +++-
> > >  net/ipv6/inet6_hashtables.c   |  4 +-
> > >  net/ipv6/tcp_ipv6.c           |  4 +-
> > >  9 files changed, 97 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > index a0887b70967b..2c331ce6ca73 100644
> > > --- a/include/net/inet_hashtables.h
> > > +++ b/include/net/inet_hashtables.h
> > > @@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
> > >  }
> > > 
> > >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                     struct sock *sk, u64 port_offset,
> > > +                     struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> > >                       int (*check_established)(struct inet_timewait_death_row *,
> > >                                                struct sock *, __u16,
> > >                                                struct inet_timewait_sock **));
> > > 
> > >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                   struct sock *sk);
> > > +                   struct sock *sk, bool prev_inaddr_any);
> > > +
> > > +int inet_bhash2_update_saddr(struct sock *sk);
> > >  #endif /* _INET_HASHTABLES_H */
> > > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> > > index 5b38bf1a586b..6a50aca56d50 100644
> > > --- a/include/net/ipv6.h
> > > +++ b/include/net/ipv6.h
> > > @@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
> > >               unsigned long arg);
> > > 
> > >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                           struct sock *sk);
> > > +                    struct sock *sk, bool prev_inaddr_any);
> > >  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
> > >  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> > >                 int flags);
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index da6e3b20cd75..37a8bc3ee49e 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > >       struct inet_sock *inet = inet_sk(sk);
> > >       struct dccp_sock *dp = dccp_sk(sk);
> > > +     struct ip_options_rcu *inet_opt;
> > >       __be16 orig_sport, orig_dport;
> > > +     bool prev_inaddr_any = false;
> > >       __be32 daddr, nexthop;
> > >       struct flowi4 *fl4;
> > >       struct rtable *rt;
> > >       int err;
> > > -     struct ip_options_rcu *inet_opt;
> > > 
> > >       dp->dccps_role = DCCP_ROLE_CLIENT;
> > > 
> > > @@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       if (inet_opt == NULL || !inet_opt->opt.srr)
> > >               daddr = fl4->daddr;
> > > 
> > > -     if (inet->inet_saddr == 0)
> > > +     if (inet->inet_saddr == 0) {
> > >               inet->inet_saddr = fl4->saddr;
> > > +             prev_inaddr_any = true;
> > > +     }
> > > +
> > >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > >       inet->inet_dport = usin->sin_port;
> > >       sk_daddr_set(sk, daddr);
> > > @@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >        * complete initialization after this.
> > >        */
> > >       dccp_set_state(sk, DCCP_REQUESTING);
> > > -     err = inet_hash_connect(&dccp_death_row, sk);
> > > +     err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> > >       if (err != 0)
> > >               goto failure;
> > > 
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index fd44638ec16b..03013522acab 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       struct ipv6_pinfo *np = inet6_sk(sk);
> > >       struct dccp_sock *dp = dccp_sk(sk);
> > >       struct in6_addr *saddr = NULL, *final_p, final;
> > > +     bool prev_inaddr_any = false;
> > >       struct ipv6_txoptions *opt;
> > >       struct flowi6 fl6;
> > >       struct dst_entry *dst;
> > > @@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       if (saddr == NULL) {
> > >               saddr = &fl6.saddr;
> > >               sk->sk_v6_rcv_saddr = *saddr;
> > > +             prev_inaddr_any = true;
> > >       }
> > > 
> > >       /* set the source address */
> > > @@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       inet->inet_dport = usin->sin6_port;
> > > 
> > >       dccp_set_state(sk, DCCP_REQUESTING);
> > > -     err = inet6_hash_connect(&dccp_death_row, sk);
> > > +     err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> > >       if (err)
> > >               goto late_failure;
> > > 
> > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > index 93da9f783bec..ad627a99ff9d 100644
> > > --- a/net/ipv4/af_inet.c
> > > +++ b/net/ipv4/af_inet.c
> > > @@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > >       struct inet_sock *inet = inet_sk(sk);
> > >       __be32 old_saddr = inet->inet_saddr;
> > >       __be32 daddr = inet->inet_daddr;
> > > +     struct ip_options_rcu *inet_opt;
> > >       struct flowi4 *fl4;
> > >       struct rtable *rt;
> > >       __be32 new_saddr;
> > > -     struct ip_options_rcu *inet_opt;
> > > +     int err;
> > > 
> > >       inet_opt = rcu_dereference_protected(inet->inet_opt,
> > >                                            lockdep_sock_is_held(sk));
> > > @@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > 
> > >       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > 
> > > +     err = inet_bhash2_update_saddr(sk);
> > > +     if (err)
> > > +             return err;
> > > +
> > >       /*
> > >        * XXX The only one ugly spot where we need to
> > >        * XXX really change the sockets identity after
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index e8de5e699b3f..592b70663a3b 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> > >       return bhash2;
> > >  }
> > > 
> > > +/* the lock for the socket's corresponding bhash entry must be held */
> > > +static int __inet_bhash2_update_saddr(struct sock *sk,
> > > +                                   struct inet_hashinfo *hinfo,
> > > +                                   struct net *net, int port, int l3mdev)
> > > +{
> > > +     struct inet_bind2_hashbucket *head2;
> > > +     struct inet_bind2_bucket *tb2;
> > > +
> > > +     tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
> > > +                                  &head2);
> > > +     if (!tb2) {
> > > +             tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
> > > +                                            net, head2, port, l3mdev, sk);
> > > +             if (!tb2)
> > > +                     return -ENOMEM;
> > > +     }
> > > +
> > > +     /* Remove the socket's old entry from bhash2 */
> > > +     __sk_del_bind2_node(sk);
> > > +
> > > +     sk_add_bind2_node(sk, &tb2->owners);
> > > +     inet_csk(sk)->icsk_bind2_hash = tb2;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +/* This should be called if/when a socket's rcv saddr changes after it has
> > > + * been binded.
> > > + */
> > > +int inet_bhash2_update_saddr(struct sock *sk)
> > > +{
> > > +     struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> > > +     int l3mdev = inet_sk_bound_l3mdev(sk);
> > > +     struct inet_bind_hashbucket *head;
> > > +     int port = inet_sk(sk)->inet_num;
> > > +     struct net *net = sock_net(sk);
> > > +     int err;
> > > +
> > > +     head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
> > > +
> > > +     spin_lock_bh(&head->lock);
> > > +
> > > +     err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
> > > +
> > > +     spin_unlock_bh(&head->lock);
> > > +
> > > +     return err;
> > > +}
> > > +
> > >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> > >   * Note that we use 32bit integers (vs RFC 'short integers')
> > >   * because 2^16 is not a multiple of num_ephemeral and this
> > > @@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> > >  static u32 *table_perturb;
> > > 
> > >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -             struct sock *sk, u64 port_offset,
> > > +             struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> > >               int (*check_established)(struct inet_timewait_death_row *,
> > >                       struct sock *, __u16, struct inet_timewait_sock **))
> > >  {
> > > @@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > >       int l3mdev;
> > >       u32 index;
> > > 
> > > +     l3mdev = inet_sk_bound_l3mdev(sk);
> > > +
> > >       if (port) {
> > >               head = &hinfo->bhash[inet_bhashfn(net, port,
> > >                                                 hinfo->bhash_size)];
> > >               tb = inet_csk(sk)->icsk_bind_hash;
> > > +
> > >               spin_lock_bh(&head->lock);
> > > +
> > > +             if (prev_inaddr_any) {
> > > +                     ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
> > > +                                                      l3mdev);
> > > +                     if (ret) {
> > > +                             spin_unlock_bh(&head->lock);
> > > +                             return ret;
> > > +                     }
> > > +             }
> > > +
> > >               if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
> > >                       inet_ehash_nolisten(sk, NULL, NULL);
> > >                       spin_unlock_bh(&head->lock);
> > > @@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > >               return ret;
> > >       }
> > > 
> > > -     l3mdev = inet_sk_bound_l3mdev(sk);
> > > -
> > >       inet_get_local_port_range(net, &low, &high);
> > >       high++; /* [32768, 60999] -> [32768, 61000[ */
> > >       remaining = high - low;
> > > @@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > >   * Bind a port for a connect operation and hash it.
> > >   */
> > >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                   struct sock *sk)
> > > +                   struct sock *sk, bool prev_inaddr_any)
> > >  {
> > >       u64 port_offset = 0;
> > > 
> > >       if (!inet_sk(sk)->inet_num)
> > >               port_offset = inet_sk_port_offset(sk);
> > > -     return __inet_hash_connect(death_row, sk, port_offset,
> > > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> > >                                  __inet_check_established);
> > >  }
> > >  EXPORT_SYMBOL_GPL(inet_hash_connect);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index dac2650f3863..adf8d750933d 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       struct inet_sock *inet = inet_sk(sk);
> > >       struct tcp_sock *tp = tcp_sk(sk);
> > >       __be16 orig_sport, orig_dport;
> > > +     bool prev_inaddr_any = false;
> > >       __be32 daddr, nexthop;
> > >       struct flowi4 *fl4;
> > >       struct rtable *rt;
> > > @@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >       if (!inet_opt || !inet_opt->opt.srr)
> > >               daddr = fl4->daddr;
> > > 
> > > -     if (!inet->inet_saddr)
> > > +     if (!inet->inet_saddr) {
> > >               inet->inet_saddr = fl4->saddr;
> > > +             prev_inaddr_any = true;
> > > +     }
> > > +
> > >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > 
> > >       if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > @@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >        * complete initialization after this.
> > >        */
> > >       tcp_set_state(sk, TCP_SYN_SENT);
> > > -     err = inet_hash_connect(tcp_death_row, sk);
> > > +     err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> > >       if (err)
> > >               goto failure;
> > > 
> > > diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> > > index 7d53d62783b1..c87c5933f3be 100644
> > > --- a/net/ipv6/inet6_hashtables.c
> > > +++ b/net/ipv6/inet6_hashtables.c
> > > @@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
> > >  }
> > > 
> > >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > > -                    struct sock *sk)
> > > +                    struct sock *sk, bool prev_inaddr_any)
> > >  {
> > >       u64 port_offset = 0;
> > > 
> > >       if (!inet_sk(sk)->inet_num)
> > >               port_offset = inet6_sk_port_offset(sk);
> > > -     return __inet_hash_connect(death_row, sk, port_offset,
> > > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> > >                                  __inet6_check_established);
> > >  }
> > >  EXPORT_SYMBOL_GPL(inet6_hash_connect);
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index f37dd4aa91c6..81e3312c2a97 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       struct ipv6_pinfo *np = tcp_inet6_sk(sk);
> > >       struct tcp_sock *tp = tcp_sk(sk);
> > >       struct in6_addr *saddr = NULL, *final_p, final;
> > > +     bool prev_inaddr_any = false;
> > >       struct ipv6_txoptions *opt;
> > >       struct flowi6 fl6;
> > >       struct dst_entry *dst;
> > > @@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >       if (!saddr) {
> > >               saddr = &fl6.saddr;
> > >               sk->sk_v6_rcv_saddr = *saddr;
> > > +             prev_inaddr_any = true;
> > >       }
> > > 
> > >       /* set the source address */
> > > @@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > 
> > >       tcp_set_state(sk, TCP_SYN_SENT);
> > >       tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
> > > -     err = inet6_hash_connect(tcp_death_row, sk);
> > > +     err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> > >       if (err)
> > >               goto late_failure;
> > > 
> > 
> > I'm sorry for the late notice, but it looks like that the mptcp
> > syzkaller instance is still hitting the Warning in icsk_get_port on top
> > of the v1 of this series:
> > 
> > https://github.com/multipath-tcp/mptcp_net-next/issues/279
> > 
> > and the change in v2 should not address that. @Mat could you please
> > confirm the above?
> > 
> > Dumb question: I don't understand how the locking in bhash2 works.
> > Could you explain that?
> > 
> > What happens when 2 different processes bind different sockets on
> > different ports (with different bhash buckets) using different
> > addresses so that they hit the same bhash2 bucket? AFAICS each process
> > will use a different lock and access/modification to bhash2 could
> > happen simultaneusly?
> Hi Paolo. Yes, I think you are correct here that there could be a
> scenario where this happens. Unfortunately, I think this means the
> bhash2 table will need its own lock. I will submit a follow-up for
> this.
> 

I'm wondering if we could (and more importantly, if it would make any
sense) resort to use bhash2 usage only?

e.g. add a spinlock per row to bhash2, use it to protect row
manipulation, always hash into bhash2 and then drop bhash, similar to
commit cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120?

Cheers,

Paolo





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 1/2] net: Update bhash2 when socket's rcv saddr changes
  2022-06-08  7:35       ` Paolo Abeni
@ 2022-06-08 17:47         ` Joanne Koong
  0 siblings, 0 replies; 18+ messages in thread
From: Joanne Koong @ 2022-06-08 17:47 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, syzbot, Mat Martineau

On Wed, Jun 8, 2022 at 12:35 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2022-06-07 at 13:24 -0700, Joanne Koong wrote:
> > On Tue, Jun 7, 2022 at 1:33 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > Hello,
> > >
> > > On Thu, 2022-06-02 at 09:51 -0700, Joanne Koong wrote:
> > > > Commit d5a42de8bdbe ("net: Add a second bind table hashed by port and
> > > > address") added a second bind table, bhash2, that hashes by a socket's port
> > > > and rcv address.
> > > >
> > > > However, there are two cases where the socket's rcv saddr can change
> > > > after it has been binded:
> > > >
> > > > 1) The case where there is a bind() call on "::" (IPADDR_ANY) and then
> > > > a connect() call. The kernel will assign the socket an address when it
> > > > handles the connect()
> > > >
> > > > 2) In inet_sk_reselect_saddr(), which is called when rerouting fails
> > > > when rebuilding the sk header (invoked by inet_sk_rebuild_header)
> > > >
> > > > In these two cases, we need to update the bhash2 table by removing the
> > > > entry for the old address, and adding a new entry reflecting the updated
> > > > address.
> > > >
> > > > Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
> > > > Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > > > ---
> > > >  include/net/inet_hashtables.h |  6 ++-
> > > >  include/net/ipv6.h            |  2 +-
> > > >  net/dccp/ipv4.c               | 10 +++--
> > > >  net/dccp/ipv6.c               |  4 +-
> > > >  net/ipv4/af_inet.c            |  7 +++-
> > > >  net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++++---
> > > >  net/ipv4/tcp_ipv4.c           |  8 +++-
> > > >  net/ipv6/inet6_hashtables.c   |  4 +-
> > > >  net/ipv6/tcp_ipv6.c           |  4 +-
> > > >  9 files changed, 97 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > index a0887b70967b..2c331ce6ca73 100644
> > > > --- a/include/net/inet_hashtables.h
> > > > +++ b/include/net/inet_hashtables.h
> > > > @@ -448,11 +448,13 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
> > > >  }
> > > >
> > > >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > > -                     struct sock *sk, u64 port_offset,
> > > > +                     struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> > > >                       int (*check_established)(struct inet_timewait_death_row *,
> > > >                                                struct sock *, __u16,
> > > >                                                struct inet_timewait_sock **));
> > > >
> > > >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > > -                   struct sock *sk);
> > > > +                   struct sock *sk, bool prev_inaddr_any);
> > > > +
> > > > +int inet_bhash2_update_saddr(struct sock *sk);
> > > >  #endif /* _INET_HASHTABLES_H */
> > > > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> > > > index 5b38bf1a586b..6a50aca56d50 100644
> > > > --- a/include/net/ipv6.h
> > > > +++ b/include/net/ipv6.h
> > > > @@ -1187,7 +1187,7 @@ int inet6_compat_ioctl(struct socket *sock, unsigned int cmd,
> > > >               unsigned long arg);
> > > >
> > > >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > > > -                           struct sock *sk);
> > > > +                    struct sock *sk, bool prev_inaddr_any);
> > > >  int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
> > > >  int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> > > >                 int flags);
> > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > index da6e3b20cd75..37a8bc3ee49e 100644
> > > > --- a/net/dccp/ipv4.c
> > > > +++ b/net/dccp/ipv4.c
> > > > @@ -47,12 +47,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >       const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > >       struct inet_sock *inet = inet_sk(sk);
> > > >       struct dccp_sock *dp = dccp_sk(sk);
> > > > +     struct ip_options_rcu *inet_opt;
> > > >       __be16 orig_sport, orig_dport;
> > > > +     bool prev_inaddr_any = false;
> > > >       __be32 daddr, nexthop;
> > > >       struct flowi4 *fl4;
> > > >       struct rtable *rt;
> > > >       int err;
> > > > -     struct ip_options_rcu *inet_opt;
> > > >
> > > >       dp->dccps_role = DCCP_ROLE_CLIENT;
> > > >
> > > > @@ -89,8 +90,11 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >       if (inet_opt == NULL || !inet_opt->opt.srr)
> > > >               daddr = fl4->daddr;
> > > >
> > > > -     if (inet->inet_saddr == 0)
> > > > +     if (inet->inet_saddr == 0) {
> > > >               inet->inet_saddr = fl4->saddr;
> > > > +             prev_inaddr_any = true;
> > > > +     }
> > > > +
> > > >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > >       inet->inet_dport = usin->sin_port;
> > > >       sk_daddr_set(sk, daddr);
> > > > @@ -105,7 +109,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >        * complete initialization after this.
> > > >        */
> > > >       dccp_set_state(sk, DCCP_REQUESTING);
> > > > -     err = inet_hash_connect(&dccp_death_row, sk);
> > > > +     err = inet_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> > > >       if (err != 0)
> > > >               goto failure;
> > > >
> > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > index fd44638ec16b..03013522acab 100644
> > > > --- a/net/dccp/ipv6.c
> > > > +++ b/net/dccp/ipv6.c
> > > > @@ -824,6 +824,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >       struct ipv6_pinfo *np = inet6_sk(sk);
> > > >       struct dccp_sock *dp = dccp_sk(sk);
> > > >       struct in6_addr *saddr = NULL, *final_p, final;
> > > > +     bool prev_inaddr_any = false;
> > > >       struct ipv6_txoptions *opt;
> > > >       struct flowi6 fl6;
> > > >       struct dst_entry *dst;
> > > > @@ -936,6 +937,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >       if (saddr == NULL) {
> > > >               saddr = &fl6.saddr;
> > > >               sk->sk_v6_rcv_saddr = *saddr;
> > > > +             prev_inaddr_any = true;
> > > >       }
> > > >
> > > >       /* set the source address */
> > > > @@ -951,7 +953,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >       inet->inet_dport = usin->sin6_port;
> > > >
> > > >       dccp_set_state(sk, DCCP_REQUESTING);
> > > > -     err = inet6_hash_connect(&dccp_death_row, sk);
> > > > +     err = inet6_hash_connect(&dccp_death_row, sk, prev_inaddr_any);
> > > >       if (err)
> > > >               goto late_failure;
> > > >
> > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > index 93da9f783bec..ad627a99ff9d 100644
> > > > --- a/net/ipv4/af_inet.c
> > > > +++ b/net/ipv4/af_inet.c
> > > > @@ -1221,10 +1221,11 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > >       struct inet_sock *inet = inet_sk(sk);
> > > >       __be32 old_saddr = inet->inet_saddr;
> > > >       __be32 daddr = inet->inet_daddr;
> > > > +     struct ip_options_rcu *inet_opt;
> > > >       struct flowi4 *fl4;
> > > >       struct rtable *rt;
> > > >       __be32 new_saddr;
> > > > -     struct ip_options_rcu *inet_opt;
> > > > +     int err;
> > > >
> > > >       inet_opt = rcu_dereference_protected(inet->inet_opt,
> > > >                                            lockdep_sock_is_held(sk));
> > > > @@ -1253,6 +1254,10 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > >
> > > >       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > >
> > > > +     err = inet_bhash2_update_saddr(sk);
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > >       /*
> > > >        * XXX The only one ugly spot where we need to
> > > >        * XXX really change the sockets identity after
> > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > index e8de5e699b3f..592b70663a3b 100644
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -826,6 +826,55 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> > > >       return bhash2;
> > > >  }
> > > >
> > > > +/* the lock for the socket's corresponding bhash entry must be held */
> > > > +static int __inet_bhash2_update_saddr(struct sock *sk,
> > > > +                                   struct inet_hashinfo *hinfo,
> > > > +                                   struct net *net, int port, int l3mdev)
> > > > +{
> > > > +     struct inet_bind2_hashbucket *head2;
> > > > +     struct inet_bind2_bucket *tb2;
> > > > +
> > > > +     tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk,
> > > > +                                  &head2);
> > > > +     if (!tb2) {
> > > > +             tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
> > > > +                                            net, head2, port, l3mdev, sk);
> > > > +             if (!tb2)
> > > > +                     return -ENOMEM;
> > > > +     }
> > > > +
> > > > +     /* Remove the socket's old entry from bhash2 */
> > > > +     __sk_del_bind2_node(sk);
> > > > +
> > > > +     sk_add_bind2_node(sk, &tb2->owners);
> > > > +     inet_csk(sk)->icsk_bind2_hash = tb2;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +/* This should be called if/when a socket's rcv saddr changes after it has
> > > > + * been binded.
> > > > + */
> > > > +int inet_bhash2_update_saddr(struct sock *sk)
> > > > +{
> > > > +     struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> > > > +     int l3mdev = inet_sk_bound_l3mdev(sk);
> > > > +     struct inet_bind_hashbucket *head;
> > > > +     int port = inet_sk(sk)->inet_num;
> > > > +     struct net *net = sock_net(sk);
> > > > +     int err;
> > > > +
> > > > +     head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
> > > > +
> > > > +     spin_lock_bh(&head->lock);
> > > > +
> > > > +     err = __inet_bhash2_update_saddr(sk, hinfo, net, port, l3mdev);
> > > > +
> > > > +     spin_unlock_bh(&head->lock);
> > > > +
> > > > +     return err;
> > > > +}
> > > > +
> > > >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> > > >   * Note that we use 32bit integers (vs RFC 'short integers')
> > > >   * because 2^16 is not a multiple of num_ephemeral and this
> > > > @@ -840,7 +889,7 @@ inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net,
> > > >  static u32 *table_perturb;
> > > >
> > > >  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > > -             struct sock *sk, u64 port_offset,
> > > > +             struct sock *sk, u64 port_offset, bool prev_inaddr_any,
> > > >               int (*check_established)(struct inet_timewait_death_row *,
> > > >                       struct sock *, __u16, struct inet_timewait_sock **))
> > > >  {
> > > > @@ -858,11 +907,24 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > >       int l3mdev;
> > > >       u32 index;
> > > >
> > > > +     l3mdev = inet_sk_bound_l3mdev(sk);
> > > > +
> > > >       if (port) {
> > > >               head = &hinfo->bhash[inet_bhashfn(net, port,
> > > >                                                 hinfo->bhash_size)];
> > > >               tb = inet_csk(sk)->icsk_bind_hash;
> > > > +
> > > >               spin_lock_bh(&head->lock);
> > > > +
> > > > +             if (prev_inaddr_any) {
> > > > +                     ret = __inet_bhash2_update_saddr(sk, hinfo, net, port,
> > > > +                                                      l3mdev);
> > > > +                     if (ret) {
> > > > +                             spin_unlock_bh(&head->lock);
> > > > +                             return ret;
> > > > +                     }
> > > > +             }
> > > > +
> > > >               if (sk_head(&tb->owners) == sk && !sk->sk_bind_node.next) {
> > > >                       inet_ehash_nolisten(sk, NULL, NULL);
> > > >                       spin_unlock_bh(&head->lock);
> > > > @@ -875,8 +937,6 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > >               return ret;
> > > >       }
> > > >
> > > > -     l3mdev = inet_sk_bound_l3mdev(sk);
> > > > -
> > > >       inet_get_local_port_range(net, &low, &high);
> > > >       high++; /* [32768, 60999] -> [32768, 61000[ */
> > > >       remaining = high - low;
> > > > @@ -987,13 +1047,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > >   * Bind a port for a connect operation and hash it.
> > > >   */
> > > >  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> > > > -                   struct sock *sk)
> > > > +                   struct sock *sk, bool prev_inaddr_any)
> > > >  {
> > > >       u64 port_offset = 0;
> > > >
> > > >       if (!inet_sk(sk)->inet_num)
> > > >               port_offset = inet_sk_port_offset(sk);
> > > > -     return __inet_hash_connect(death_row, sk, port_offset,
> > > > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> > > >                                  __inet_check_established);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(inet_hash_connect);
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index dac2650f3863..adf8d750933d 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -203,6 +203,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >       struct inet_sock *inet = inet_sk(sk);
> > > >       struct tcp_sock *tp = tcp_sk(sk);
> > > >       __be16 orig_sport, orig_dport;
> > > > +     bool prev_inaddr_any = false;
> > > >       __be32 daddr, nexthop;
> > > >       struct flowi4 *fl4;
> > > >       struct rtable *rt;
> > > > @@ -246,8 +247,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >       if (!inet_opt || !inet_opt->opt.srr)
> > > >               daddr = fl4->daddr;
> > > >
> > > > -     if (!inet->inet_saddr)
> > > > +     if (!inet->inet_saddr) {
> > > >               inet->inet_saddr = fl4->saddr;
> > > > +             prev_inaddr_any = true;
> > > > +     }
> > > > +
> > > >       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > >
> > > >       if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > @@ -273,7 +277,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >        * complete initialization after this.
> > > >        */
> > > >       tcp_set_state(sk, TCP_SYN_SENT);
> > > > -     err = inet_hash_connect(tcp_death_row, sk);
> > > > +     err = inet_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> > > >       if (err)
> > > >               goto failure;
> > > >
> > > > diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> > > > index 7d53d62783b1..c87c5933f3be 100644
> > > > --- a/net/ipv6/inet6_hashtables.c
> > > > +++ b/net/ipv6/inet6_hashtables.c
> > > > @@ -317,13 +317,13 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
> > > >  }
> > > >
> > > >  int inet6_hash_connect(struct inet_timewait_death_row *death_row,
> > > > -                    struct sock *sk)
> > > > +                    struct sock *sk, bool prev_inaddr_any)
> > > >  {
> > > >       u64 port_offset = 0;
> > > >
> > > >       if (!inet_sk(sk)->inet_num)
> > > >               port_offset = inet6_sk_port_offset(sk);
> > > > -     return __inet_hash_connect(death_row, sk, port_offset,
> > > > +     return __inet_hash_connect(death_row, sk, port_offset, prev_inaddr_any,
> > > >                                  __inet6_check_established);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(inet6_hash_connect);
> > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > index f37dd4aa91c6..81e3312c2a97 100644
> > > > --- a/net/ipv6/tcp_ipv6.c
> > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > @@ -152,6 +152,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >       struct ipv6_pinfo *np = tcp_inet6_sk(sk);
> > > >       struct tcp_sock *tp = tcp_sk(sk);
> > > >       struct in6_addr *saddr = NULL, *final_p, final;
> > > > +     bool prev_inaddr_any = false;
> > > >       struct ipv6_txoptions *opt;
> > > >       struct flowi6 fl6;
> > > >       struct dst_entry *dst;
> > > > @@ -289,6 +290,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >       if (!saddr) {
> > > >               saddr = &fl6.saddr;
> > > >               sk->sk_v6_rcv_saddr = *saddr;
> > > > +             prev_inaddr_any = true;
> > > >       }
> > > >
> > > >       /* set the source address */
> > > > @@ -309,7 +311,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >
> > > >       tcp_set_state(sk, TCP_SYN_SENT);
> > > >       tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
> > > > -     err = inet6_hash_connect(tcp_death_row, sk);
> > > > +     err = inet6_hash_connect(tcp_death_row, sk, prev_inaddr_any);
> > > >       if (err)
> > > >               goto late_failure;
> > > >
> > >
> > > I'm sorry for the late notice, but it looks like that the mptcp
> > > syzkaller instance is still hitting the Warning in icsk_get_port on top
> > > of the v1 of this series:
> > >
> > > https://github.com/multipath-tcp/mptcp_net-next/issues/279
> > >
> > > and the change in v2 should not address that. @Mat could you please
> > > confirm the above?
> > >
> > > Dumb question: I don't understand how the locking in bhash2 works.
> > > Could you explain that?
> > >
> > > What happens when 2 different processes bind different sockets on
> > > different ports (with different bhash buckets) using different
> > > addresses so that they hit the same bhash2 bucket? AFAICS each process
> > > will use a different lock and access/modification to bhash2 could
> > > happen simultaneusly?
> > Hi Paolo. Yes, I think you are correct here that there could be a
> > scenario where this happens. Unfortunately, I think this means the
> > bhash2 table will need its own lock. I will submit a follow-up for
> > this.
> >
>
> I'm wondering if we could (and more importantly, if it would make any
> sense) resort to use bhash2 usage only?
>
> e.g. add a spinlock per row to bhash2, use it to protect row
> manipulation, always hash into bhash2 and then drop bhash, similar to
> commit cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120?

This would be the ideal solution but I think we need the bhash table
to handle the case where the bind request is for address INADDR_ANY.
If the bind request is for INADDR_ANY on a port, then we need to check
every socket already bound to that port to determine if there is a
conflict. Without the bhash table (which hashes only by port number),
we don't have a way of getting all the sockets that are bound to a
specified port.

>
> Cheers,
>
> Paolo
>
>
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-07 17:31       ` Mat Martineau
@ 2022-06-08 20:27         ` Joanne Koong
  2022-06-09 23:52           ` Joanne Koong
  2022-06-09 23:59           ` Mat Martineau
  0 siblings, 2 replies; 18+ messages in thread
From: Joanne Koong @ 2022-06-08 20:27 UTC (permalink / raw)
  To: Mat Martineau
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Tue, Jun 7, 2022 at 10:33 AM Mat Martineau
<mathew.j.martineau@linux.intel.com> wrote:
>
> On Mon, 6 Jun 2022, Joanne Koong wrote:
>
> > On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
> >> <mathew.j.martineau@linux.intel.com> wrote:
> >>>
> >>> On Thu, 2 Jun 2022, Joanne Koong wrote:
> >>>
> >>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
> >>>> case where a socket's rcv saddr changes after it is binded. (For more
> >>>> details, please see the commit message of the first patch)
> >>>>
> >>>> This patchset fixes that and adds a test that triggers the case where the
> >>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
> >>>>
> >>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
> >>>>
> >>>> --
> >>>> v1 -> v2:
> >>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
> >>>> * Mark __inet_bhash2_update_saddr as static
> >>>>
> >>>> Joanne Koong (2):
> >>>>  net: Update bhash2 when socket's rcv saddr changes
> >>>>  selftests/net: Add sk_bind_sendto_listen test
> >>>>
> >>>
> >>> Hi Joanne -
> >>>
> >>> I've been running my own syzkaller instance with v1 of this fix for a
> >>> couple of days. Before this patch, syzkaller would trigger the
> >>> inet_csk_get_port warning a couple of times per hour. After this patch it
> >>> took two days to show the warning:
> >>>
> >>> ------------[ cut here ]------------
> >>>
> >>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
> >>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> >>> Modules linked in:
> >>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
> >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> >>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> >>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
> >>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
> >>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
> >>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
> >>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
> >>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
> >>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
> >>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
> >>> PKRU: 55555554
> >>> Call Trace:
> >>>   <TASK>
> >>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
> >>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
> >>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
> >>>   __sys_listen+0x189/0x260 net/socket.c:1810
> >>>   __do_sys_listen net/socket.c:1819 [inline]
> >>>   __se_sys_listen net/socket.c:1817 [inline]
> >>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
> >>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
> >>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>> RIP: 0033:0x7fada558f92d
> >>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> >>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
> >>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
> >>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
> >>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
> >>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
> >>>   </TASK>
> >>> irq event stamp: 2078
> >>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
> >>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
> >>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
> >>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
> >>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
> >>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
> >>> ---[ end trace 0000000000000000 ]---
> >>>
> >>>
> >>> In the full log file it does look like syskaller is doing something
> >>> strange since it's calling bind, connect, and listen on the same socket:
> >>>
> >>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
> >>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
> >>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
> >>> listen(r4, 0x0)
> >>>
> >>> ...but it is a fuzz tester after all.
> >>>
> >>> I've uploaded the full syzkaller logs to this GitHub issue:
> >>>
> >>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
> >>>
> >>>
> >>> Not sure yet if this is MPTCP-related. I don't think MPTCP
> >>> changes anything with the subflow TCP socket bind hashes.
> >>>
> >> Hi Mat,
> >>
> >> Thanks for bringing this up and for uploading the logs. I will look into this.
> >>>
> > Hi Mat,
> >
> > I am still trying to configure my local environment for mptcp to repro
> > + test the fix to verify that it is correct. I think the fix is to add
> > "inet_bhash2_update_saddr(msk);" to the end of the
> > "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
> > able to run an instance on your local syzkaller environment with this
> > line added to see if that fixes the warning?
>
> Hi Joanne -
>
> I also investigated that function when trying to figure out why this
> warning might be happening in MPTCP.
>
> In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or
> connect. The msk creates and manages TCP subflow sockets (ssk in
> mptcp_copy_inaddrs()), and passes through bind and connect calls to the
> subflows. The msk depends on the subflow to do the hash updates, since
> those subflow sockets are the ones interacting with the inet layer.
>
> mptcp_copy_inaddrs() copies the already-hashed addresses and ports from
> the ssk to the msk, and we only want the ssk in the hash table.
>
I see, thanks for the explanation, Mat! I will keep investigating.
> --
> Mat Martineau
> Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-08 20:27         ` Joanne Koong
@ 2022-06-09 23:52           ` Joanne Koong
  2022-06-10  0:37             ` Mat Martineau
  2022-06-09 23:59           ` Mat Martineau
  1 sibling, 1 reply; 18+ messages in thread
From: Joanne Koong @ 2022-06-09 23:52 UTC (permalink / raw)
  To: Mat Martineau
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Wed, Jun 8, 2022 at 1:27 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Jun 7, 2022 at 10:33 AM Mat Martineau
> <mathew.j.martineau@linux.intel.com> wrote:
> >
> > On Mon, 6 Jun 2022, Joanne Koong wrote:
> >
> > > On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >>
> > >> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
> > >> <mathew.j.martineau@linux.intel.com> wrote:
> > >>>
> > >>> On Thu, 2 Jun 2022, Joanne Koong wrote:
> > >>>
> > >>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
> > >>>> case where a socket's rcv saddr changes after it is binded. (For more
> > >>>> details, please see the commit message of the first patch)
> > >>>>
> > >>>> This patchset fixes that and adds a test that triggers the case where the
> > >>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
> > >>>>
> > >>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
> > >>>>
> > >>>> --
> > >>>> v1 -> v2:
> > >>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
> > >>>> * Mark __inet_bhash2_update_saddr as static
> > >>>>
> > >>>> Joanne Koong (2):
> > >>>>  net: Update bhash2 when socket's rcv saddr changes
> > >>>>  selftests/net: Add sk_bind_sendto_listen test
> > >>>>
> > >>>
> > >>> Hi Joanne -
> > >>>
> > >>> I've been running my own syzkaller instance with v1 of this fix for a
> > >>> couple of days. Before this patch, syzkaller would trigger the
> > >>> inet_csk_get_port warning a couple of times per hour. After this patch it
> > >>> took two days to show the warning:
> > >>>
> > >>> ------------[ cut here ]------------
> > >>>
> > >>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
> > >>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> > >>> Modules linked in:
> > >>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
> > >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > >>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> > >>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
> > >>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
> > >>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
> > >>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
> > >>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
> > >>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
> > >>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
> > >>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
> > >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
> > >>> PKRU: 55555554
> > >>> Call Trace:
> > >>>   <TASK>
> > >>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
> > >>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
> > >>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
> > >>>   __sys_listen+0x189/0x260 net/socket.c:1810
> > >>>   __do_sys_listen net/socket.c:1819 [inline]
> > >>>   __se_sys_listen net/socket.c:1817 [inline]
> > >>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
> > >>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > >>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
> > >>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > >>> RIP: 0033:0x7fada558f92d
> > >>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> > >>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
> > >>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
> > >>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
> > >>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
> > >>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > >>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
> > >>>   </TASK>
> > >>> irq event stamp: 2078
> > >>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
> > >>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
> > >>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
> > >>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
> > >>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
> > >>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
> > >>> ---[ end trace 0000000000000000 ]---
> > >>>
> > >>>
> > >>> In the full log file it does look like syskaller is doing something
> > >>> strange since it's calling bind, connect, and listen on the same socket:
> > >>>
> > >>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
> > >>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
> > >>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
> > >>> listen(r4, 0x0)
> > >>>
> > >>> ...but it is a fuzz tester after all.
> > >>>
> > >>> I've uploaded the full syzkaller logs to this GitHub issue:
> > >>>
> > >>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
> > >>>
> > >>>
> > >>> Not sure yet if this is MPTCP-related. I don't think MPTCP
> > >>> changes anything with the subflow TCP socket bind hashes.
> > >>>
> > >> Hi Mat,
> > >>
> > >> Thanks for bringing this up and for uploading the logs. I will look into this.
> > >>>
> > > Hi Mat,
> > >
> > > I am still trying to configure my local environment for mptcp to repro
> > > + test the fix to verify that it is correct. I think the fix is to add
> > > "inet_bhash2_update_saddr(msk);" to the end of the
> > > "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
> > > able to run an instance on your local syzkaller environment with this
> > > line added to see if that fixes the warning?
> >
> > Hi Joanne -
> >
> > I also investigated that function when trying to figure out why this
> > warning might be happening in MPTCP.
> >
> > In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or
> > connect. The msk creates and manages TCP subflow sockets (ssk in
> > mptcp_copy_inaddrs()), and passes through bind and connect calls to the
> > subflows. The msk depends on the subflow to do the hash updates, since
> > those subflow sockets are the ones interacting with the inet layer.
> >
> > mptcp_copy_inaddrs() copies the already-hashed addresses and ports from
> > the ssk to the msk, and we only want the ssk in the hash table.
> >
> I see, thanks for the explanation, Mat! I will keep investigating.\

Are you able to repro this warning locally, Mat?

I have a test program that calls:
struct addrinfo hints = {
          .ai_protocol = IPPROTO_TCP,
          .ai_socktype = SOCK_STREAM,
         .ai_family = AF_INET,
};
getaddrinfo("127.0.0.1", "15432", &hints, &addr);
socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
bind(sock_fd, addr->ai_addr, addr->ai_addrlen)
connect(sock_fd, &zeroed_sockaddr, sizeof(zeroed_sockaddr));
listen(sock_fd, 0);

but I'm unable to trigger this warning in my local environment.

I'm not understanding how this warning can get triggered
non-deterministically when the socket resides only within the program.
The only theory that makes sense to me is that the subflow sockets'
saddr changes somewhere after it has been binded, but then wouldn't
that trigger this warning deterministically?

How often are you seeing this warning show up in the mptcp syzkaller?
Is there a way to run a local patch on the mptcp syzkaller (eg like a
patch that prints out extraneous debugging information about
icsk_bind2_hash address, tb2 address, and the socket's saddr changes)
- if so, how can I do this?

Thanks.
> > --
> > Mat Martineau
> > Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-08 20:27         ` Joanne Koong
  2022-06-09 23:52           ` Joanne Koong
@ 2022-06-09 23:59           ` Mat Martineau
  1 sibling, 0 replies; 18+ messages in thread
From: Mat Martineau @ 2022-06-09 23:59 UTC (permalink / raw)
  To: Joanne Koong
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Wed, 8 Jun 2022, Joanne Koong wrote:

> On Tue, Jun 7, 2022 at 10:33 AM Mat Martineau
> <mathew.j.martineau@linux.intel.com> wrote:
>>
>> On Mon, 6 Jun 2022, Joanne Koong wrote:
>>
>>> On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>
>>>> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
>>>> <mathew.j.martineau@linux.intel.com> wrote:
>>>>>
>>>>> On Thu, 2 Jun 2022, Joanne Koong wrote:
>>>>>
>>>>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
>>>>>> case where a socket's rcv saddr changes after it is binded. (For more
>>>>>> details, please see the commit message of the first patch)
>>>>>>
>>>>>> This patchset fixes that and adds a test that triggers the case where the
>>>>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
>>>>>>
>>>>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
>>>>>>
>>>>>> --
>>>>>> v1 -> v2:
>>>>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
>>>>>> * Mark __inet_bhash2_update_saddr as static
>>>>>>
>>>>>> Joanne Koong (2):
>>>>>>  net: Update bhash2 when socket's rcv saddr changes
>>>>>>  selftests/net: Add sk_bind_sendto_listen test
>>>>>>
>>>>>
>>>>> Hi Joanne -
>>>>>
>>>>> I've been running my own syzkaller instance with v1 of this fix for a
>>>>> couple of days. Before this patch, syzkaller would trigger the
>>>>> inet_csk_get_port warning a couple of times per hour. After this patch it
>>>>> took two days to show the warning:
>>>>>
>>>>> ------------[ cut here ]------------
>>>>>
>>>>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
>>>>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>>>> Modules linked in:
>>>>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>>>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
>>>>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
>>>>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
>>>>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
>>>>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
>>>>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
>>>>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
>>>>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
>>>>> PKRU: 55555554
>>>>> Call Trace:
>>>>>   <TASK>
>>>>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
>>>>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
>>>>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
>>>>>   __sys_listen+0x189/0x260 net/socket.c:1810
>>>>>   __do_sys_listen net/socket.c:1819 [inline]
>>>>>   __se_sys_listen net/socket.c:1817 [inline]
>>>>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
>>>>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>>>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
>>>>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>>> RIP: 0033:0x7fada558f92d
>>>>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
>>>>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
>>>>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
>>>>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
>>>>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>>>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
>>>>>   </TASK>
>>>>> irq event stamp: 2078
>>>>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
>>>>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
>>>>> ---[ end trace 0000000000000000 ]---
>>>>>
>>>>>
>>>>> In the full log file it does look like syskaller is doing something
>>>>> strange since it's calling bind, connect, and listen on the same socket:
>>>>>
>>>>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
>>>>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
>>>>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
>>>>> listen(r4, 0x0)
>>>>>
>>>>> ...but it is a fuzz tester after all.
>>>>>
>>>>> I've uploaded the full syzkaller logs to this GitHub issue:
>>>>>
>>>>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>>>>>
>>>>>
>>>>> Not sure yet if this is MPTCP-related. I don't think MPTCP
>>>>> changes anything with the subflow TCP socket bind hashes.
>>>>>
>>>> Hi Mat,
>>>>
>>>> Thanks for bringing this up and for uploading the logs. I will look into this.
>>>>>
>>> Hi Mat,
>>>
>>> I am still trying to configure my local environment for mptcp to repro
>>> + test the fix to verify that it is correct. I think the fix is to add
>>> "inet_bhash2_update_saddr(msk);" to the end of the
>>> "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
>>> able to run an instance on your local syzkaller environment with this
>>> line added to see if that fixes the warning?
>>
>> Hi Joanne -
>>
>> I also investigated that function when trying to figure out why this
>> warning might be happening in MPTCP.
>>
>> In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or
>> connect. The msk creates and manages TCP subflow sockets (ssk in
>> mptcp_copy_inaddrs()), and passes through bind and connect calls to the
>> subflows. The msk depends on the subflow to do the hash updates, since
>> those subflow sockets are the ones interacting with the inet layer.
>>
>> mptcp_copy_inaddrs() copies the already-hashed addresses and ports from
>> the ssk to the msk, and we only want the ssk in the hash table.
>>
> I see, thanks for the explanation, Mat! I will keep investigating.

No problem!

To give you an update, I have not seen the warning again in nearly a week 
of running syzkaller. My theory is that the one I saw last Friday was not 
MPTCP-specific, but may have been related to the locking concern Paolo 
noted.

If I see the warning again I will update you.

Thanks,

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-09 23:52           ` Joanne Koong
@ 2022-06-10  0:37             ` Mat Martineau
  2022-06-10 18:09               ` Joanne Koong
  0 siblings, 1 reply; 18+ messages in thread
From: Mat Martineau @ 2022-06-10  0:37 UTC (permalink / raw)
  To: Joanne Koong
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Thu, 9 Jun 2022, Joanne Koong wrote:

> On Wed, Jun 8, 2022 at 1:27 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Tue, Jun 7, 2022 at 10:33 AM Mat Martineau
>> <mathew.j.martineau@linux.intel.com> wrote:
>>>
>>> On Mon, 6 Jun 2022, Joanne Koong wrote:
>>>
>>>> On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>>
>>>>> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
>>>>> <mathew.j.martineau@linux.intel.com> wrote:
>>>>>>
>>>>>> On Thu, 2 Jun 2022, Joanne Koong wrote:
>>>>>>
>>>>>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
>>>>>>> case where a socket's rcv saddr changes after it is binded. (For more
>>>>>>> details, please see the commit message of the first patch)
>>>>>>>
>>>>>>> This patchset fixes that and adds a test that triggers the case where the
>>>>>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
>>>>>>>
>>>>>>> --
>>>>>>> v1 -> v2:
>>>>>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
>>>>>>> * Mark __inet_bhash2_update_saddr as static
>>>>>>>
>>>>>>> Joanne Koong (2):
>>>>>>>  net: Update bhash2 when socket's rcv saddr changes
>>>>>>>  selftests/net: Add sk_bind_sendto_listen test
>>>>>>>
>>>>>>
>>>>>> Hi Joanne -
>>>>>>
>>>>>> I've been running my own syzkaller instance with v1 of this fix for a
>>>>>> couple of days. Before this patch, syzkaller would trigger the
>>>>>> inet_csk_get_port warning a couple of times per hour. After this patch it
>>>>>> took two days to show the warning:
>>>>>>
>>>>>> ------------[ cut here ]------------
>>>>>>
>>>>>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
>>>>>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>>>>> Modules linked in:
>>>>>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>>>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>>>>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
>>>>>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
>>>>>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
>>>>>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
>>>>>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
>>>>>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
>>>>>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
>>>>>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
>>>>>> PKRU: 55555554
>>>>>> Call Trace:
>>>>>>   <TASK>
>>>>>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
>>>>>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
>>>>>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
>>>>>>   __sys_listen+0x189/0x260 net/socket.c:1810
>>>>>>   __do_sys_listen net/socket.c:1819 [inline]
>>>>>>   __se_sys_listen net/socket.c:1817 [inline]
>>>>>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
>>>>>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>>>>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
>>>>>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>>>> RIP: 0033:0x7fada558f92d
>>>>>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
>>>>>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
>>>>>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
>>>>>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
>>>>>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>>>>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
>>>>>>   </TASK>
>>>>>> irq event stamp: 2078
>>>>>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
>>>>>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
>>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
>>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
>>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
>>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
>>>>>> ---[ end trace 0000000000000000 ]---
>>>>>>
>>>>>>
>>>>>> In the full log file it does look like syskaller is doing something
>>>>>> strange since it's calling bind, connect, and listen on the same socket:
>>>>>>
>>>>>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
>>>>>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
>>>>>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
>>>>>> listen(r4, 0x0)
>>>>>>
>>>>>> ...but it is a fuzz tester after all.
>>>>>>
>>>>>> I've uploaded the full syzkaller logs to this GitHub issue:
>>>>>>
>>>>>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>>>>>>
>>>>>>
>>>>>> Not sure yet if this is MPTCP-related. I don't think MPTCP
>>>>>> changes anything with the subflow TCP socket bind hashes.
>>>>>>
>>>>> Hi Mat,
>>>>>
>>>>> Thanks for bringing this up and for uploading the logs. I will look into this.
>>>>>>
>>>> Hi Mat,
>>>>
>>>> I am still trying to configure my local environment for mptcp to repro
>>>> + test the fix to verify that it is correct. I think the fix is to add
>>>> "inet_bhash2_update_saddr(msk);" to the end of the
>>>> "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
>>>> able to run an instance on your local syzkaller environment with this
>>>> line added to see if that fixes the warning?
>>>
>>> Hi Joanne -
>>>
>>> I also investigated that function when trying to figure out why this
>>> warning might be happening in MPTCP.
>>>
>>> In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or
>>> connect. The msk creates and manages TCP subflow sockets (ssk in
>>> mptcp_copy_inaddrs()), and passes through bind and connect calls to the
>>> subflows. The msk depends on the subflow to do the hash updates, since
>>> those subflow sockets are the ones interacting with the inet layer.
>>>
>>> mptcp_copy_inaddrs() copies the already-hashed addresses and ports from
>>> the ssk to the msk, and we only want the ssk in the hash table.
>>>
>> I see, thanks for the explanation, Mat! I will keep investigating.\
>
> Are you able to repro this warning locally, Mat?
>

This message made it to my inbox after hitting 'send' on my last reply to 
this thread!

As I said there, I haven't seen syzkaller trigger this again. It only 
happened one time after applying the fix, and not at all in the last week 
running syzkaller (nearly) continuously.

> I have a test program that calls:
> struct addrinfo hints = {
>          .ai_protocol = IPPROTO_TCP,
>          .ai_socktype = SOCK_STREAM,
>         .ai_family = AF_INET,
> };
> getaddrinfo("127.0.0.1", "15432", &hints, &addr);
> socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
> bind(sock_fd, addr->ai_addr, addr->ai_addrlen)
> connect(sock_fd, &zeroed_sockaddr, sizeof(zeroed_sockaddr));
> listen(sock_fd, 0);
>
> but I'm unable to trigger this warning in my local environment.
>
> I'm not understanding how this warning can get triggered
> non-deterministically when the socket resides only within the program.
> The only theory that makes sense to me is that the subflow sockets'
> saddr changes somewhere after it has been binded, but then wouldn't
> that trigger this warning deterministically?

I agree, if the MPTCP code was modifying the subflow saddr this would 
trigger more consistently. But as far as I can tell, MPTCP isn't 
changing those values.

>
> How often are you seeing this warning show up in the mptcp syzkaller?
> Is there a way to run a local patch on the mptcp syzkaller (eg like a
> patch that prints out extraneous debugging information about
> icsk_bind2_hash address, tb2 address, and the socket's saddr changes)
> - if so, how can I do this?
>

Since the warning has not repeated in the last week, I'm wondering if 
maybe the fuzz tester happened across the locking issue that Paolo noted? 
The lack of determinism makes me suspect some kind of concurrency issue.

I'm running syzkaller on a large-ish datacenter VM with a kernel I 
compiled, so I can add any patches myself and run with those. Setup for 
syzkaller as I've been using it with MPTCP is documented here: 
https://github.com/multipath-tcp/mptcp_net-next/wiki/Testing#syzkaller

If you have a patch with some extra debug output that you can point me to, 
I can apply that to the kernel I'm running.

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-10  0:37             ` Mat Martineau
@ 2022-06-10 18:09               ` Joanne Koong
  2022-06-10 20:15                 ` Mat Martineau
  0 siblings, 1 reply; 18+ messages in thread
From: Joanne Koong @ 2022-06-10 18:09 UTC (permalink / raw)
  To: Mat Martineau
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Thu, Jun 9, 2022 at 5:37 PM Mat Martineau
<mathew.j.martineau@linux.intel.com> wrote:
>
> On Thu, 9 Jun 2022, Joanne Koong wrote:
>
> > On Wed, Jun 8, 2022 at 1:27 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >> On Tue, Jun 7, 2022 at 10:33 AM Mat Martineau
> >> <mathew.j.martineau@linux.intel.com> wrote:
> >>>
> >>> On Mon, 6 Jun 2022, Joanne Koong wrote:
> >>>
> >>>> On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>>
> >>>>> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
> >>>>> <mathew.j.martineau@linux.intel.com> wrote:
> >>>>>>
> >>>>>> On Thu, 2 Jun 2022, Joanne Koong wrote:
> >>>>>>
> >>>>>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
> >>>>>>> case where a socket's rcv saddr changes after it is binded. (For more
> >>>>>>> details, please see the commit message of the first patch)
> >>>>>>>
> >>>>>>> This patchset fixes that and adds a test that triggers the case where the
> >>>>>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
> >>>>>>>
> >>>>>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
> >>>>>>>
> >>>>>>> --
> >>>>>>> v1 -> v2:
> >>>>>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
> >>>>>>> * Mark __inet_bhash2_update_saddr as static
> >>>>>>>
> >>>>>>> Joanne Koong (2):
> >>>>>>>  net: Update bhash2 when socket's rcv saddr changes
> >>>>>>>  selftests/net: Add sk_bind_sendto_listen test
> >>>>>>>
> >>>>>>
> >>>>>> Hi Joanne -
> >>>>>>
> >>>>>> I've been running my own syzkaller instance with v1 of this fix for a
> >>>>>> couple of days. Before this patch, syzkaller would trigger the
> >>>>>> inet_csk_get_port warning a couple of times per hour. After this patch it
> >>>>>> took two days to show the warning:
> >>>>>>
> >>>>>> ------------[ cut here ]------------
> >>>>>>
> >>>>>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
> >>>>>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> >>>>>> Modules linked in:
> >>>>>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
> >>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> >>>>>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
> >>>>>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
> >>>>>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
> >>>>>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
> >>>>>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
> >>>>>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
> >>>>>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
> >>>>>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
> >>>>>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
> >>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
> >>>>>> PKRU: 55555554
> >>>>>> Call Trace:
> >>>>>>   <TASK>
> >>>>>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
> >>>>>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
> >>>>>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
> >>>>>>   __sys_listen+0x189/0x260 net/socket.c:1810
> >>>>>>   __do_sys_listen net/socket.c:1819 [inline]
> >>>>>>   __se_sys_listen net/socket.c:1817 [inline]
> >>>>>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
> >>>>>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >>>>>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
> >>>>>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>>>>> RIP: 0033:0x7fada558f92d
> >>>>>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> >>>>>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
> >>>>>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
> >>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
> >>>>>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
> >>>>>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >>>>>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
> >>>>>>   </TASK>
> >>>>>> irq event stamp: 2078
> >>>>>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
> >>>>>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
> >>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
> >>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
> >>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
> >>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
> >>>>>> ---[ end trace 0000000000000000 ]---
> >>>>>>
> >>>>>>
> >>>>>> In the full log file it does look like syskaller is doing something
> >>>>>> strange since it's calling bind, connect, and listen on the same socket:
> >>>>>>
> >>>>>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
> >>>>>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
> >>>>>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
> >>>>>> listen(r4, 0x0)
> >>>>>>
> >>>>>> ...but it is a fuzz tester after all.
> >>>>>>
> >>>>>> I've uploaded the full syzkaller logs to this GitHub issue:
> >>>>>>
> >>>>>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
> >>>>>>
> >>>>>>
> >>>>>> Not sure yet if this is MPTCP-related. I don't think MPTCP
> >>>>>> changes anything with the subflow TCP socket bind hashes.
> >>>>>>
> >>>>> Hi Mat,
> >>>>>
> >>>>> Thanks for bringing this up and for uploading the logs. I will look into this.
> >>>>>>
> >>>> Hi Mat,
> >>>>
> >>>> I am still trying to configure my local environment for mptcp to repro
> >>>> + test the fix to verify that it is correct. I think the fix is to add
> >>>> "inet_bhash2_update_saddr(msk);" to the end of the
> >>>> "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
> >>>> able to run an instance on your local syzkaller environment with this
> >>>> line added to see if that fixes the warning?
> >>>
> >>> Hi Joanne -
> >>>
> >>> I also investigated that function when trying to figure out why this
> >>> warning might be happening in MPTCP.
> >>>
> >>> In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or
> >>> connect. The msk creates and manages TCP subflow sockets (ssk in
> >>> mptcp_copy_inaddrs()), and passes through bind and connect calls to the
> >>> subflows. The msk depends on the subflow to do the hash updates, since
> >>> those subflow sockets are the ones interacting with the inet layer.
> >>>
> >>> mptcp_copy_inaddrs() copies the already-hashed addresses and ports from
> >>> the ssk to the msk, and we only want the ssk in the hash table.
> >>>
> >> I see, thanks for the explanation, Mat! I will keep investigating.\
> >
> > Are you able to repro this warning locally, Mat?
> >
>
> This message made it to my inbox after hitting 'send' on my last reply to
> this thread!
>
> As I said there, I haven't seen syzkaller trigger this again. It only
> happened one time after applying the fix, and not at all in the last week
> running syzkaller (nearly) continuously.
>
> > I have a test program that calls:
> > struct addrinfo hints = {
> >          .ai_protocol = IPPROTO_TCP,
> >          .ai_socktype = SOCK_STREAM,
> >         .ai_family = AF_INET,
> > };
> > getaddrinfo("127.0.0.1", "15432", &hints, &addr);
> > socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
> > bind(sock_fd, addr->ai_addr, addr->ai_addrlen)
> > connect(sock_fd, &zeroed_sockaddr, sizeof(zeroed_sockaddr));
> > listen(sock_fd, 0);
> >
> > but I'm unable to trigger this warning in my local environment.
> >
> > I'm not understanding how this warning can get triggered
> > non-deterministically when the socket resides only within the program.
> > The only theory that makes sense to me is that the subflow sockets'
> > saddr changes somewhere after it has been binded, but then wouldn't
> > that trigger this warning deterministically?
>
> I agree, if the MPTCP code was modifying the subflow saddr this would
> trigger more consistently. But as far as I can tell, MPTCP isn't
> changing those values.
>
> >
> > How often are you seeing this warning show up in the mptcp syzkaller?
> > Is there a way to run a local patch on the mptcp syzkaller (eg like a
> > patch that prints out extraneous debugging information about
> > icsk_bind2_hash address, tb2 address, and the socket's saddr changes)
> > - if so, how can I do this?
> >
>
> Since the warning has not repeated in the last week, I'm wondering if
> maybe the fuzz tester happened across the locking issue that Paolo noted?
> The lack of determinism makes me suspect some kind of concurrency issue.
>
> I'm running syzkaller on a large-ish datacenter VM with a kernel I
> compiled, so I can add any patches myself and run with those. Setup for
> syzkaller as I've been using it with MPTCP is documented here:
> https://github.com/multipath-tcp/mptcp_net-next/wiki/Testing#syzkaller
>
> If you have a patch with some extra debug output that you can point me to,
> I can apply that to the kernel I'm running.
>

Awesome, I'm glad to hear that the warning is not constantly
happening. I am tidying up the bhash2 locks patch and will submit that
soon. If this warning still shows up after that patch, then I will
send you a local patch with extra debug output to run on top of the
mptcp syzkaller.

Thanks!

> --
> Mat Martineau
> Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes
  2022-06-10 18:09               ` Joanne Koong
@ 2022-06-10 20:15                 ` Mat Martineau
  0 siblings, 0 replies; 18+ messages in thread
From: Mat Martineau @ 2022-06-10 20:15 UTC (permalink / raw)
  To: Joanne Koong
  Cc: netdev, Eric Dumazet, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Paolo Abeni

On Fri, 10 Jun 2022, Joanne Koong wrote:

> On Thu, Jun 9, 2022 at 5:37 PM Mat Martineau
> <mathew.j.martineau@linux.intel.com> wrote:
>>
>> On Thu, 9 Jun 2022, Joanne Koong wrote:
>>
>>> On Wed, Jun 8, 2022 at 1:27 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>
>>>> On Tue, Jun 7, 2022 at 10:33 AM Mat Martineau
>>>> <mathew.j.martineau@linux.intel.com> wrote:
>>>>>
>>>>> On Mon, 6 Jun 2022, Joanne Koong wrote:
>>>>>
>>>>>> On Fri, Jun 3, 2022 at 5:38 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>>>>
>>>>>>> On Fri, Jun 3, 2022 at 11:55 AM Mat Martineau
>>>>>>> <mathew.j.martineau@linux.intel.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, 2 Jun 2022, Joanne Koong wrote:
>>>>>>>>
>>>>>>>>> As syzbot noted [1], there is an inconsistency in the bhash2 table in the
>>>>>>>>> case where a socket's rcv saddr changes after it is binded. (For more
>>>>>>>>> details, please see the commit message of the first patch)
>>>>>>>>>
>>>>>>>>> This patchset fixes that and adds a test that triggers the case where the
>>>>>>>>> sk's rcv saddr changes. The subsequent listen() call should succeed.
>>>>>>>>>
>>>>>>>>> [1] https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> v1 -> v2:
>>>>>>>>> v1: https://lore.kernel.org/netdev/20220601201434.1710931-1-joannekoong@fb.com/
>>>>>>>>> * Mark __inet_bhash2_update_saddr as static
>>>>>>>>>
>>>>>>>>> Joanne Koong (2):
>>>>>>>>>  net: Update bhash2 when socket's rcv saddr changes
>>>>>>>>>  selftests/net: Add sk_bind_sendto_listen test
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Joanne -
>>>>>>>>
>>>>>>>> I've been running my own syzkaller instance with v1 of this fix for a
>>>>>>>> couple of days. Before this patch, syzkaller would trigger the
>>>>>>>> inet_csk_get_port warning a couple of times per hour. After this patch it
>>>>>>>> took two days to show the warning:
>>>>>>>>
>>>>>>>> ------------[ cut here ]------------
>>>>>>>>
>>>>>>>> WARNING: CPU: 0 PID: 9430 at net/ipv4/inet_connection_sock.c:525
>>>>>>>> inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>>>>>>> Modules linked in:
>>>>>>>> CPU: 0 PID: 9430 Comm: syz-executor.5 Not tainted 5.18.0-05016-g433fde5b4119 #3
>>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>>>>>> RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
>>>>>>>> Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 2b 0f db fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 18 0f db fd <0f> 0b e9 70 f9 ff ff e8 0c 0f db fd 0f 0b e9 28 f9 ff ff e8 00 0f
>>>>>>>> RSP: 0018:ffffc90000b5fbc0 EFLAGS: 00010212
>>>>>>>> RAX: 00000000000000e7 RBX: ffff88803c410040 RCX: ffffc9000e419000
>>>>>>>> RDX: 0000000000040000 RSI: ffffffff836f47e8 RDI: ffff88803c4106e0
>>>>>>>> RBP: ffff88810b773840 R08: 0000000000000001 R09: 0000000000000001
>>>>>>>> R10: fffffbfff0f64303 R11: 0000000000000001 R12: 0000000000000000
>>>>>>>> R13: ffff88810605e2f0 R14: ffffffff88606040 R15: 000000000000c1ff
>>>>>>>> FS:  00007fada4d03640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> CR2: 0000001b32e24000 CR3: 00000001016de001 CR4: 0000000000770ef0
>>>>>>>> PKRU: 55555554
>>>>>>>> Call Trace:
>>>>>>>>   <TASK>
>>>>>>>>   inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
>>>>>>>>   inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
>>>>>>>>   mptcp_listen+0x205/0x440 net/mptcp/protocol.c:3564
>>>>>>>>   __sys_listen+0x189/0x260 net/socket.c:1810
>>>>>>>>   __do_sys_listen net/socket.c:1819 [inline]
>>>>>>>>   __se_sys_listen net/socket.c:1817 [inline]
>>>>>>>>   __x64_sys_listen+0x54/0x80 net/socket.c:1817
>>>>>>>>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>>>>>>>   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
>>>>>>>>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>>>>>> RIP: 0033:0x7fada558f92d
>>>>>>>> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
>>>>>>>> RSP: 002b:00007fada4d03028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
>>>>>>>> RAX: ffffffffffffffda RBX: 00007fada56aff60 RCX: 00007fada558f92d
>>>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
>>>>>>>> RBP: 00007fada56000a0 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>>>>>>> R13: 000000000000000b R14: 00007fada56aff60 R15: 00007fada4ce3000
>>>>>>>>   </TASK>
>>>>>>>> irq event stamp: 2078
>>>>>>>> hardirqs last  enabled at (2092): [<ffffffff812f1be3>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
>>>>>>>> hardirqs last disabled at (2103): [<ffffffff812f1bc8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
>>>>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] lock_sock include/net/sock.h:1691 [inline]
>>>>>>>> softirqs last  enabled at (1498): [<ffffffff83807dd4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
>>>>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] spin_lock_bh include/linux/spinlock.h:354 [inline]
>>>>>>>> softirqs last disabled at (1500): [<ffffffff836f425c>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
>>>>>>>> ---[ end trace 0000000000000000 ]---
>>>>>>>>
>>>>>>>>
>>>>>>>> In the full log file it does look like syskaller is doing something
>>>>>>>> strange since it's calling bind, connect, and listen on the same socket:
>>>>>>>>
>>>>>>>> r4 = socket$inet_mptcp(0x2, 0x1, 0x106)
>>>>>>>> bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e23, @empty}, 0x10)
>>>>>>>> connect$inet(r4, &(0x7f0000000040)={0x2, 0x0, @local}, 0x10)
>>>>>>>> listen(r4, 0x0)
>>>>>>>>
>>>>>>>> ...but it is a fuzz tester after all.
>>>>>>>>
>>>>>>>> I've uploaded the full syzkaller logs to this GitHub issue:
>>>>>>>>
>>>>>>>> https://github.com/multipath-tcp/mptcp_net-next/issues/279
>>>>>>>>
>>>>>>>>
>>>>>>>> Not sure yet if this is MPTCP-related. I don't think MPTCP
>>>>>>>> changes anything with the subflow TCP socket bind hashes.
>>>>>>>>
>>>>>>> Hi Mat,
>>>>>>>
>>>>>>> Thanks for bringing this up and for uploading the logs. I will look into this.
>>>>>>>>
>>>>>> Hi Mat,
>>>>>>
>>>>>> I am still trying to configure my local environment for mptcp to repro
>>>>>> + test the fix to verify that it is correct. I think the fix is to add
>>>>>> "inet_bhash2_update_saddr(msk);" to the end of the
>>>>>> "mptcp_copy_inaddrs" function in net/mptcp/protocol.c.  Would you be
>>>>>> able to run an instance on your local syzkaller environment with this
>>>>>> line added to see if that fixes the warning?
>>>>>
>>>>> Hi Joanne -
>>>>>
>>>>> I also investigated that function when trying to figure out why this
>>>>> warning might be happening in MPTCP.
>>>>>
>>>>> In MPTCP, the userspace-facing MPTCP socket (msk) doesn't directly bind or
>>>>> connect. The msk creates and manages TCP subflow sockets (ssk in
>>>>> mptcp_copy_inaddrs()), and passes through bind and connect calls to the
>>>>> subflows. The msk depends on the subflow to do the hash updates, since
>>>>> those subflow sockets are the ones interacting with the inet layer.
>>>>>
>>>>> mptcp_copy_inaddrs() copies the already-hashed addresses and ports from
>>>>> the ssk to the msk, and we only want the ssk in the hash table.
>>>>>
>>>> I see, thanks for the explanation, Mat! I will keep investigating.\
>>>
>>> Are you able to repro this warning locally, Mat?
>>>
>>
>> This message made it to my inbox after hitting 'send' on my last reply to
>> this thread!
>>
>> As I said there, I haven't seen syzkaller trigger this again. It only
>> happened one time after applying the fix, and not at all in the last week
>> running syzkaller (nearly) continuously.
>>
>>> I have a test program that calls:
>>> struct addrinfo hints = {
>>>          .ai_protocol = IPPROTO_TCP,
>>>          .ai_socktype = SOCK_STREAM,
>>>         .ai_family = AF_INET,
>>> };
>>> getaddrinfo("127.0.0.1", "15432", &hints, &addr);
>>> socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
>>> bind(sock_fd, addr->ai_addr, addr->ai_addrlen)
>>> connect(sock_fd, &zeroed_sockaddr, sizeof(zeroed_sockaddr));
>>> listen(sock_fd, 0);
>>>
>>> but I'm unable to trigger this warning in my local environment.
>>>
>>> I'm not understanding how this warning can get triggered
>>> non-deterministically when the socket resides only within the program.
>>> The only theory that makes sense to me is that the subflow sockets'
>>> saddr changes somewhere after it has been binded, but then wouldn't
>>> that trigger this warning deterministically?
>>
>> I agree, if the MPTCP code was modifying the subflow saddr this would
>> trigger more consistently. But as far as I can tell, MPTCP isn't
>> changing those values.
>>
>>>
>>> How often are you seeing this warning show up in the mptcp syzkaller?
>>> Is there a way to run a local patch on the mptcp syzkaller (eg like a
>>> patch that prints out extraneous debugging information about
>>> icsk_bind2_hash address, tb2 address, and the socket's saddr changes)
>>> - if so, how can I do this?
>>>
>>
>> Since the warning has not repeated in the last week, I'm wondering if
>> maybe the fuzz tester happened across the locking issue that Paolo noted?
>> The lack of determinism makes me suspect some kind of concurrency issue.
>>
>> I'm running syzkaller on a large-ish datacenter VM with a kernel I
>> compiled, so I can add any patches myself and run with those. Setup for
>> syzkaller as I've been using it with MPTCP is documented here:
>> https://github.com/multipath-tcp/mptcp_net-next/wiki/Testing#syzkaller
>>
>> If you have a patch with some extra debug output that you can point me to,
>> I can apply that to the kernel I'm running.
>>
>
> Awesome, I'm glad to hear that the warning is not constantly
> happening. I am tidying up the bhash2 locks patch and will submit that
> soon. If this warning still shows up after that patch, then I will
> send you a local patch with extra debug output to run on top of the
> mptcp syzkaller.
>

As luck would have it, the warning occurred again today (but not on a 
MPTCP socket):

------------[ cut here ]------------
WARNING: CPU: 0 PID: 14437 at net/ipv4/inet_connection_sock.c:525 inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
Modules linked in:
CPU: 0 PID: 14437 Comm: syz-executor.5 Not tainted 5.18.0-12154-gb63ea817dc3d #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:inet_csk_get_port+0x938/0xe80 net/ipv4/inet_connection_sock.c:525
Code: ff 48 89 84 24 b0 00 00 00 48 85 c0 0f 84 6a 01 00 00 e8 cb bd da fd 48 8b 6c 24 70 c6 04 24 01 e9 fb fb ff ff e8 b8 bd da fd <0f> 0b e9 70 f9 ff ff e8 ac bd da fd 0f 0b e9 28 f9 ff ff e8 a0 bd
RSP: 0018:ffffc90010d47c50 EFLAGS: 00010212
RAX: 000000000000007b RBX: ffff888028c25600 RCX: ffffc90003ad9000
RDX: 0000000000040000 RSI: ffffffff836ffa48 RDI: ffff888028c25ca0
RBP: ffff888038d6a040 R08: 0000000000000001 R09: 0000000000000001
R10: fffffbfff0f6490b R11: 0000000000000001 R12: 0000000000000000
R13: ffff888106074670 R14: ffffffff88608f40 R15: 0000000000000000
FS:  00007f6200b61640(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020003700 CR3: 0000000038d50004 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
  <TASK>
  inet_csk_listen_start+0x143/0x3d0 net/ipv4/inet_connection_sock.c:1178
  inet_listen+0x22f/0x650 net/ipv4/af_inet.c:228
  __sys_listen+0x189/0x260 net/socket.c:1810
  __do_sys_listen net/socket.c:1819 [inline]
  __se_sys_listen net/socket.c:1817 [inline]
  __x64_sys_listen+0x54/0x80 net/socket.c:1817
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f62013ed92d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6200b61028 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 00007f620150df60 RCX: 00007f62013ed92d
RDX: 0000000000000000 RSI: 00000000fffffffd RDI: 0000000000000004
RBP: 00007f620145e0a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007f620150df60 R15: 00007f6200b41000
  </TASK>
irq event stamp: 2750
hardirqs last  enabled at (2762): [<ffffffff812f3813>] __up_console_sem+0xb3/0xd0 kernel/printk/printk.c:291
hardirqs last disabled at (2773): [<ffffffff812f37f8>] __up_console_sem+0x98/0xd0 kernel/printk/printk.c:289
softirqs last  enabled at (2198): [<ffffffff83812ff4>] lock_sock include/net/sock.h:1691 [inline]
softirqs last  enabled at (2198): [<ffffffff83812ff4>] inet_listen+0x94/0x650 net/ipv4/af_inet.c:199
softirqs last disabled at (2200): [<ffffffff836ff4bc>] spin_lock_bh include/linux/spinlock.h:354 [inline]
softirqs last disabled at (2200): [<ffffffff836ff4bc>] inet_csk_get_port+0x3ac/0xe80 net/ipv4/inet_connection_sock.c:483
---[ end trace 0000000000000000 ]---


It looks like it triggered with this from the full syzkaller log:

"""
16:42:38 executing program 5:
r0 = socket$inet_mptcp(0x2, 0x1, 0x106)
r1 = dup3(r0, 0xffffffffffffffff, 0x80000)
bind$inet(r1, &(0x7f00000001c0)={0x2, 0x4e23, @empty}, 0x10)
bind$inet(r0, &(0x7f0000000000)={0x2, 0x4e21, @broadcast}, 0x10)
connect$inet(r0, &(0x7f0000000800)={0x2, 0x4e21, @local}, 0x10)
r2 = socket$inet_tcp(0x2, 0x1, 0x0)
ioctl$F2FS_IOC_MOVE_RANGE(r2, 0xc020f509, &(0x7f0000000200)={<r3=>r2, 0x4, 0xeea5, 0x9bf})
ioctl$sock_SIOCSIFVLAN_GET_VLAN_VID_CMD(r3, 0x8983, &(0x7f0000000140))
ioctl$sock_SIOCGIFVLAN_GET_VLAN_REALDEV_NAME_CMD(r0, 0x8982, &(0x7f0000000180)={0x8, 'veth1_to_bridge\x00', {'syzkaller0\x00'}})
bind$inet(r2, &(0x7f0000000000)={0x2, 0x4e22, @empty}, 0x10)
connect$inet(r2, &(0x7f0000000040)={0x2, 0x4e23, @local}, 0x10)
setsockopt$SO_TIMESTAMPING(r1, 0x1, 0x41, &(0x7f0000000240)=0x10, 0x4)
sendmmsg$inet(r2, &(0x7f00000000c0)=[{{&(0x7f0000000080)={0x2, 0x0, @empty}, 0x10, &(0x7f00000006c0)=[{&(0x7f0000000380)="01", 0x1}, {0x0}, {&(0x7f0000000600)}], 0x3}}, {{0x0, 0x0, 0x0}}], 0x2, 0x4080)
bind$inet(0xffffffffffffffff, &(0x7f0000000000)={0x2, 0x4e22, @empty}, 0x10)
connect$inet(0xffffffffffffffff, &(0x7f0000000040)={0x2, 0x4e22, @empty}, 0x10)
sendmmsg$inet(0xffffffffffffffff, &(0x7f0000003700)=[{{&(0x7f0000000340)={0x2, 0x0, @initdev={0xac, 0x1e, 0x0, 0x0}}, 0x10, &(0x7f00000006c0)=[{&(0x7f0000000380)="01", 0x7ffff000}, {0x0}, {&(0x7f0000000600)}], 0x3}}, {{0x0, 0x0, 0x0}}], 0x2, 0x4080)
recvfrom(0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
setsockopt$inet_tcp_TCP_MD5SIG(0xffffffffffffffff, 0x6, 0xe, &(0x7f0000000280)={@in={{0x2, 0x4e20, @multicast2}}, 0x0, 0x0, 0x18, 0x0, "ead8cdf27d67e3a674fe37c3ad427a4abf5a4925fe88866d975cdf4c5630aacf9cb63d3a96005215ef90dada225dc104e2842170476791560ebc77065d4fc5b3f42dbb9490dfef773dde3e5867e96fdf"}, 0xd8)
recvfrom(r2, 0x0, 0x0, 0x0, 0x0, 0x0)
listen(r2, 0xfffffffd)
"""

So there was some concurrent MPTCP activity but the stack trace is for the 
'r2' TCP socket.

I'll watch for the locking patch. Thanks!

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-06-10 20:15 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-02 16:50 [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Joanne Koong
2022-06-02 16:51 ` [PATCH net-next v2 1/2] net: " Joanne Koong
2022-06-07  8:33   ` Paolo Abeni
2022-06-07 17:10     ` Mat Martineau
2022-06-07 20:24     ` Joanne Koong
2022-06-08  7:35       ` Paolo Abeni
2022-06-08 17:47         ` Joanne Koong
2022-06-02 16:51 ` [PATCH net-next v2 2/2] selftests/net: Add sk_bind_sendto_listen test Joanne Koong
2022-06-03 18:54 ` [PATCH net-next v2 0/2] Update bhash2 when socket's rcv saddr changes Mat Martineau
2022-06-04  0:38   ` Joanne Koong
2022-06-07  0:02     ` Joanne Koong
2022-06-07 17:31       ` Mat Martineau
2022-06-08 20:27         ` Joanne Koong
2022-06-09 23:52           ` Joanne Koong
2022-06-10  0:37             ` Mat Martineau
2022-06-10 18:09               ` Joanne Koong
2022-06-10 20:15                 ` Mat Martineau
2022-06-09 23:59           ` Mat Martineau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.