* [PATCH v2 net 0/4] dccp/tcp: Fix bhash2 issues related to WARN_ON() in inet_csk_get_port().
@ 2022-11-16 22:28 ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-16 22:28 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern
  Cc: Arnaldo Carvalho de Melo, Joanne Koong, Martin KaFai Lau,
	Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, dccp

syzkaller was hitting the WARN_ON() in inet_csk_get_port() that is fixed by
the 4th patch.  It fired because we forgot to fix up the bhash2 bucket when
connect() fails in __inet_stream_connect() for a socket bound to a wildcard
address.
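As a rough illustration of why a stale bhash2 bucket trips the WARN_ON(), the userspace sketch below assumes bhash2 is simply a table keyed by (port, saddr) and that each socket remembers the bucket it was linked into, playing the role of icsk_bind2_hash.  Every name in it (struct model_sock, bucket_index() and so on) is hypothetical, and the hash function is arbitrary rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS 64

/* Hypothetical socket model: source port, source address (0 is the
 * wildcard) and the bucket it is linked into (stand-in for
 * icsk_bind2_hash). */
struct model_sock {
	uint16_t port;
	uint32_t saddr;
	int bind2_bucket;
};

/* Arbitrary illustrative hash over (port, saddr), not the kernel's. */
static int bucket_index(uint16_t port, uint32_t saddr)
{
	return (int)((port * 31u + saddr) % NBUCKETS);
}

static void model_bind(struct model_sock *sk, uint16_t port, uint32_t saddr)
{
	sk->port = port;
	sk->saddr = saddr;
	sk->bind2_bucket = bucket_index(port, saddr);
}

/* Change saddr and move the socket to the matching bucket, roughly what
 * inet_bhash2_update_saddr() does for bhash2. */
static void model_update_saddr(struct model_sock *sk, uint32_t saddr)
{
	assert(sk->bind2_bucket >= 0);  /* only linked sockets are moved */
	sk->saddr = saddr;
	sk->bind2_bucket = bucket_index(sk->port, saddr);
}

/* Analogue of WARN_ON(icsk_bind2_hash != tb2): the recorded bucket must
 * be the one the current (port, saddr) hashes to. */
static bool model_bucket_consistent(const struct model_sock *sk)
{
	return sk->bind2_bucket == bucket_index(sk->port, sk->saddr);
}

/* connect() failure path resetting saddr to the wildcard.  Without the
 * fix-up the bucket is left stale; with it the socket is moved back. */
static void model_connect_failed(struct model_sock *sk, bool fixup)
{
	if (fixup)
		model_update_saddr(sk, 0);
	else
		sk->saddr = 0;
}
```

For port 10000 and 127.0.0.1 (0x7f000001 in this model), binding to the wildcard, moving to 127.0.0.1 and then failing without the fix-up leaves model_bucket_consistent() false, which is roughly the state a later listen() would warn about.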

There was a similar report [0], but its repro does not fire the WARN_ON() due
to inconsistent error handling.

When connect() fails for a socket bound to a wildcard address, saddr may or
may not be reset depending on where the failure happens.  When we fail in
__inet_stream_connect(), sk->sk_prot->disconnect() resets saddr.  OTOH, if
(dccp|tcp)_v[46]_connect() fails after inet6?_hash_connect(), we forget to
reset saddr.
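The inconsistency can be sketched as a tiny model of the pre-patch error paths.  It assumes only two failure points and reduces the socket to its source address; the enum, struct and function names are hypothetical, and model_disconnect() merely stands in for sk->sk_prot->disconnect().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for where connect() can fail. */
enum fail_point {
	FAIL_AFTER_HASH_CONNECT, /* late error in (dccp|tcp)_v[46]_connect() */
	FAIL_IN_STREAM_CONNECT,  /* error later in __inet_stream_connect() */
};

/* Hypothetical socket model: only the source address; 0 is the wildcard. */
struct model_sock {
	uint32_t saddr;
};

/* Stand-in for sk->sk_prot->disconnect(), which resets saddr. */
static void model_disconnect(struct model_sock *sk)
{
	sk->saddr = 0;
}

/* Pre-patch behaviour: return the saddr left on a wildcard-bound socket
 * after connect() fails at the given point. */
static uint32_t failed_connect_saddr(enum fail_point where, uint32_t routed)
{
	struct model_sock sk = { .saddr = 0 };

	assert(routed != 0);  /* routing picked a concrete local address */
	sk.saddr = routed;

	if (where == FAIL_IN_STREAM_CONNECT)
		model_disconnect(&sk);
	/* FAIL_AFTER_HASH_CONNECT: the error path returns without a reset. */

	return sk.saddr;
}
```

The same failed connect() thus leaves either the wildcard or the routed address behind, depending only on which step failed.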

The 1st patch fixes this inconsistent error handling, and the following
patches fix the bhash2 issues, including the WARN_ON().

Note that one issue remains: we reset saddr without checking whether there
are conflicting sockets in bhash and bhash2.  That should be addressed in
another series.

See [1][2] for the previous discussion.

[0]: https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
[1]: https://lore.kernel.org/netdev/20221029001249.86337-1-kuniyu@amazon.com/
[2]: https://lore.kernel.org/netdev/20221103172419.20977-1-kuniyu@amazon.com/


Changes:
  v2:
    * Add patches 2-4

  v1: [2]


Kuniyuki Iwashima (4):
  dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
  dccp/tcp: Remove NULL check for prev_saddr in
    inet_bhash2_update_saddr().
  dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  dccp/tcp: Fixup bhash2 bucket when connect() fails.

 include/net/inet_hashtables.h |  3 +-
 net/dccp/ipv4.c               | 23 +++---------
 net/dccp/ipv6.c               | 24 +++---------
 net/dccp/proto.c              |  3 +-
 net/ipv4/af_inet.c            | 11 +-----
 net/ipv4/inet_hashtables.c    | 70 ++++++++++++++++++++++++++++++-----
 net/ipv4/tcp.c                |  3 +-
 net/ipv4/tcp_ipv4.c           | 21 +++--------
 net/ipv6/tcp_ipv6.c           | 20 ++--------
 9 files changed, 85 insertions(+), 93 deletions(-)

-- 
2.30.2



* [PATCH v2 net 1/4] dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
@ 2022-11-16 22:28   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-16 22:28 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern
  Cc: Arnaldo Carvalho de Melo, Joanne Koong, Martin KaFai Lau,
	Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, dccp

When connect() is called on a socket bound to the wildcard address,
we change the socket's saddr to a local address.  If the socket
fails to connect() to the destination, we have to reset the saddr.

However, when an error occurs after inet6?_hash_connect() in
(dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
the socket bound to the address.

From the user's point of view, whether saddr is reset or not depends
on which errno connect() returns.  Let's fix this inconsistent behaviour.
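The rule this patch adds to the late_failure paths can be modelled in userspace as follows.  This is a sketch under stated assumptions: MODEL_BINDADDR_LOCK and struct model_sock are hypothetical stand-ins for SOCK_BINDADDR_LOCK and sk->sk_userlocks, zeroing saddr approximates inet_reset_saddr(), and only the address lock that bind() sets for a non-wildcard address is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for SOCK_BINDADDR_LOCK. */
#define MODEL_BINDADDR_LOCK 0x4

struct model_sock {
	uint32_t saddr;      /* 0 is the wildcard address */
	unsigned userlocks;  /* stand-in for sk->sk_userlocks */
};

/* bind(): only a non-wildcard address pins saddr (other locks ignored). */
static void model_bind(struct model_sock *sk, uint32_t addr)
{
	sk->saddr = addr;
	sk->userlocks = addr ? MODEL_BINDADDR_LOCK : 0;
}

/* connect(): a wildcard-bound socket takes the routed source address. */
static void model_connect_start(struct model_sock *sk, uint32_t routed)
{
	assert(routed != 0);
	if (sk->saddr == 0)
		sk->saddr = routed;
}

/* late_failure after this patch: reset saddr unless the user pinned it. */
static void model_late_failure(struct model_sock *sk)
{
	if (!(sk->userlocks & MODEL_BINDADDR_LOCK))
		sk->saddr = 0;  /* approximates inet_reset_saddr() */
}
```

A socket bound to 0.0.0.0 returns to the wildcard after a late failure, while one bound to 127.0.0.1 keeps its address, matching what the disconnect() path already does.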

Note that after this patch, the repro [0] triggers the WARN_ON() in
inet_csk_get_port() again.  This patch is not buggy; rather, it fixes a
bug that was papering over the bhash2 bug, which needs a separate fix.

For the record, the repro causes -EADDRNOTAVAIL in inet6?_hash_connect()
with the following sequence:

  from socket import *

  s1 = socket()
  s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
  s1.bind(('127.0.0.1', 10000))
  s1.connect(('127.0.0.1', 10000))
  # or s1.sendto(b'hello', MSG_FASTOPEN, ('127.0.0.1', 10000))

  s2 = socket()
  s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
  s2.bind(('0.0.0.0', 10000))
  try:
      s2.connect(('127.0.0.1', 10000))
  except OSError:  # -EADDRNOTAVAIL
      pass

  s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
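Assuming s1 ends up connected to itself (127.0.0.1:10000 to 127.0.0.1:10000), once s2 picks 127.0.0.1 as its source address its 4-tuple would duplicate s1's, so connect() on the already-bound port fails with -EADDRNOTAVAIL.  The sketch below models only that duplicate 4-tuple check with a hypothetical fixed-size table; the kernel's established-table check reached from inet6?_hash_connect() also deals with hashing and TIME_WAIT sockets, which are ignored here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-tuple of an established connection. */
struct four_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

#define MAX_EST 16

/* Hypothetical stand-in for the established table. */
struct est_table {
	struct four_tuple ent[MAX_EST];
	size_t n;
};

static int tuple_eq(const struct four_tuple *a, const struct four_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/* Insert t unless an identical 4-tuple is already established, in which
 * case fail with -EADDRNOTAVAIL. */
static int est_insert(struct est_table *tbl, struct four_tuple t)
{
	for (size_t i = 0; i < tbl->n; i++)
		if (tuple_eq(&tbl->ent[i], &t))
			return -EADDRNOTAVAIL;
	assert(tbl->n < MAX_EST);
	tbl->ent[tbl->n++] = t;
	return 0;
}
```

Inserting s1's self-connected tuple succeeds, and inserting the identical tuple for s2 is rejected.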

[0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09

Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/dccp/ipv4.c     | 2 ++
 net/dccp/ipv6.c     | 2 ++
 net/ipv4/tcp_ipv4.c | 2 ++
 net/ipv6/tcp_ipv6.c | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 713b7b8dad7e..40640c26680e 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -157,6 +157,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	 * This unhashes the socket and releases the local port, if necessary.
 	 */
 	dccp_set_state(sk, DCCP_CLOSED);
+	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+		inet_reset_saddr(sk);
 	ip_rt_put(rt);
 	sk->sk_route_caps = 0;
 	inet->inet_dport = 0;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index e57b43006074..626166cb6d7e 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -985,6 +985,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 
 late_failure:
 	dccp_set_state(sk, DCCP_CLOSED);
+	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+		inet_reset_saddr(sk);
 	__sk_dst_reset(sk);
 failure:
 	inet->inet_dport = 0;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 87d440f47a70..6a3a732b584d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -343,6 +343,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	 * if necessary.
 	 */
 	tcp_set_state(sk, TCP_CLOSE);
+	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+		inet_reset_saddr(sk);
 	ip_rt_put(rt);
 	sk->sk_route_caps = 0;
 	inet->inet_dport = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2a3f9296df1e..81b396e5cf79 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -359,6 +359,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 
 late_failure:
 	tcp_set_state(sk, TCP_CLOSE);
+	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+		inet_reset_saddr(sk);
 failure:
 	inet->inet_dport = 0;
 	sk->sk_route_caps = 0;
-- 
2.30.2



* [PATCH v2 net 2/4] dccp/tcp: Remove NULL check for prev_saddr in inet_bhash2_update_saddr().
@ 2022-11-16 22:28   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-16 22:28 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern
  Cc: Arnaldo Carvalho de Melo, Joanne Koong, Martin KaFai Lau,
	Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, dccp

When we call inet_bhash2_update_saddr(), prev_saddr is always non-NULL.
Let's remove the unnecessary test.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/inet_hashtables.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 033bf3c2538f..d745f962745e 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -877,13 +877,10 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
 
 	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
 
-	if (prev_saddr) {
-		spin_lock_bh(&prev_saddr->lock);
-		__sk_del_bind2_node(sk);
-		inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep,
-					  inet_csk(sk)->icsk_bind2_hash);
-		spin_unlock_bh(&prev_saddr->lock);
-	}
+	spin_lock_bh(&prev_saddr->lock);
+	__sk_del_bind2_node(sk);
+	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
+	spin_unlock_bh(&prev_saddr->lock);
 
 	spin_lock_bh(&head2->lock);
 	tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
-- 
2.30.2



* [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
@ 2022-11-16 22:28   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-16 22:28 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern
  Cc: Arnaldo Carvalho de Melo, Joanne Koong, Martin KaFai Lau,
	Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, dccp

Currently, we update saddr before calling inet_bhash2_update_saddr(), so
another thread iterating over the bhash2 bucket might see an inconsistent
address.

Let's update saddr after unlinking sk from the old bhash2 bucket.
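The ordering change can be illustrated with a single-socket model that counts how many intermediate states a concurrent bhash2 walker could observe in which the address does not match the bucket the socket is linked into.  Locking and the preallocation of the new tb2 bucket are deliberately left out; the names and the hash are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS 8

/* Hypothetical socket model: port, source address (0 is the wildcard)
 * and the bhash2 bucket it is linked into (-1 when unlinked). */
struct model_sock {
	uint16_t port;
	uint32_t saddr;
	int bucket;
};

/* Arbitrary illustrative hash, not the kernel's. */
static int bucket_index(uint16_t port, uint32_t saddr)
{
	return (int)((port + saddr) % NBUCKETS);
}

/* What a concurrent walker of bhash2 relies on: a linked socket sits in
 * the bucket that its current (port, saddr) hashes to. */
static bool visible_state_consistent(const struct model_sock *sk)
{
	return sk->bucket < 0 || sk->bucket == bucket_index(sk->port, sk->saddr);
}

/* Before this patch: saddr is written while sk is still in the old
 * bucket.  Returns the number of observable inconsistent states. */
static int update_saddr_old_order(struct model_sock *sk, uint32_t saddr)
{
	int bad = 0;

	assert(sk->bucket >= 0);  /* only already-bound sockets are modelled */
	sk->saddr = saddr;                          /* 1. update address */
	bad += !visible_state_consistent(sk);
	sk->bucket = -1;                            /* 2. unlink from old bucket */
	bad += !visible_state_consistent(sk);
	sk->bucket = bucket_index(sk->port, saddr); /* 3. link into new bucket */
	bad += !visible_state_consistent(sk);
	return bad;
}

/* After this patch: unlink first, then update saddr, then relink. */
static int update_saddr_new_order(struct model_sock *sk, uint32_t saddr)
{
	int bad = 0;

	assert(sk->bucket >= 0);
	sk->bucket = -1;                            /* 1. unlink from old bucket */
	bad += !visible_state_consistent(sk);
	sk->saddr = saddr;                          /* 2. update address */
	bad += !visible_state_consistent(sk);
	sk->bucket = bucket_index(sk->port, saddr); /* 3. link into new bucket */
	bad += !visible_state_consistent(sk);
	return bad;
}
```

With port 10000 moving from the wildcard to 127.0.0.1 (0x7f000001), the old order exposes one inconsistent state in this model and the new order none.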

Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet_hashtables.h |  2 +-
 net/dccp/ipv4.c               | 22 ++++------------------
 net/dccp/ipv6.c               | 23 ++++-------------------
 net/ipv4/af_inet.c            | 11 +----------
 net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c           | 20 ++++----------------
 net/ipv6/tcp_ipv6.c           | 19 +++----------------
 7 files changed, 45 insertions(+), 83 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 3af1e927247d..ba06e8b52264 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
  * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
  * rcv_saddr field should already have been updated when this is called.
  */
-int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
+int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
 
 void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 		    struct inet_bind2_bucket *tb2, unsigned short port);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 40640c26680e..95e376e3b911 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
 int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
-	struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
-	__be32 daddr, nexthop, prev_sk_rcv_saddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
 	__be16 orig_sport, orig_dport;
+	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	int err;
@@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		daddr = fl4->daddr;
 
 	if (inet->inet_saddr == 0) {
-		if (inet_csk(sk)->icsk_bind2_hash) {
-			prev_addr_hashbucket =
-				inet_bhashfn_portaddr(&dccp_hashinfo, sk,
-						      sock_net(sk),
-						      inet->inet_num);
-			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
-		}
-		inet->inet_saddr = fl4->saddr;
-	}
-
-	sk_rcv_saddr_set(sk, inet->inet_saddr);
-
-	if (prev_addr_hashbucket) {
-		err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+		err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
 		if (err) {
-			inet->inet_saddr = 0;
-			sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
 			ip_rt_put(rt);
 			return err;
 		}
+	} else {
+		sk_rcv_saddr_set(sk, inet->inet_saddr);
 	}
 
 	inet->inet_dport = usin->sin_port;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 626166cb6d7e..94c101ed57a9 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	}
 
 	if (saddr == NULL) {
-		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
-		struct in6_addr prev_v6_rcv_saddr;
-
-		if (icsk->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
-								     sk, sock_net(sk),
-								     inet->inet_num);
-			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
-		}
-
 		saddr = &fl6.saddr;
-		sk->sk_v6_rcv_saddr = *saddr;
-
-		if (prev_addr_hashbucket) {
-			err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
-			if (err) {
-				sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
-				goto failure;
-			}
-		}
+
+		err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
+		if (err)
+			goto failure;
 	}
 
 	/* set the source address */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 4728087c42a5..0da679411330 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
 
 static int inet_sk_reselect_saddr(struct sock *sk)
 {
-	struct inet_bind_hashbucket *prev_addr_hashbucket;
 	struct inet_sock *inet = inet_sk(sk);
 	__be32 old_saddr = inet->inet_saddr;
 	__be32 daddr = inet->inet_daddr;
@@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 		return 0;
 	}
 
-	prev_addr_hashbucket =
-		inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
-				      sock_net(sk), inet->inet_num);
-
-	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
-
-	err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+	err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
 	if (err) {
-		inet->inet_saddr = old_saddr;
-		inet->inet_rcv_saddr = old_saddr;
 		ip_rt_put(rt);
 		return err;
 	}
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d745f962745e..dcb6bc918966 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
 	return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
 }
 
-int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
+static void inet_update_saddr(struct sock *sk, void *saddr, int family)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (family == AF_INET6) {
+		sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
+	} else
+#endif
+	{
+		inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
+		sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
+	}
+}
+
+int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
 {
 	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
 	struct inet_bind2_bucket *tb2, *new_tb2;
@@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
 	int port = inet_sk(sk)->inet_num;
 	struct net *net = sock_net(sk);
 
+	if (!inet_csk(sk)->icsk_bind2_hash) {
+		/* Not bind()ed before. */
+		inet_update_saddr(sk, saddr, family);
+		return 0;
+	}
+
 	/* Allocate a bind2 bucket ahead of time to avoid permanently putting
 	 * the bhash2 table in an inconsistent state if a new tb2 bucket
 	 * allocation fails.
@@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
 	if (!new_tb2)
 		return -ENOMEM;
 
+	/* Unlink first not to show the wrong address for other threads. */
 	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
 
-	spin_lock_bh(&prev_saddr->lock);
+	spin_lock_bh(&head2->lock);
 	__sk_del_bind2_node(sk);
 	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
-	spin_unlock_bh(&prev_saddr->lock);
+	spin_unlock_bh(&head2->lock);
+
+	inet_update_saddr(sk, saddr, family);
+
+	/* Update bhash2 bucket. */
+	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
 
 	spin_lock_bh(&head2->lock);
 	tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6a3a732b584d..23dd7e9df2d5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
 /* This will initiate an outgoing connection. */
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
-	struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
 	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
 	struct inet_timewait_death_row *tcp_death_row;
-	__be32 daddr, nexthop, prev_sk_rcv_saddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct ip_options_rcu *inet_opt;
 	struct net *net = sock_net(sk);
 	__be16 orig_sport, orig_dport;
+	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	int err;
@@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
 
 	if (!inet->inet_saddr) {
-		if (inet_csk(sk)->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
-								     sk, net, inet->inet_num);
-			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
-		}
-		inet->inet_saddr = fl4->saddr;
-	}
-
-	sk_rcv_saddr_set(sk, inet->inet_saddr);
-
-	if (prev_addr_hashbucket) {
-		err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+		err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
 		if (err) {
-			inet->inet_saddr = 0;
-			sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
 			ip_rt_put(rt);
 			return err;
 		}
+	} else {
+		sk_rcv_saddr_set(sk, inet->inet_saddr);
 	}
 
 	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 81b396e5cf79..2f3ca3190d26 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
 
 	if (!saddr) {
-		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
-		struct in6_addr prev_v6_rcv_saddr;
-
-		if (icsk->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
-								     sk, net, inet->inet_num);
-			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
-		}
 		saddr = &fl6.saddr;
-		sk->sk_v6_rcv_saddr = *saddr;
 
-		if (prev_addr_hashbucket) {
-			err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
-			if (err) {
-				sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
-				goto failure;
-			}
-		}
+		err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
+		if (err)
+			goto failure;
 	}
 
 	/* set the source address */
-- 
2.30.2


+	__be32 daddr, nexthop;
 	struct flowi4 *fl4;
 	struct rtable *rt;
 	int err;
@@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
 
 	if (!inet->inet_saddr) {
-		if (inet_csk(sk)->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
-								     sk, net, inet->inet_num);
-			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
-		}
-		inet->inet_saddr = fl4->saddr;
-	}
-
-	sk_rcv_saddr_set(sk, inet->inet_saddr);
-
-	if (prev_addr_hashbucket) {
-		err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
+		err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
 		if (err) {
-			inet->inet_saddr = 0;
-			sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
 			ip_rt_put(rt);
 			return err;
 		}
+	} else {
+		sk_rcv_saddr_set(sk, inet->inet_saddr);
 	}
 
 	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 81b396e5cf79..2f3ca3190d26 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
 
 	if (!saddr) {
-		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
-		struct in6_addr prev_v6_rcv_saddr;
-
-		if (icsk->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
-								     sk, net, inet->inet_num);
-			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
-		}
 		saddr = &fl6.saddr;
-		sk->sk_v6_rcv_saddr = *saddr;
 
-		if (prev_addr_hashbucket) {
-			err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
-			if (err) {
-				sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
-				goto failure;
-			}
-		}
+		err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
+		if (err)
+			goto failure;
 	}
 
 	/* set the source address */
-- 
2.30.2
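
The reworked inet_bhash2_update_saddr() in the diff above allocates a spare bind2 bucket first, unlinks sk from its old bhash2 bucket, updates saddr, and only then relinks sk into the bucket for the new address. A minimal userspace sketch of that ordering follows, assuming a toy single-threaded hash table. Every toy_* name is hypothetical, locking is omitted, and this is not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: a bhash2-like table keyed by (addr, port).
 * All toy_* names are hypothetical; there is no locking here. */
#define TOY_NBUCKETS 16

struct toy_bucket {
	uint32_t addr;
	uint16_t port;
	int refcnt;                 /* sockets linked into this bucket */
	struct toy_bucket *next;    /* hash-chain link */
};

struct toy_sock {
	uint32_t saddr;             /* stands in for inet_rcv_saddr */
	uint16_t port;              /* stands in for inet_num */
	struct toy_bucket *bind2;   /* stands in for icsk_bind2_hash */
};

static struct toy_bucket *toy_table[TOY_NBUCKETS];

static unsigned int toy_hash(uint32_t addr, uint16_t port)
{
	return (addr ^ port) % TOY_NBUCKETS;
}

static struct toy_bucket *toy_find(uint32_t addr, uint16_t port)
{
	struct toy_bucket *b;

	for (b = toy_table[toy_hash(addr, port)]; b; b = b->next)
		if (b->addr == addr && b->port == port)
			return b;
	return NULL;
}

/* Unlink sk and free its bucket once no socket references it. */
static void toy_unlink(struct toy_sock *sk)
{
	struct toy_bucket *b = sk->bind2;
	struct toy_bucket **pp = &toy_table[toy_hash(b->addr, b->port)];

	sk->bind2 = NULL;
	if (--b->refcnt > 0)
		return;
	while (*pp != b)
		pp = &(*pp)->next;
	*pp = b->next;
	free(b);
}

/* Link sk into the bucket for (sk->saddr, sk->port); 'spare' is used
 * only if that bucket does not exist yet.  Returns 1 if it was used. */
static int toy_link(struct toy_sock *sk, struct toy_bucket *spare)
{
	unsigned int h = toy_hash(sk->saddr, sk->port);
	struct toy_bucket *b = toy_find(sk->saddr, sk->port);
	int used = 0;

	if (!b) {
		b = spare;
		b->addr = sk->saddr;
		b->port = sk->port;
		b->refcnt = 0;
		b->next = toy_table[h];
		toy_table[h] = b;
		used = 1;
	}
	b->refcnt++;
	sk->bind2 = b;
	return used;
}

/* Allocate first, unlink from the old bucket, update saddr, relink. */
static int toy_update_saddr(struct toy_sock *sk, uint32_t new_saddr)
{
	struct toy_bucket *spare;

	if (!sk->bind2) {           /* not bound: only the field changes */
		sk->saddr = new_saddr;
		return 0;
	}
	spare = malloc(sizeof(*spare));
	if (!spare)
		return -ENOMEM;     /* saddr and table left untouched */
	toy_unlink(sk);
	sk->saddr = new_saddr;
	if (!toy_link(sk, spare))
		free(spare);        /* target bucket already existed */
	return 0;
}
```

Because the spare bucket is allocated before anything is unlinked, a failed allocation returns -ENOMEM with both saddr and the table unchanged, which is what the existing comment about allocating ahead of time guards against.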

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails.
@ 2022-11-16 22:28   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-16 22:28 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern
  Cc: Arnaldo Carvalho de Melo, Joanne Koong, Martin KaFai Lau,
	Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima,
	Kuniyuki Iwashima, netdev, dccp, syzbot

If connect() fails for a socket bound to a wildcard address, we reset
only saddr and keep the port.  In that case, we also have to fix up
the bhash2 bucket; otherwise, the socket stays linked in a bucket
whose address no longer matches its saddr.

Also, listen() on such a socket fires the WARN_ON() in
inet_csk_get_port(). [0]

Note that if the system runs out of memory, we give up fixing the
bucket and instead unlink sk from both bhash and bhash2 with
inet_put_port().
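
The failure mode described above reduces to one invariant: a socket still linked into bhash2 must sit in the bucket that matches its current saddr, which is roughly what the WARN_ON() in inet_csk_get_port() checks on listen(). The sketch below models only that state, assuming a toy struct where bucket_addr stands in for the address of icsk_bind2_hash and bindaddr_locked for SOCK_BINDADDR_LOCK. The toy_* names are hypothetical, and the port-only bhash table is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the state listen() sanity-checks; names are hypothetical. */
struct toy_sock {
	uint32_t saddr;          /* inet_rcv_saddr analogue */
	uint16_t port;           /* inet_num analogue; 0 means unbound */
	int hashed;              /* still linked into bhash/bhash2? */
	uint32_t bucket_addr;    /* address of the bhash2 bucket sk sits in */
	int bindaddr_locked;     /* SOCK_BINDADDR_LOCK analogue */
};

/* Rough analogue of the WARN_ON(): a hashed sk must match its bucket. */
static int toy_listen_ok(const struct toy_sock *sk)
{
	return !sk->hashed || sk->bucket_addr == sk->saddr;
}

/* Before the fix: only saddr is reset; the bucket is left as-is. */
static void toy_reset_saddr_old(struct toy_sock *sk)
{
	if (!sk->bindaddr_locked)
		sk->saddr = 0;
}

/* After the fix: reset saddr and move sk to the wildcard bucket.  If the
 * spare bucket cannot be allocated (alloc_ok == 0), unhash sk and drop
 * the port instead, roughly as inet_put_port() does. */
static void toy_reset_saddr_new(struct toy_sock *sk, int alloc_ok)
{
	if (sk->bindaddr_locked)
		return;
	sk->saddr = 0;
	if (!sk->hashed)
		return;         /* never bound: nothing to fix up */
	if (!alloc_ok) {
		sk->hashed = 0;
		sk->port = 0;
		return;
	}
	sk->bucket_addr = 0;    /* relinked into the (0, port) bucket */
}
```

With the old reset, a socket that connect() moved into the (127.0.0.1, port) bucket ends up with saddr 0 but the old bucket, so the listen-time check fails; both branches of the new reset keep the invariant.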

[0]:
WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Modules linked in:
CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
 inet_listen (net/ipv4/af_inet.c:228)
 __sys_listen (net/socket.c:1810)
 __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
 do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7f8ac051de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
 </TASK>

Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet_hashtables.h |  1 +
 net/dccp/ipv4.c               |  3 +--
 net/dccp/ipv6.c               |  3 +--
 net/dccp/proto.c              |  3 +--
 net/ipv4/inet_hashtables.c    | 38 +++++++++++++++++++++++++++++++----
 net/ipv4/tcp.c                |  3 +--
 net/ipv4/tcp_ipv4.c           |  3 +--
 net/ipv6/tcp_ipv6.c           |  3 +--
 8 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ba06e8b52264..69174093078f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -282,6 +282,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
  * rcv_saddr field should already have been updated when this is called.
  */
 int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
+void inet_bhash2_reset_saddr(struct sock *sk);
 
 void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 		    struct inet_bind2_bucket *tb2, unsigned short port);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 95e376e3b911..b780827f5e0a 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -143,8 +143,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	 * This unhashes the socket and releases the local port, if necessary.
 	 */
 	dccp_set_state(sk, DCCP_CLOSED);
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
-		inet_reset_saddr(sk);
+	inet_bhash2_reset_saddr(sk);
 	ip_rt_put(rt);
 	sk->sk_route_caps = 0;
 	inet->inet_dport = 0;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 94c101ed57a9..602f3432d80b 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -970,8 +970,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 
 late_failure:
 	dccp_set_state(sk, DCCP_CLOSED);
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
-		inet_reset_saddr(sk);
+	inet_bhash2_reset_saddr(sk);
 	__sk_dst_reset(sk);
 failure:
 	inet->inet_dport = 0;
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index c548ca3e9b0e..85e35c5e8890 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -279,8 +279,7 @@ int dccp_disconnect(struct sock *sk, int flags)
 
 	inet->inet_dport = 0;
 
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
-		inet_reset_saddr(sk);
+	inet_bhash2_reset_saddr(sk);
 
 	sk->sk_shutdown = 0;
 	sock_reset_flag(sk, SOCK_DONE);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index dcb6bc918966..d24a04815f20 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -871,7 +871,7 @@ static void inet_update_saddr(struct sock *sk, void *saddr, int family)
 	}
 }
 
-int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
+static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
 {
 	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
 	struct inet_bind2_bucket *tb2, *new_tb2;
@@ -882,7 +882,11 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
 
 	if (!inet_csk(sk)->icsk_bind2_hash) {
 		/* Not bind()ed before. */
-		inet_update_saddr(sk, saddr, family);
+		if (reset)
+			inet_reset_saddr(sk);
+		else
+			inet_update_saddr(sk, saddr, family);
+
 		return 0;
 	}
 
@@ -891,8 +895,19 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
 	 * allocation fails.
 	 */
 	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
-	if (!new_tb2)
+	if (!new_tb2) {
+		if (reset) {
+			/* The (INADDR_ANY, port) bucket might have already been
+			 * freed, then we cannot fixup icsk_bind2_hash, so we give
+			 * up and unlink sk from bhash/bhash2 not to fire WARN_ON()
+			 * in inet_csk_get_port().
+			 */
+			inet_put_port(sk);
+			inet_reset_saddr(sk);
+		}
+
 		return -ENOMEM;
+	}
 
 	/* Unlink first not to show the wrong address for other threads. */
 	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
@@ -902,7 +917,10 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
 	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
 	spin_unlock_bh(&head2->lock);
 
-	inet_update_saddr(sk, saddr, family);
+	if (reset)
+		inet_reset_saddr(sk);
+	else
+		inet_update_saddr(sk, saddr, family);
 
 	/* Update bhash2 bucket. */
 	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
@@ -922,8 +940,20 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
 
 	return 0;
 }
+
+int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
+{
+	return __inet_bhash2_update_saddr(sk, saddr, family, false);
+}
 EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
 
+void inet_bhash2_reset_saddr(struct sock *sk)
+{
+	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+		__inet_bhash2_update_saddr(sk, NULL, 0, true);
+}
+EXPORT_SYMBOL_GPL(inet_bhash2_reset_saddr);
+
 /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
  * Note that we use 32bit integers (vs RFC 'short integers')
  * because 2^16 is not a multiple of num_ephemeral and this
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 54836a6b81d6..4f2205756cfe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3114,8 +3114,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 
 	inet->inet_dport = 0;
 
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
-		inet_reset_saddr(sk);
+	inet_bhash2_reset_saddr(sk);
 
 	sk->sk_shutdown = 0;
 	sock_reset_flag(sk, SOCK_DONE);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 23dd7e9df2d5..da46357f501b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -331,8 +331,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	 * if necessary.
 	 */
 	tcp_set_state(sk, TCP_CLOSE);
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
-		inet_reset_saddr(sk);
+	inet_bhash2_reset_saddr(sk);
 	ip_rt_put(rt);
 	sk->sk_route_caps = 0;
 	inet->inet_dport = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2f3ca3190d26..f0548dbcabd2 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -346,8 +346,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 
 late_failure:
 	tcp_set_state(sk, TCP_CLOSE);
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
-		inet_reset_saddr(sk);
+	inet_bhash2_reset_saddr(sk);
 failure:
 	inet->inet_dport = 0;
 	sk->sk_route_caps = 0;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 2/4] dccp/tcp: Remove NULL check for prev_saddr in inet_bhash2_update_saddr().
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17  0:07     ` Joanne Koong
  -1 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-17  0:07 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Arnaldo Carvalho de Melo,
	Martin KaFai Lau, Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima, netdev, dccp

On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> When we call inet_bhash2_update_saddr(), prev_saddr is always non-NULL.
> Let's remove the unnecessary test.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Joanne Koong <joannelkoong@gmail.com>

> ---
>  net/ipv4/inet_hashtables.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 033bf3c2538f..d745f962745e 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -877,13 +877,10 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
>
>         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
>
> -       if (prev_saddr) {
> -               spin_lock_bh(&prev_saddr->lock);
> -               __sk_del_bind2_node(sk);
> -               inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep,
> -                                         inet_csk(sk)->icsk_bind2_hash);
> -               spin_unlock_bh(&prev_saddr->lock);
> -       }
> +       spin_lock_bh(&prev_saddr->lock);
> +       __sk_del_bind2_node(sk);
> +       inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> +       spin_unlock_bh(&prev_saddr->lock);
>
>         spin_lock_bh(&head2->lock);
>         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 1/4] dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17  0:11     ` Joanne Koong
  -1 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-17  0:11 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Arnaldo Carvalho de Melo,
	Martin KaFai Lau, Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima, netdev, dccp

On Wed, Nov 16, 2022 at 2:28 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> When connect() is called on a socket bound to the wildcard address,
> we change the socket's saddr to a local address.  If the socket
> fails to connect() to the destination, we have to reset the saddr.
>
> However, when an error occurs after inet6?_hash_connect() in
> (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
> the socket bound to the address.
>
> From the user's point of view, whether saddr is reset or not varies
> with errno.  Let's fix this inconsistent behaviour.
>
> Note that after this patch, the repro [0] will trigger the WARN_ON()
> in inet_csk_get_port() again.  This patch itself is not buggy; rather,
> it fixes a bug that was papering over the bhash2 bug, which needs a
> separate fix.
>
> For the record, the repro causes -EADDRNOTAVAIL in inet6_hash_connect()
> by this sequence:
>
>   s1 = socket()
>   s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>   s1.bind(('127.0.0.1', 10000))
>   s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
>   # or s1.connect(('127.0.0.1', 10000))
>
>   s2 = socket()
>   s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>   s2.bind(('0.0.0.0', 10000))
>   s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL
>
>   s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
>
> [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
>
> Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
> Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

LGTM. Btw, the 4th patch in this series overwrites these changes by
moving this logic into the new inet_bhash2_reset_saddr() function you
added, so we could also drop this patch from the series. OTOH, the
commit message of this patch has some good background context. So I
don't have a preference either way :)

Acked-by: Joanne Koong <joannelkoong@gmail.com>

> ---
>  net/dccp/ipv4.c     | 2 ++
>  net/dccp/ipv6.c     | 2 ++
>  net/ipv4/tcp_ipv4.c | 2 ++
>  net/ipv6/tcp_ipv6.c | 2 ++
>  4 files changed, 8 insertions(+)
>
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 713b7b8dad7e..40640c26680e 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -157,6 +157,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>          * This unhashes the socket and releases the local port, if necessary.
>          */
>         dccp_set_state(sk, DCCP_CLOSED);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>         ip_rt_put(rt);
>         sk->sk_route_caps = 0;
>         inet->inet_dport = 0;
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index e57b43006074..626166cb6d7e 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -985,6 +985,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>
>  late_failure:
>         dccp_set_state(sk, DCCP_CLOSED);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>         __sk_dst_reset(sk);
>  failure:
>         inet->inet_dport = 0;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 87d440f47a70..6a3a732b584d 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -343,6 +343,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>          * if necessary.
>          */
>         tcp_set_state(sk, TCP_CLOSE);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>         ip_rt_put(rt);
>         sk->sk_route_caps = 0;
>         inet->inet_dport = 0;
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 2a3f9296df1e..81b396e5cf79 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -359,6 +359,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>
>  late_failure:
>         tcp_set_state(sk, TCP_CLOSE);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>  failure:
>         inet->inet_dport = 0;
>         sk->sk_route_caps = 0;
> --
> 2.30.2
>
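
The inconsistency described in the quoted commit message reduces to a small decision table: which failure path ran, and whether the user locked the bind address. A minimal sketch follows, assuming a toy socket with only an address field and a flag standing in for SOCK_BINDADDR_LOCK. The toy_* names and the two-way split of failure paths are hypothetical simplifications, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of connect() error paths; names are hypothetical. */
enum toy_fail_point {
	TOY_FAIL_NONE,      /* connect() succeeds */
	TOY_FAIL_LATE,      /* after inet6?_hash_connect(): late_failure path */
	TOY_FAIL_STREAM,    /* in __inet_stream_connect() */
};

struct toy_sock {
	uint32_t saddr;         /* 0 means wildcard */
	int bindaddr_locked;    /* set when bind() used a specific address */
};

static void toy_reset_saddr(struct toy_sock *sk)
{
	if (!sk->bindaddr_locked)
		sk->saddr = 0;
}

/* 'patched' selects whether the late failure path resets saddr. */
static int toy_connect(struct toy_sock *sk, uint32_t route_saddr,
		       enum toy_fail_point fail, int patched)
{
	if (!sk->saddr)
		sk->saddr = route_saddr;    /* source picked from the route */

	switch (fail) {
	case TOY_FAIL_LATE:
		if (patched)
			toy_reset_saddr(sk);
		return -1;
	case TOY_FAIL_STREAM:
		toy_reset_saddr(sk);        /* via sk->sk_prot->disconnect() */
		return -1;
	case TOY_FAIL_NONE:
		break;
	}
	return 0;
}
```

With patched set, both failure paths leave a wildcard-bound socket with saddr reset to 0, while a socket bound to a specific address keeps it; without it, the outcome depends on where connect() failed.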

* Re: [PATCH v2 net 1/4] dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
@ 2022-11-17  0:11     ` Joanne Koong
From: Joanne Koong @ 2022-11-17  0:11 UTC
  To: dccp

On Wed, Nov 16, 2022 at 2:28 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> When connect() is called on a socket bound to the wildcard address,
> we change the socket's saddr to a local address.  If the socket
> fails to connect() to the destination, we have to reset the saddr.
>
> However, when an error occurs after inet6?_hash_connect() in
> (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
> the socket bound to the address.
>
> From the user's point of view, whether saddr is reset or not varies
> with errno.  Let's fix this inconsistent behaviour.
>
> Note that after this patch, the repro [0] will trigger the WARN_ON()
> in inet_csk_get_port() again.  This patch is not buggy, though: it
> fixes a bug that was papering over a bhash2 bug, which needs a
> separate fix.
>
> For the record, the repro causes -EADDRNOTAVAIL in inet6?_hash_connect()
> by this sequence:
>
>   s1 = socket()
>   s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>   s1.bind(('127.0.0.1', 10000))
>   s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
>   # or s1.connect(('127.0.0.1', 10000))
>
>   s2 = socket()
>   s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>   s2.bind(('0.0.0.0', 10000))
>   s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL
>
>   s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
>
> [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
>
> Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
> Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

LGTM. Btw, the 4th patch in this series overwrites these changes by
moving this logic into the new inet_bhash2_reset_saddr() function you
added, so we could also drop this patch from the series. OTOH, this
commit message in this patch has some good background context. So I
don't have a preference either way :)

Acked-by: Joanne Koong <joannelkoong@gmail.com>

> ---
>  net/dccp/ipv4.c     | 2 ++
>  net/dccp/ipv6.c     | 2 ++
>  net/ipv4/tcp_ipv4.c | 2 ++
>  net/ipv6/tcp_ipv6.c | 2 ++
>  4 files changed, 8 insertions(+)
>
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 713b7b8dad7e..40640c26680e 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -157,6 +157,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>          * This unhashes the socket and releases the local port, if necessary.
>          */
>         dccp_set_state(sk, DCCP_CLOSED);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>         ip_rt_put(rt);
>         sk->sk_route_caps = 0;
>         inet->inet_dport = 0;
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index e57b43006074..626166cb6d7e 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -985,6 +985,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>
>  late_failure:
>         dccp_set_state(sk, DCCP_CLOSED);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>         __sk_dst_reset(sk);
>  failure:
>         inet->inet_dport = 0;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 87d440f47a70..6a3a732b584d 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -343,6 +343,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>          * if necessary.
>          */
>         tcp_set_state(sk, TCP_CLOSE);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>         ip_rt_put(rt);
>         sk->sk_route_caps = 0;
>         inet->inet_dport = 0;
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 2a3f9296df1e..81b396e5cf79 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -359,6 +359,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>
>  late_failure:
>         tcp_set_state(sk, TCP_CLOSE);
> +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +               inet_reset_saddr(sk);
>  failure:
>         inet->inet_dport = 0;
>         sk->sk_route_caps = 0;
> --
> 2.30.2
>
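
For readers following the saddr logic in patch 1/4, here is a minimal userspace sketch of the rule it adds to the late-failure paths: after a failed connect(), the source address that connect() picked is cleared unless bind() locked it. All names here (`struct mock_sock`, `mock_bind()`, `mock_connect()`, `mock_connect_failed()`, `MOCK_BINDADDR_LOCK`, `MOCK_ADDR_ANY`) are hypothetical stand-ins for `struct sock`, `sk_userlocks`, `SOCK_BINDADDR_LOCK`, `INADDR_ANY` and `inet_reset_saddr()`; the model keeps a single IPv4 address field and ignores routing, hashing and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for this model only. */
#define MOCK_ADDR_ANY      0u     /* plays the role of INADDR_ANY */
#define MOCK_BINDADDR_LOCK 0x1u   /* plays the role of SOCK_BINDADDR_LOCK */

struct mock_sock {
	uint32_t saddr;          /* source address (inet_saddr/inet_rcv_saddr) */
	uint16_t sport;          /* local port, kept across a failed connect() */
	unsigned int userlocks;  /* plays the role of sk->sk_userlocks */
};

/* bind(): only a specific (non-wildcard) address locks saddr. */
static void mock_bind(struct mock_sock *sk, uint32_t addr, uint16_t port)
{
	assert(port != 0);       /* this model skips ephemeral port selection */
	sk->saddr = addr;
	sk->sport = port;
	sk->userlocks = (addr != MOCK_ADDR_ANY) ? MOCK_BINDADDR_LOCK : 0;
}

/* The check patch 1/4 adds to the late-failure paths of connect(). */
static void mock_connect_failed(struct mock_sock *sk)
{
	if (!(sk->userlocks & MOCK_BINDADDR_LOCK))
		sk->saddr = MOCK_ADDR_ANY;  /* plays the role of inet_reset_saddr() */
}

/* connect(): a wildcard-bound socket takes its source address from the
 * route; late_failure simulates an error after that point, such as
 * -EADDRNOTAVAIL from the hash_connect step. */
static int mock_connect(struct mock_sock *sk, uint32_t route_saddr,
			bool late_failure)
{
	if (sk->saddr == MOCK_ADDR_ANY)
		sk->saddr = route_saddr;
	if (late_failure) {
		mock_connect_failed(sk);
		return -1;
	}
	return 0;
}
```

Mirroring the repro sequence above, a socket bound to 0.0.0.0:10000 whose connect() fails late goes back to the wildcard address while keeping port 10000, whereas a socket bound to 127.0.0.1 keeps its address because bind() locked it.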

* Re: [PATCH v2 net 1/4] dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17  0:20       ` Kuniyuki Iwashima
From: Kuniyuki Iwashima @ 2022-11-17  0:20 UTC
  To: joannelkoong
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, kuniyu,
	martin.lau, mathew.j.martineau, netdev, pabeni, pengfei.xu,
	stephen, william.xuanziyang, yoshfuji

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Wed, 16 Nov 2022 16:11:21 -0800
> On Wed, Nov 16, 2022 at 2:28 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > When connect() is called on a socket bound to the wildcard address,
> > we change the socket's saddr to a local address.  If the socket
> > fails to connect() to the destination, we have to reset the saddr.
> >
> > However, when an error occurs after inet6?_hash_connect() in
> > (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
> > the socket bound to the address.
> >
> > From the user's point of view, whether saddr is reset or not varies
> > with errno.  Let's fix this inconsistent behaviour.
> >
> > Note that after this patch, the repro [0] will trigger the WARN_ON()
> > in inet_csk_get_port() again.  This patch is not buggy, though: it
> > fixes a bug that was papering over a bhash2 bug, which needs a
> > separate fix.
> >
> > For the record, the repro causes -EADDRNOTAVAIL in inet6?_hash_connect()
> > by this sequence:
> >
> >   s1 = socket()
> >   s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
> >   s1.bind(('127.0.0.1', 10000))
> >   s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
> >   # or s1.connect(('127.0.0.1', 10000))
> >
> >   s2 = socket()
> >   s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
> >   s2.bind(('0.0.0.0', 10000))
> >   s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL
> >
> >   s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
> >
> > [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
> >
> > Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
> > Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> 
> LGTM. Btw, the 4th patch in this series overwrites these changes by
> moving this logic into the new inet_bhash2_reset_saddr() function you
> added, so we could also drop this patch from the series. OTOH, this
> commit message in this patch has some good background context. So I
> don't have a preference either way :)
> 
> Acked-by: Joanne Koong <joannelkoong@gmail.com>

Thanks for reviewing!

Yes, these changes are overwritten later, but only this patch can be
backported to other stable versions, so I kept it separate.


> > ---
> >  net/dccp/ipv4.c     | 2 ++
> >  net/dccp/ipv6.c     | 2 ++
> >  net/ipv4/tcp_ipv4.c | 2 ++
> >  net/ipv6/tcp_ipv6.c | 2 ++
> >  4 files changed, 8 insertions(+)
> >
> > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > index 713b7b8dad7e..40640c26680e 100644
> > --- a/net/dccp/ipv4.c
> > +++ b/net/dccp/ipv4.c
> > @@ -157,6 +157,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >          * This unhashes the socket and releases the local port, if necessary.
> >          */
> >         dccp_set_state(sk, DCCP_CLOSED);
> > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > +               inet_reset_saddr(sk);
> >         ip_rt_put(rt);
> >         sk->sk_route_caps = 0;
> >         inet->inet_dport = 0;
> > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > index e57b43006074..626166cb6d7e 100644
> > --- a/net/dccp/ipv6.c
> > +++ b/net/dccp/ipv6.c
> > @@ -985,6 +985,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >
> >  late_failure:
> >         dccp_set_state(sk, DCCP_CLOSED);
> > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > +               inet_reset_saddr(sk);
> >         __sk_dst_reset(sk);
> >  failure:
> >         inet->inet_dport = 0;
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 87d440f47a70..6a3a732b584d 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -343,6 +343,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >          * if necessary.
> >          */
> >         tcp_set_state(sk, TCP_CLOSE);
> > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > +               inet_reset_saddr(sk);
> >         ip_rt_put(rt);
> >         sk->sk_route_caps = 0;
> >         inet->inet_dport = 0;
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 2a3f9296df1e..81b396e5cf79 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -359,6 +359,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >
> >  late_failure:
> >         tcp_set_state(sk, TCP_CLOSE);
> > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > +               inet_reset_saddr(sk);
> >  failure:
> >         inet->inet_dport = 0;
> >         sk->sk_route_caps = 0;
> > --
> > 2.30.2

* Re: [PATCH v2 net 1/4] dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17  0:43         ` Joanne Koong
From: Joanne Koong @ 2022-11-17  0:43 UTC
  To: Kuniyuki Iwashima
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, martin.lau,
	mathew.j.martineau, netdev, pabeni, pengfei.xu, stephen,
	william.xuanziyang, yoshfuji

On Wed, Nov 16, 2022 at 4:20 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Joanne Koong <joannelkoong@gmail.com>
> Date:   Wed, 16 Nov 2022 16:11:21 -0800
> > On Wed, Nov 16, 2022 at 2:28 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > When connect() is called on a socket bound to the wildcard address,
> > > we change the socket's saddr to a local address.  If the socket
> > > fails to connect() to the destination, we have to reset the saddr.
> > >
> > > However, when an error occurs after inet6?_hash_connect() in
> > > (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
> > > the socket bound to the address.
> > >
> > > From the user's point of view, whether saddr is reset or not varies
> > > with errno.  Let's fix this inconsistent behaviour.
> > >
> > > Note that after this patch, the repro [0] will trigger the WARN_ON()
> > > in inet_csk_get_port() again.  This patch is not buggy, though: it
> > > fixes a bug that was papering over a bhash2 bug, which needs a
> > > separate fix.
> > >
> > > For the record, the repro causes -EADDRNOTAVAIL in inet6?_hash_connect()
> > > by this sequence:
> > >
> > >   s1 = socket()
> > >   s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
> > >   s1.bind(('127.0.0.1', 10000))
> > >   s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
> > >   # or s1.connect(('127.0.0.1', 10000))
> > >
> > >   s2 = socket()
> > >   s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
> > >   s2.bind(('0.0.0.0', 10000))
> > >   s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL
> > >
> > >   s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
> > >
> > > [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
> > >
> > > Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6")
> > > Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
> > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >
> > LGTM. Btw, the 4th patch in this series overwrites these changes by
> > moving this logic into the new inet_bhash2_reset_saddr() function you
> > added, so we could also drop this patch from the series. OTOH, this
> > commit message in this patch has some good background context. So I
> > don't have a preference either way :)
> >
> > Acked-by: Joanne Koong <joannelkoong@gmail.com>
>
> Thanks for reviewing!
>
> Yes, these changes are overwritten later, but only this patch can be
> backported to other stable versions, so I kept it separate.
>

Gotcha, that makes sense! I will try to get my review of the other 2
patches in by tomorrow or Friday.

Thanks for your work on this!

>
> > > ---
> > >  net/dccp/ipv4.c     | 2 ++
> > >  net/dccp/ipv6.c     | 2 ++
> > >  net/ipv4/tcp_ipv4.c | 2 ++
> > >  net/ipv6/tcp_ipv6.c | 2 ++
> > >  4 files changed, 8 insertions(+)
> > >
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index 713b7b8dad7e..40640c26680e 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -157,6 +157,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >          * This unhashes the socket and releases the local port, if necessary.
> > >          */
> > >         dccp_set_state(sk, DCCP_CLOSED);
> > > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > +               inet_reset_saddr(sk);
> > >         ip_rt_put(rt);
> > >         sk->sk_route_caps = 0;
> > >         inet->inet_dport = 0;
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index e57b43006074..626166cb6d7e 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -985,6 +985,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >
> > >  late_failure:
> > >         dccp_set_state(sk, DCCP_CLOSED);
> > > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > +               inet_reset_saddr(sk);
> > >         __sk_dst_reset(sk);
> > >  failure:
> > >         inet->inet_dport = 0;
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 87d440f47a70..6a3a732b584d 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -343,6 +343,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >          * if necessary.
> > >          */
> > >         tcp_set_state(sk, TCP_CLOSE);
> > > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > +               inet_reset_saddr(sk);
> > >         ip_rt_put(rt);
> > >         sk->sk_route_caps = 0;
> > >         inet->inet_dport = 0;
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index 2a3f9296df1e..81b396e5cf79 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -359,6 +359,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >
> > >  late_failure:
> > >         tcp_set_state(sk, TCP_CLOSE);
> > > +       if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > +               inet_reset_saddr(sk);
> > >  failure:
> > >         inet->inet_dport = 0;
> > >         sk->sk_route_caps = 0;
> > > --
> > > 2.30.2


* Re: [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails.
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17  2:23     ` Pengfei Xu
From: Pengfei Xu @ 2022-11-17  2:23 UTC
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Arnaldo Carvalho de Melo,
	Joanne Koong, Martin KaFai Lau, Mat Martineau,
	Ziyang Xuan (William),
	Stephen Hemminger, Kuniyuki Iwashima, netdev, dccp, syzbot

Hi Kuniyuki Iwashima,

If you find the bisect commit or other info at the link below useful:
"https://lore.kernel.org/lkml/Y2xyHM1fcCkh9AKU@xpf.sh.intel.com/"
could you add one more Reported-by tag for me? If not, please ignore
this email.

Thanks!
BR.

On 2022-11-16 at 14:28:05 -0800, Kuniyuki Iwashima wrote:
> If a socket bound to a wildcard address fails to connect(), we
> only reset saddr and keep the port.  Then, we have to fix up the
> bhash2 bucket; otherwise, the bucket has an inconsistent address
> in the list.
> 
> Also, listen() for such a socket will fire the WARN_ON() in
> inet_csk_get_port(). [0]
> 
> Note that when a system runs out of memory, we give up fixing the
> bucket and unlink sk from bhash and bhash2 by inet_put_port().
> 
> [0]:
> WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> Modules linked in:
> CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
> RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
> RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
> RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
> RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
> RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
> R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
> FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
>  inet_listen (net/ipv4/af_inet.c:228)
>  __sys_listen (net/socket.c:1810)
>  __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
>  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> RIP: 0033:0x7f8ac051de5d
> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
> RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
> RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
> RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
> R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
> R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
>  </TASK>
> 
> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/inet_hashtables.h |  1 +
>  net/dccp/ipv4.c               |  3 +--
>  net/dccp/ipv6.c               |  3 +--
>  net/dccp/proto.c              |  3 +--
>  net/ipv4/inet_hashtables.c    | 38 +++++++++++++++++++++++++++++++----
>  net/ipv4/tcp.c                |  3 +--
>  net/ipv4/tcp_ipv4.c           |  3 +--
>  net/ipv6/tcp_ipv6.c           |  3 +--
>  8 files changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index ba06e8b52264..69174093078f 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -282,6 +282,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
>   * rcv_saddr field should already have been updated when this is called.
>   */
>  int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> +void inet_bhash2_reset_saddr(struct sock *sk);
>  
>  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
>  		    struct inet_bind2_bucket *tb2, unsigned short port);
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 95e376e3b911..b780827f5e0a 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -143,8 +143,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	 * This unhashes the socket and releases the local port, if necessary.
>  	 */
>  	dccp_set_state(sk, DCCP_CLOSED);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  	ip_rt_put(rt);
>  	sk->sk_route_caps = 0;
>  	inet->inet_dport = 0;
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index 94c101ed57a9..602f3432d80b 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -970,8 +970,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  
>  late_failure:
>  	dccp_set_state(sk, DCCP_CLOSED);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  	__sk_dst_reset(sk);
>  failure:
>  	inet->inet_dport = 0;
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index c548ca3e9b0e..85e35c5e8890 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -279,8 +279,7 @@ int dccp_disconnect(struct sock *sk, int flags)
>  
>  	inet->inet_dport = 0;
>  
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  
>  	sk->sk_shutdown = 0;
>  	sock_reset_flag(sk, SOCK_DONE);
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index dcb6bc918966..d24a04815f20 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -871,7 +871,7 @@ static void inet_update_saddr(struct sock *sk, void *saddr, int family)
>  	}
>  }
>  
> -int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> +static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
>  {
>  	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
>  	struct inet_bind2_bucket *tb2, *new_tb2;
> @@ -882,7 +882,11 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  
>  	if (!inet_csk(sk)->icsk_bind2_hash) {
>  		/* Not bind()ed before. */
> -		inet_update_saddr(sk, saddr, family);
> +		if (reset)
> +			inet_reset_saddr(sk);
> +		else
> +			inet_update_saddr(sk, saddr, family);
> +
>  		return 0;
>  	}
>  
> @@ -891,8 +895,19 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  	 * allocation fails.
>  	 */
>  	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
> -	if (!new_tb2)
> +	if (!new_tb2) {
> +		if (reset) {
> +			/* The (INADDR_ANY, port) bucket might have already been
> +			 * freed, then we cannot fixup icsk_bind2_hash, so we give
> +			 * up and unlink sk from bhash/bhash2 not to fire WARN_ON()
> +			 * in inet_csk_get_port().
> +			 */
> +			inet_put_port(sk);
> +			inet_reset_saddr(sk);
> +		}
> +
>  		return -ENOMEM;
> +	}
>  
>  	/* Unlink first not to show the wrong address for other threads. */
>  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> @@ -902,7 +917,10 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
>  	spin_unlock_bh(&head2->lock);
>  
> -	inet_update_saddr(sk, saddr, family);
> +	if (reset)
> +		inet_reset_saddr(sk);
> +	else
> +		inet_update_saddr(sk, saddr, family);
>  
>  	/* Update bhash2 bucket. */
>  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> @@ -922,8 +940,20 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  
>  	return 0;
>  }
> +
> +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> +{
> +	return __inet_bhash2_update_saddr(sk, saddr, family, false);
> +}
>  EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
>  
> +void inet_bhash2_reset_saddr(struct sock *sk)
> +{
> +	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +		__inet_bhash2_update_saddr(sk, NULL, 0, true);
> +}
> +EXPORT_SYMBOL_GPL(inet_bhash2_reset_saddr);
> +
>  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
>   * Note that we use 32bit integers (vs RFC 'short integers')
>   * because 2^16 is not a multiple of num_ephemeral and this
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 54836a6b81d6..4f2205756cfe 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3114,8 +3114,7 @@ int tcp_disconnect(struct sock *sk, int flags)
>  
>  	inet->inet_dport = 0;
>  
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  
>  	sk->sk_shutdown = 0;
>  	sock_reset_flag(sk, SOCK_DONE);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 23dd7e9df2d5..da46357f501b 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -331,8 +331,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	 * if necessary.
>  	 */
>  	tcp_set_state(sk, TCP_CLOSE);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  	ip_rt_put(rt);
>  	sk->sk_route_caps = 0;
>  	inet->inet_dport = 0;
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 2f3ca3190d26..f0548dbcabd2 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -346,8 +346,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  
>  late_failure:
>  	tcp_set_state(sk, TCP_CLOSE);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  failure:
>  	inet->inet_dport = 0;
>  	sk->sk_route_caps = 0;
> -- 
> 2.30.2
> 


* Re: [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails.
@ 2022-11-17  2:23     ` Pengfei Xu
From: Pengfei Xu @ 2022-11-17  2:23 UTC (permalink / raw)
  To: dccp

Hi Kuniyuki Iwashima,

If you find the bisected commit or other info at the link below useful:
"https://lore.kernel.org/lkml/Y2xyHM1fcCkh9AKU@xpf.sh.intel.com/"
could you add one more Reported-by tag for me? If not, please ignore this
email.

Thanks!
BR.

On 2022-11-16 at 14:28:05 -0800, Kuniyuki Iwashima wrote:
> If a socket bound to a wildcard address fails to connect(), we
> only reset saddr and keep the port.  Then, we have to fix up the
> bhash2 bucket; otherwise, the bucket has an inconsistent address
> in the list.
> 
> Also, listen() for such a socket will fire the WARN_ON() in
> inet_csk_get_port(). [0]
> 
> Note that when a system runs out of memory, we give up fixing the
> bucket and unlink sk from bhash and bhash2 by inet_put_port().
> 
> [0]:
> WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> Modules linked in:
> CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
> RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
> RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
> RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
> RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
> RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
> R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
> FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
>  inet_listen (net/ipv4/af_inet.c:228)
>  __sys_listen (net/socket.c:1810)
>  __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
>  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> RIP: 0033:0x7f8ac051de5d
> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
> RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
> RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
> RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
> R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
> R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
>  </TASK>
> 
> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/inet_hashtables.h |  1 +
>  net/dccp/ipv4.c               |  3 +--
>  net/dccp/ipv6.c               |  3 +--
>  net/dccp/proto.c              |  3 +--
>  net/ipv4/inet_hashtables.c    | 38 +++++++++++++++++++++++++++++++----
>  net/ipv4/tcp.c                |  3 +--
>  net/ipv4/tcp_ipv4.c           |  3 +--
>  net/ipv6/tcp_ipv6.c           |  3 +--
>  8 files changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index ba06e8b52264..69174093078f 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -282,6 +282,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
>   * rcv_saddr field should already have been updated when this is called.
>   */
>  int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> +void inet_bhash2_reset_saddr(struct sock *sk);
>  
>  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
>  		    struct inet_bind2_bucket *tb2, unsigned short port);
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 95e376e3b911..b780827f5e0a 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -143,8 +143,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	 * This unhashes the socket and releases the local port, if necessary.
>  	 */
>  	dccp_set_state(sk, DCCP_CLOSED);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  	ip_rt_put(rt);
>  	sk->sk_route_caps = 0;
>  	inet->inet_dport = 0;
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index 94c101ed57a9..602f3432d80b 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -970,8 +970,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  
>  late_failure:
>  	dccp_set_state(sk, DCCP_CLOSED);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  	__sk_dst_reset(sk);
>  failure:
>  	inet->inet_dport = 0;
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index c548ca3e9b0e..85e35c5e8890 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -279,8 +279,7 @@ int dccp_disconnect(struct sock *sk, int flags)
>  
>  	inet->inet_dport = 0;
>  
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  
>  	sk->sk_shutdown = 0;
>  	sock_reset_flag(sk, SOCK_DONE);
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index dcb6bc918966..d24a04815f20 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -871,7 +871,7 @@ static void inet_update_saddr(struct sock *sk, void *saddr, int family)
>  	}
>  }
>  
> -int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> +static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
>  {
>  	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
>  	struct inet_bind2_bucket *tb2, *new_tb2;
> @@ -882,7 +882,11 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  
>  	if (!inet_csk(sk)->icsk_bind2_hash) {
>  		/* Not bind()ed before. */
> -		inet_update_saddr(sk, saddr, family);
> +		if (reset)
> +			inet_reset_saddr(sk);
> +		else
> +			inet_update_saddr(sk, saddr, family);
> +
>  		return 0;
>  	}
>  
> @@ -891,8 +895,19 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  	 * allocation fails.
>  	 */
>  	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
> -	if (!new_tb2)
> +	if (!new_tb2) {
> +		if (reset) {
> +			/* The (INADDR_ANY, port) bucket might have already been
> +			 * freed, then we cannot fixup icsk_bind2_hash, so we give
> +			 * up and unlink sk from bhash/bhash2 not to fire WARN_ON()
> +			 * in inet_csk_get_port().
> +			 */
> +			inet_put_port(sk);
> +			inet_reset_saddr(sk);
> +		}
> +
>  		return -ENOMEM;
> +	}
>  
>  	/* Unlink first not to show the wrong address for other threads. */
>  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> @@ -902,7 +917,10 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
>  	spin_unlock_bh(&head2->lock);
>  
> -	inet_update_saddr(sk, saddr, family);
> +	if (reset)
> +		inet_reset_saddr(sk);
> +	else
> +		inet_update_saddr(sk, saddr, family);
>  
>  	/* Update bhash2 bucket. */
>  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> @@ -922,8 +940,20 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  
>  	return 0;
>  }
> +
> +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> +{
> +	return __inet_bhash2_update_saddr(sk, saddr, family, false);
> +}
>  EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
>  
> +void inet_bhash2_reset_saddr(struct sock *sk)
> +{
> +	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> +		__inet_bhash2_update_saddr(sk, NULL, 0, true);
> +}
> +EXPORT_SYMBOL_GPL(inet_bhash2_reset_saddr);
> +
>  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
>   * Note that we use 32bit integers (vs RFC 'short integers')
>   * because 2^16 is not a multiple of num_ephemeral and this
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 54836a6b81d6..4f2205756cfe 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3114,8 +3114,7 @@ int tcp_disconnect(struct sock *sk, int flags)
>  
>  	inet->inet_dport = 0;
>  
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  
>  	sk->sk_shutdown = 0;
>  	sock_reset_flag(sk, SOCK_DONE);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 23dd7e9df2d5..da46357f501b 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -331,8 +331,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  	 * if necessary.
>  	 */
>  	tcp_set_state(sk, TCP_CLOSE);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  	ip_rt_put(rt);
>  	sk->sk_route_caps = 0;
>  	inet->inet_dport = 0;
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 2f3ca3190d26..f0548dbcabd2 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -346,8 +346,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  
>  late_failure:
>  	tcp_set_state(sk, TCP_CLOSE);
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> -		inet_reset_saddr(sk);
> +	inet_bhash2_reset_saddr(sk);
>  failure:
>  	inet->inet_dport = 0;
>  	sk->sk_route_caps = 0;
> -- 
> 2.30.2
> 


* Re: [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails.
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17  3:20       ` Kuniyuki Iwashima
From: Kuniyuki Iwashima @ 2022-11-17  3:20 UTC (permalink / raw)
  To: pengfei.xu
  Cc: acme, davem, dccp, dsahern, edumazet, joannelkoong, kuba,
	kuni1840, kuniyu, martin.lau, mathew.j.martineau, netdev, pabeni,
	stephen, syzkaller, william.xuanziyang, yoshfuji

From:   Pengfei Xu <pengfei.xu@intel.com>
Date:   Thu, 17 Nov 2022 10:23:01 +0800
> Hi Kuniyuki Iwashima,
> 
> If you find the bisected commit or other info at the link below useful:
> "https://lore.kernel.org/lkml/Y2xyHM1fcCkh9AKU@xpf.sh.intel.com/"
> could you add one more Reported-by tag for me? If not, please ignore this
> email.

Hi,

Thanks for providing a repro and bisecting.  Sorry, I'm not subscribed to
LKML and hadn't noticed the thread until Stephen forwarded it to the
netdev mailing list today. [0]

The issue had been brought up for discussion [1] about two weeks before
that thread.  So I would recommend checking netdev first and CCing netdev
on your report when it is networking-related.

The issue was reported by Mat [2], me [1], Ziyang [3], and you, and all of
the reports were originally generated by syzkaller.

If we added all Reported-by tags, they would be:

Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reported-by: Ziyang Xuan (William) <william.xuanziyang@huawei.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>

But adding my own Reported-by sounds odd, so, considering the order, maybe
only the syzbot and Mat tags make sense?

Anyway, I'll leave the decision to maintainers.

[0]: https://lore.kernel.org/netdev/20221116085854.0dcfa44d@hermes.local/
[1]: https://lore.kernel.org/netdev/20221029001249.86337-1-kuniyu@amazon.com/
[2]: https://lore.kernel.org/netdev/4bae9df4-42c1-85c3-d350-119a151d29@linux.intel.com/
[3]: https://lore.kernel.org/netdev/4bd122d2-d606-b71e-dbe7-63fa293f0a73@huawei.com/

Thank you.

P.S. Please don't top-post on Linux mailing lists :)



> > +++ b/net/dccp/ipv6.c
> > @@ -970,8 +970,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >  
> >  late_failure:
> >  	dccp_set_state(sk, DCCP_CLOSED);
> > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > -		inet_reset_saddr(sk);
> > +	inet_bhash2_reset_saddr(sk);
> >  	__sk_dst_reset(sk);
> >  failure:
> >  	inet->inet_dport = 0;
> > diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> > index c548ca3e9b0e..85e35c5e8890 100644
> > --- a/net/dccp/proto.c
> > +++ b/net/dccp/proto.c
> > @@ -279,8 +279,7 @@ int dccp_disconnect(struct sock *sk, int flags)
> >  
> >  	inet->inet_dport = 0;
> >  
> > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > -		inet_reset_saddr(sk);
> > +	inet_bhash2_reset_saddr(sk);
> >  
> >  	sk->sk_shutdown = 0;
> >  	sock_reset_flag(sk, SOCK_DONE);
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index dcb6bc918966..d24a04815f20 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -871,7 +871,7 @@ static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> >  	}
> >  }
> >  
> > -int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > +static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
> >  {
> >  	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> >  	struct inet_bind2_bucket *tb2, *new_tb2;
> > @@ -882,7 +882,11 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> >  
> >  	if (!inet_csk(sk)->icsk_bind2_hash) {
> >  		/* Not bind()ed before. */
> > -		inet_update_saddr(sk, saddr, family);
> > +		if (reset)
> > +			inet_reset_saddr(sk);
> > +		else
> > +			inet_update_saddr(sk, saddr, family);
> > +
> >  		return 0;
> >  	}
> >  
> > @@ -891,8 +895,19 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> >  	 * allocation fails.
> >  	 */
> >  	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
> > -	if (!new_tb2)
> > +	if (!new_tb2) {
> > +		if (reset) {
> > +			/* The (INADDR_ANY, port) bucket might have already been
> > +			 * freed, then we cannot fixup icsk_bind2_hash, so we give
> > +			 * up and unlink sk from bhash/bhash2 not to fire WARN_ON()
> > +			 * in inet_csk_get_port().
> > +			 */
> > +			inet_put_port(sk);
> > +			inet_reset_saddr(sk);
> > +		}
> > +
> >  		return -ENOMEM;
> > +	}
> >  
> >  	/* Unlink first not to show the wrong address for other threads. */
> >  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > @@ -902,7 +917,10 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> >  	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> >  	spin_unlock_bh(&head2->lock);
> >  
> > -	inet_update_saddr(sk, saddr, family);
> > +	if (reset)
> > +		inet_reset_saddr(sk);
> > +	else
> > +		inet_update_saddr(sk, saddr, family);
> >  
> >  	/* Update bhash2 bucket. */
> >  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > @@ -922,8 +940,20 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> >  
> >  	return 0;
> >  }
> > +
> > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > +{
> > +	return __inet_bhash2_update_saddr(sk, saddr, family, false);
> > +}
> >  EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
> >  
> > +void inet_bhash2_reset_saddr(struct sock *sk)
> > +{
> > +	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > +		__inet_bhash2_update_saddr(sk, NULL, 0, true);
> > +}
> > +EXPORT_SYMBOL_GPL(inet_bhash2_reset_saddr);
> > +
> >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> >   * Note that we use 32bit integers (vs RFC 'short integers')
> >   * because 2^16 is not a multiple of num_ephemeral and this
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 54836a6b81d6..4f2205756cfe 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -3114,8 +3114,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> >  
> >  	inet->inet_dport = 0;
> >  
> > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > -		inet_reset_saddr(sk);
> > +	inet_bhash2_reset_saddr(sk);
> >  
> >  	sk->sk_shutdown = 0;
> >  	sock_reset_flag(sk, SOCK_DONE);
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 23dd7e9df2d5..da46357f501b 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -331,8 +331,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >  	 * if necessary.
> >  	 */
> >  	tcp_set_state(sk, TCP_CLOSE);
> > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > -		inet_reset_saddr(sk);
> > +	inet_bhash2_reset_saddr(sk);
> >  	ip_rt_put(rt);
> >  	sk->sk_route_caps = 0;
> >  	inet->inet_dport = 0;
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 2f3ca3190d26..f0548dbcabd2 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -346,8 +346,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >  
> >  late_failure:
> >  	tcp_set_state(sk, TCP_CLOSE);
> > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > -		inet_reset_saddr(sk);
> > +	inet_bhash2_reset_saddr(sk);
> >  failure:
> >  	inet->inet_dport = 0;
> >  	sk->sk_route_caps = 0;
> > -- 
> > 2.30.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails.
  2022-11-17  3:20       ` Kuniyuki Iwashima
@ 2022-11-17  5:02     ` Pengfei Xu
  -1 siblings, 0 replies; 36+ messages in thread
From: Pengfei Xu @ 2022-11-17  4:56 UTC (permalink / raw)
  To: dccp

Hi Kuniyuki Iwashima,

On 2022-11-16 at 19:20:14 -0800, Kuniyuki Iwashima wrote:
> From:   Pengfei Xu <pengfei.xu@intel.com>
> Date:   Thu, 17 Nov 2022 10:23:01 +0800
> > Hi Kuniyuki Iwashima,
> > 
> > If you consider the bisected commit or other info from the link below useful:
> > "https://lore.kernel.org/lkml/Y2xyHM1fcCkh9AKU@xpf.sh.intel.com/"
> > could you add one more Reported-by tag from me? If not, please ignore this
> > email.
> 
> Hi,
> 
> Thanks for providing a repro and bisecting, and sorry, I don't subscribe
> to LKML and hadn't noticed the thread until Stephen forwarded it to the
> netdev mailing list today. [0]
> 
> The issue was brought up for discussion [1] about two weeks before that
> thread.  So, I would recommend that you check netdev first and send a
> report CCing netdev if it is networking-related.
> 
> The issue was reported by Mat [2], me [1], Ziyang [3], and you, and all of
> the reports were originally generated by syzkaller.
> 
> If we added all Reported-by tags, they would be:
> 
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
> Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Reported-by: Ziyang Xuan (William) <william.xuanziyang@huawei.com>
> Reported-by: Pengfei Xu <pengfei.xu@intel.com>
> 
> But adding my own Reported-by sounds odd, so considering the order, only
> the syzbot and Mat tags make sense?
  Ah, that makes sense.

> 
> Anyway, I'll leave the decision to maintainers.
> 
> [0]: https://lore.kernel.org/netdev/20221116085854.0dcfa44d@hermes.local/
> [1]: https://lore.kernel.org/netdev/20221029001249.86337-1-kuniyu@amazon.com/
> [2]: https://lore.kernel.org/netdev/4bae9df4-42c1-85c3-d350-119a151d29@linux.intel.com/
> [3]: https://lore.kernel.org/netdev/4bd122d2-d606-b71e-dbe7-63fa293f0a73@huawei.com/
> 
> Thank you.
> 
> P.S. Please don't top-post on Linux mailing lists :)
Is this suggestion for me? I'm not sure what action I should take, if
any.
If not, please ignore.

Thanks!
BR.

> 
> 
> > 
> > Thanks!
> > BR.
> > 
> > On 2022-11-16 at 14:28:05 -0800, Kuniyuki Iwashima wrote:
> > > If a socket bound to a wildcard address fails to connect(), we
> > > only reset saddr and keep the port.  Then, we have to fix up the
> > > bhash2 bucket; otherwise, the bucket has an inconsistent address
> > > in the list.
> > > 
> > > Also, listen() for such a socket will fire the WARN_ON() in
> > > inet_csk_get_port(). [0]
> > > 
> > > Note that when a system runs out of memory, we give up fixing the
> > > bucket and unlink sk from bhash and bhash2 by inet_put_port().
> > > 
> > > [0]:
> > > WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> > > Modules linked in:
> > > CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
> > > RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> > > Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
> > > RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
> > > RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
> > > RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
> > > RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
> > > R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> > > R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
> > > FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > PKRU: 55555554
> > > Call Trace:
> > >  <TASK>
> > >  inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
> > >  inet_listen (net/ipv4/af_inet.c:228)
> > >  __sys_listen (net/socket.c:1810)
> > >  __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
> > >  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > >  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> > > RIP: 0033:0x7f8ac051de5d
> > > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> > > RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
> > > RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
> > > RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
> > > RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
> > > R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
> > > R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
> > >  </TASK>
> > > 
> > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  include/net/inet_hashtables.h |  1 +
> > >  net/dccp/ipv4.c               |  3 +--
> > >  net/dccp/ipv6.c               |  3 +--
> > >  net/dccp/proto.c              |  3 +--
> > >  net/ipv4/inet_hashtables.c    | 38 +++++++++++++++++++++++++++++++----
> > >  net/ipv4/tcp.c                |  3 +--
> > >  net/ipv4/tcp_ipv4.c           |  3 +--
> > >  net/ipv6/tcp_ipv6.c           |  3 +--
> > >  8 files changed, 41 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > index ba06e8b52264..69174093078f 100644
> > > --- a/include/net/inet_hashtables.h
> > > +++ b/include/net/inet_hashtables.h
> > > @@ -282,6 +282,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > >   * rcv_saddr field should already have been updated when this is called.
> > >   */
> > >  int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > +void inet_bhash2_reset_saddr(struct sock *sk);
> > >  
> > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > >  		    struct inet_bind2_bucket *tb2, unsigned short port);
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index 95e376e3b911..b780827f5e0a 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -143,8 +143,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  	 * This unhashes the socket and releases the local port, if necessary.
> > >  	 */
> > >  	dccp_set_state(sk, DCCP_CLOSED);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  	ip_rt_put(rt);
> > >  	sk->sk_route_caps = 0;
> > >  	inet->inet_dport = 0;
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index 94c101ed57a9..602f3432d80b 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -970,8 +970,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >  
> > >  late_failure:
> > >  	dccp_set_state(sk, DCCP_CLOSED);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  	__sk_dst_reset(sk);
> > >  failure:
> > >  	inet->inet_dport = 0;
> > > diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> > > index c548ca3e9b0e..85e35c5e8890 100644
> > > --- a/net/dccp/proto.c
> > > +++ b/net/dccp/proto.c
> > > @@ -279,8 +279,7 @@ int dccp_disconnect(struct sock *sk, int flags)
> > >  
> > >  	inet->inet_dport = 0;
> > >  
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  
> > >  	sk->sk_shutdown = 0;
> > >  	sock_reset_flag(sk, SOCK_DONE);
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index dcb6bc918966..d24a04815f20 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -871,7 +871,7 @@ static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > >  	}
> > >  }
> > >  
> > > -int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > +static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
> > >  {
> > >  	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > >  	struct inet_bind2_bucket *tb2, *new_tb2;
> > > @@ -882,7 +882,11 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  
> > >  	if (!inet_csk(sk)->icsk_bind2_hash) {
> > >  		/* Not bind()ed before. */
> > > -		inet_update_saddr(sk, saddr, family);
> > > +		if (reset)
> > > +			inet_reset_saddr(sk);
> > > +		else
> > > +			inet_update_saddr(sk, saddr, family);
> > > +
> > >  		return 0;
> > >  	}
> > >  
> > > @@ -891,8 +895,19 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  	 * allocation fails.
> > >  	 */
> > >  	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
> > > -	if (!new_tb2)
> > > +	if (!new_tb2) {
> > > +		if (reset) {
> > > +			/* The (INADDR_ANY, port) bucket might have already been
> > > +			 * freed, then we cannot fixup icsk_bind2_hash, so we give
> > > +			 * up and unlink sk from bhash/bhash2 not to fire WARN_ON()
> > > +			 * in inet_csk_get_port().
> > > +			 */
> > > +			inet_put_port(sk);
> > > +			inet_reset_saddr(sk);
> > > +		}
> > > +
> > >  		return -ENOMEM;
> > > +	}
> > >  
> > >  	/* Unlink first not to show the wrong address for other threads. */
> > >  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > @@ -902,7 +917,10 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > >  	spin_unlock_bh(&head2->lock);
> > >  
> > > -	inet_update_saddr(sk, saddr, family);
> > > +	if (reset)
> > > +		inet_reset_saddr(sk);
> > > +	else
> > > +		inet_update_saddr(sk, saddr, family);
> > >  
> > >  	/* Update bhash2 bucket. */
> > >  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > @@ -922,8 +940,20 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  
> > >  	return 0;
> > >  }
> > > +
> > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > +{
> > > +	return __inet_bhash2_update_saddr(sk, saddr, family, false);
> > > +}
> > >  EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
> > >  
> > > +void inet_bhash2_reset_saddr(struct sock *sk)
> > > +{
> > > +	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > +		__inet_bhash2_update_saddr(sk, NULL, 0, true);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inet_bhash2_reset_saddr);
> > > +
> > >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> > >   * Note that we use 32bit integers (vs RFC 'short integers')
> > >   * because 2^16 is not a multiple of num_ephemeral and this
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 54836a6b81d6..4f2205756cfe 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -3114,8 +3114,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> > >  
> > >  	inet->inet_dport = 0;
> > >  
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  
> > >  	sk->sk_shutdown = 0;
> > >  	sock_reset_flag(sk, SOCK_DONE);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 23dd7e9df2d5..da46357f501b 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -331,8 +331,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  	 * if necessary.
> > >  	 */
> > >  	tcp_set_state(sk, TCP_CLOSE);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  	ip_rt_put(rt);
> > >  	sk->sk_route_caps = 0;
> > >  	inet->inet_dport = 0;
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index 2f3ca3190d26..f0548dbcabd2 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -346,8 +346,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >  
> > >  late_failure:
> > >  	tcp_set_state(sk, TCP_CLOSE);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  failure:
> > >  	inet->inet_dport = 0;
> > >  	sk->sk_route_caps = 0;
> > > -- 
> > > 2.30.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails.
@ 2022-11-17  5:02     ` Pengfei Xu
  0 siblings, 0 replies; 36+ messages in thread
From: Pengfei Xu @ 2022-11-17  5:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: acme, davem, dccp, dsahern, edumazet, joannelkoong, kuba,
	kuni1840, martin.lau, mathew.j.martineau, netdev, pabeni,
	stephen, syzkaller, william.xuanziyang, yoshfuji

Hi Kuniyuki Iwashima,

On 2022-11-16 at 19:20:14 -0800, Kuniyuki Iwashima wrote:
> From:   Pengfei Xu <pengfei.xu@intel.com>
> Date:   Thu, 17 Nov 2022 10:23:01 +0800
> > Hi Kuniyuki Iwashima,
> > 
> > If you consider the bisected commit or other info from the link below useful:
> > "https://lore.kernel.org/lkml/Y2xyHM1fcCkh9AKU@xpf.sh.intel.com/"
> > could you add one more Reported-by tag from me? If not, please ignore this
> > email.
> 
> Hi,
> 
> Thanks for providing a repro and bisecting, and sorry, I don't subscribe
> to LKML and hadn't noticed the thread until Stephen forwarded it to the
> netdev mailing list today. [0]
> 
> The issue was brought up for discussion [1] about two weeks before that
> thread.  So, I would recommend that you check netdev first and send a
> report CCing netdev if it is networking-related.
> 
> The issue was reported by Mat [2], me [1], Ziyang [3], and you, and all of
> the reports were originally generated by syzkaller.
> 
> If we added all Reported-by tags, they would be:
> 
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
> Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Reported-by: Ziyang Xuan (William) <william.xuanziyang@huawei.com>
> Reported-by: Pengfei Xu <pengfei.xu@intel.com>
> 
> But adding my own Reported-by sounds odd, so considering the order, only
> the syzbot and Mat tags make sense?
  Ah, that makes sense.

> 
> Anyway, I'll leave the decision to maintainers.
> 
> [0]: https://lore.kernel.org/netdev/20221116085854.0dcfa44d@hermes.local/
> [1]: https://lore.kernel.org/netdev/20221029001249.86337-1-kuniyu@amazon.com/
> [2]: https://lore.kernel.org/netdev/4bae9df4-42c1-85c3-d350-119a151d29@linux.intel.com/
> [3]: https://lore.kernel.org/netdev/4bd122d2-d606-b71e-dbe7-63fa293f0a73@huawei.com/
> 
> Thank you.
> 
> P.S. Please don't top-post on Linux mailing lists :)
Is this suggestion for me? I'm not sure what action I should take, if
any.
If not, please ignore.

Thanks!
BR.

> 
> 
> > 
> > Thanks!
> > BR.
> > 
> > On 2022-11-16 at 14:28:05 -0800, Kuniyuki Iwashima wrote:
> > > If a socket bound to a wildcard address fails to connect(), we
> > > only reset saddr and keep the port.  Then, we have to fix up the
> > > bhash2 bucket; otherwise, the bucket has an inconsistent address
> > > in the list.
> > > 
> > > Also, listen() for such a socket will fire the WARN_ON() in
> > > inet_csk_get_port(). [0]
> > > 
> > > Note that when a system runs out of memory, we give up fixing the
> > > bucket and unlink sk from bhash and bhash2 by inet_put_port().
> > > 
> > > [0]:
> > > WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> > > Modules linked in:
> > > CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
> > > RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
> > > Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
> > > RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
> > > RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
> > > RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
> > > RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
> > > R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> > > R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
> > > FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > PKRU: 55555554
> > > Call Trace:
> > >  <TASK>
> > >  inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
> > >  inet_listen (net/ipv4/af_inet.c:228)
> > >  __sys_listen (net/socket.c:1810)
> > >  __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
> > >  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > >  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> > > RIP: 0033:0x7f8ac051de5d
> > > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> > > RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
> > > RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
> > > RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
> > > RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
> > > R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
> > > R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
> > >  </TASK>
> > > 
> > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  include/net/inet_hashtables.h |  1 +
> > >  net/dccp/ipv4.c               |  3 +--
> > >  net/dccp/ipv6.c               |  3 +--
> > >  net/dccp/proto.c              |  3 +--
> > >  net/ipv4/inet_hashtables.c    | 38 +++++++++++++++++++++++++++++++----
> > >  net/ipv4/tcp.c                |  3 +--
> > >  net/ipv4/tcp_ipv4.c           |  3 +--
> > >  net/ipv6/tcp_ipv6.c           |  3 +--
> > >  8 files changed, 41 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > index ba06e8b52264..69174093078f 100644
> > > --- a/include/net/inet_hashtables.h
> > > +++ b/include/net/inet_hashtables.h
> > > @@ -282,6 +282,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > >   * rcv_saddr field should already have been updated when this is called.
> > >   */
> > >  int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > +void inet_bhash2_reset_saddr(struct sock *sk);
> > >  
> > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > >  		    struct inet_bind2_bucket *tb2, unsigned short port);
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index 95e376e3b911..b780827f5e0a 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -143,8 +143,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  	 * This unhashes the socket and releases the local port, if necessary.
> > >  	 */
> > >  	dccp_set_state(sk, DCCP_CLOSED);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  	ip_rt_put(rt);
> > >  	sk->sk_route_caps = 0;
> > >  	inet->inet_dport = 0;
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index 94c101ed57a9..602f3432d80b 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -970,8 +970,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >  
> > >  late_failure:
> > >  	dccp_set_state(sk, DCCP_CLOSED);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  	__sk_dst_reset(sk);
> > >  failure:
> > >  	inet->inet_dport = 0;
> > > diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> > > index c548ca3e9b0e..85e35c5e8890 100644
> > > --- a/net/dccp/proto.c
> > > +++ b/net/dccp/proto.c
> > > @@ -279,8 +279,7 @@ int dccp_disconnect(struct sock *sk, int flags)
> > >  
> > >  	inet->inet_dport = 0;
> > >  
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  
> > >  	sk->sk_shutdown = 0;
> > >  	sock_reset_flag(sk, SOCK_DONE);
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index dcb6bc918966..d24a04815f20 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -871,7 +871,7 @@ static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > >  	}
> > >  }
> > >  
> > > -int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > +static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
> > >  {
> > >  	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > >  	struct inet_bind2_bucket *tb2, *new_tb2;
> > > @@ -882,7 +882,11 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  
> > >  	if (!inet_csk(sk)->icsk_bind2_hash) {
> > >  		/* Not bind()ed before. */
> > > -		inet_update_saddr(sk, saddr, family);
> > > +		if (reset)
> > > +			inet_reset_saddr(sk);
> > > +		else
> > > +			inet_update_saddr(sk, saddr, family);
> > > +
> > >  		return 0;
> > >  	}
> > >  
> > > @@ -891,8 +895,19 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  	 * allocation fails.
> > >  	 */
> > >  	new_tb2 = kmem_cache_alloc(hinfo->bind2_bucket_cachep, GFP_ATOMIC);
> > > -	if (!new_tb2)
> > > +	if (!new_tb2) {
> > > +		if (reset) {
> > > +			/* The (INADDR_ANY, port) bucket might have already been
> > > +			 * freed, then we cannot fixup icsk_bind2_hash, so we give
> > > +			 * up and unlink sk from bhash/bhash2 not to fire WARN_ON()
> > > +			 * in inet_csk_get_port().
> > > +			 */
> > > +			inet_put_port(sk);
> > > +			inet_reset_saddr(sk);
> > > +		}
> > > +
> > >  		return -ENOMEM;
> > > +	}
> > >  
> > >  	/* Unlink first not to show the wrong address for other threads. */
> > >  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > @@ -902,7 +917,10 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  	inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > >  	spin_unlock_bh(&head2->lock);
> > >  
> > > -	inet_update_saddr(sk, saddr, family);
> > > +	if (reset)
> > > +		inet_reset_saddr(sk);
> > > +	else
> > > +		inet_update_saddr(sk, saddr, family);
> > >  
> > >  	/* Update bhash2 bucket. */
> > >  	head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > @@ -922,8 +940,20 @@ int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  
> > >  	return 0;
> > >  }
> > > +
> > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > +{
> > > +	return __inet_bhash2_update_saddr(sk, saddr, family, false);
> > > +}
> > >  EXPORT_SYMBOL_GPL(inet_bhash2_update_saddr);
> > >  
> > > +void inet_bhash2_reset_saddr(struct sock *sk)
> > > +{
> > > +	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > +		__inet_bhash2_update_saddr(sk, NULL, 0, true);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inet_bhash2_reset_saddr);
> > > +
> > >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> > >   * Note that we use 32bit integers (vs RFC 'short integers')
> > >   * because 2^16 is not a multiple of num_ephemeral and this
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 54836a6b81d6..4f2205756cfe 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -3114,8 +3114,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> > >  
> > >  	inet->inet_dport = 0;
> > >  
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  
> > >  	sk->sk_shutdown = 0;
> > >  	sock_reset_flag(sk, SOCK_DONE);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 23dd7e9df2d5..da46357f501b 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -331,8 +331,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  	 * if necessary.
> > >  	 */
> > >  	tcp_set_state(sk, TCP_CLOSE);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  	ip_rt_put(rt);
> > >  	sk->sk_route_caps = 0;
> > >  	inet->inet_dport = 0;
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index 2f3ca3190d26..f0548dbcabd2 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -346,8 +346,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >  
> > >  late_failure:
> > >  	tcp_set_state(sk, TCP_CLOSE);
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> > > -		inet_reset_saddr(sk);
> > > +	inet_bhash2_reset_saddr(sk);
> > >  failure:
> > >  	inet->inet_dport = 0;
> > >  	sk->sk_route_caps = 0;
> > > -- 
> > > 2.30.2
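The sketch below is a rough userspace model of the structure this patch gives __inet_bhash2_update_saddr(). One core function serves both the update path and the reset path. The reset is skipped when bind() pinned the address, and the new bucket is pre-allocated before any state is touched. Everything prefixed toy_ is hypothetical and only stands in for the kernel objects. TOY_BINDADDR_LOCK plays the role of SOCK_BINDADDR_LOCK, and clearing `hashed` stands in for inet_put_port().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for SOCK_BINDADDR_LOCK; not kernel API. */
#define TOY_BINDADDR_LOCK 0x4

struct toy_sock {
	uint32_t saddr;      /* 0 models INADDR_ANY */
	unsigned userlocks;  /* TOY_BINDADDR_LOCK if bind() pinned saddr */
	bool hashed;         /* models "still linked in bhash/bhash2" */
	uint32_t bucket_key; /* address the socket is currently hashed under */
};

/* Test hook so the allocation-failure path can be exercised. */
static bool toy_alloc_fails;

static void *toy_alloc_bucket(void)
{
	return toy_alloc_fails ? NULL : malloc(1);
}

/* Shared core for both paths, like __inet_bhash2_update_saddr(..., reset). */
static int toy_update_saddr_core(struct toy_sock *sk, uint32_t saddr, bool reset)
{
	void *new_bucket;

	if (!sk->hashed) {
		/* Never bind()ed: nothing to rehash, just set the address. */
		sk->saddr = reset ? 0 : saddr;
		return 0;
	}

	/* Pre-allocate so a failure cannot leave the table half-updated. */
	new_bucket = toy_alloc_bucket();
	if (!new_bucket) {
		if (reset) {
			/* Cannot rehash: drop out of the table entirely, then reset. */
			sk->hashed = false;
			sk->saddr = 0;
		}
		return -ENOMEM; /* non-reset failure leaves sk untouched */
	}

	sk->saddr = reset ? 0 : saddr;
	sk->bucket_key = sk->saddr; /* relinked under the new (port, addr) key */
	free(new_bucket);           /* toy: nothing to link, so release it */
	return 0;
}

static int toy_bhash2_update_saddr(struct toy_sock *sk, uint32_t saddr)
{
	return toy_update_saddr_core(sk, saddr, false);
}

static void toy_bhash2_reset_saddr(struct toy_sock *sk)
{
	/* Leave an address that bind() pinned explicitly alone. */
	if (!(sk->userlocks & TOY_BINDADDR_LOCK))
		toy_update_saddr_core(sk, 0, true);
}
```

The point the model tries to capture is that the reset path cannot simply fail. If the new bucket cannot be allocated, the socket is dropped from the table and its address is reset anyway. A failed normal update, by contrast, leaves the socket exactly as it was.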

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-17 21:32     ` Joanne Koong
  -1 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-17 21:32 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Arnaldo Carvalho de Melo,
	Martin KaFai Lau, Mat Martineau, Ziyang Xuan (William),
	Stephen Hemminger, Pengfei Xu, Kuniyuki Iwashima, netdev, dccp

On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> another thread iterating over the bhash2 bucket might see an inconsistent
> address.
>
> Let's update saddr after unlinking sk from the old bhash2 bucket.

I'm not sure whether this patch is necessary and I'm curious to hear
your thoughts. There's no adverse effect that comes from updating the
sk's saddr before calling inet_bhash2_update_saddr() in the current
code. Another thread can be iterating over the bhash2 bucket, but it
has no effect whether they see this new address or not (e.g. when they
are iterating through the bucket they are trying to check for bind
conflicts on another socket, and the sk having the new address doesn't
affect this). What are your thoughts?

>
> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/inet_hashtables.h |  2 +-
>  net/dccp/ipv4.c               | 22 ++++------------------
>  net/dccp/ipv6.c               | 23 ++++-------------------
>  net/ipv4/af_inet.c            | 11 +----------
>  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
>  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
>  net/ipv6/tcp_ipv6.c           | 19 +++----------------
>  7 files changed, 45 insertions(+), 83 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 3af1e927247d..ba06e8b52264 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
>   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
>   * rcv_saddr field should already have been updated when this is called.
>   */
> -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
>
>  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
>                     struct inet_bind2_bucket *tb2, unsigned short port);
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index 40640c26680e..95e376e3b911 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
>  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  {
>         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
>         struct inet_sock *inet = inet_sk(sk);
>         struct dccp_sock *dp = dccp_sk(sk);
>         __be16 orig_sport, orig_dport;
> +       __be32 daddr, nexthop;
>         struct flowi4 *fl4;
>         struct rtable *rt;
>         int err;
> @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>                 daddr = fl4->daddr;
>
>         if (inet->inet_saddr == 0) {
> -               if (inet_csk(sk)->icsk_bind2_hash) {
> -                       prev_addr_hashbucket =
> -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> -                                                     sock_net(sk),
> -                                                     inet->inet_num);
> -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> -               }
> -               inet->inet_saddr = fl4->saddr;
> -       }
> -
> -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> -
> -       if (prev_addr_hashbucket) {
> -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
>                 if (err) {
> -                       inet->inet_saddr = 0;
> -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
>                         ip_rt_put(rt);
>                         return err;
>                 }
> +       } else {
> +               sk_rcv_saddr_set(sk, inet->inet_saddr);
>         }
>
>         inet->inet_dport = usin->sin_port;
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index 626166cb6d7e..94c101ed57a9 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>         }
>
>         if (saddr == NULL) {
> -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> -               struct in6_addr prev_v6_rcv_saddr;
> -
> -               if (icsk->icsk_bind2_hash) {
> -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> -                                                                    sk, sock_net(sk),
> -                                                                    inet->inet_num);
> -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> -               }
> -
>                 saddr = &fl6.saddr;
> -               sk->sk_v6_rcv_saddr = *saddr;
> -
> -               if (prev_addr_hashbucket) {
> -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> -                       if (err) {
> -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> -                               goto failure;
> -                       }
> -               }
> +
> +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> +               if (err)
> +                       goto failure;
>         }
>
>         /* set the source address */
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 4728087c42a5..0da679411330 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
>
>  static int inet_sk_reselect_saddr(struct sock *sk)
>  {
> -       struct inet_bind_hashbucket *prev_addr_hashbucket;
>         struct inet_sock *inet = inet_sk(sk);
>         __be32 old_saddr = inet->inet_saddr;
>         __be32 daddr = inet->inet_daddr;
> @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
>                 return 0;
>         }
>
> -       prev_addr_hashbucket =
> -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> -                                     sock_net(sk), inet->inet_num);
> -
> -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> -
> -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
>         if (err) {
> -               inet->inet_saddr = old_saddr;
> -               inet->inet_rcv_saddr = old_saddr;
>                 ip_rt_put(rt);
>                 return err;
>         }
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index d745f962745e..dcb6bc918966 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
>         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
>  }
>
> -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> +{
> +#if IS_ENABLED(CONFIG_IPV6)
> +       if (family == AF_INET6) {
> +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> +       } else
> +#endif
> +       {
> +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> +       }
> +}
> +
> +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
>  {
>         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
>         struct inet_bind2_bucket *tb2, *new_tb2;
> @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
>         int port = inet_sk(sk)->inet_num;
>         struct net *net = sock_net(sk);
>
> +       if (!inet_csk(sk)->icsk_bind2_hash) {
> +               /* Not bind()ed before. */
> +               inet_update_saddr(sk, saddr, family);
> +               return 0;
> +       }
> +
>         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
>          * the bhash2 table in an inconsistent state if a new tb2 bucket
>          * allocation fails.
> @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
>         if (!new_tb2)
>                 return -ENOMEM;
>
> +       /* Unlink first not to show the wrong address for other threads. */
>         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
>
> -       spin_lock_bh(&prev_saddr->lock);
> +       spin_lock_bh(&head2->lock);
>         __sk_del_bind2_node(sk);
>         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> -       spin_unlock_bh(&prev_saddr->lock);
> +       spin_unlock_bh(&head2->lock);
> +
> +       inet_update_saddr(sk, saddr, family);
> +
> +       /* Update bhash2 bucket. */
> +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
>
>         spin_lock_bh(&head2->lock);
>         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 6a3a732b584d..23dd7e9df2d5 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
>  /* This will initiate an outgoing connection. */
>  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>  {
> -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
>         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
>         struct inet_timewait_death_row *tcp_death_row;
> -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
>         struct inet_sock *inet = inet_sk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct ip_options_rcu *inet_opt;
>         struct net *net = sock_net(sk);
>         __be16 orig_sport, orig_dport;
> +       __be32 daddr, nexthop;
>         struct flowi4 *fl4;
>         struct rtable *rt;
>         int err;
> @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
>
>         if (!inet->inet_saddr) {
> -               if (inet_csk(sk)->icsk_bind2_hash) {
> -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> -                                                                    sk, net, inet->inet_num);
> -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> -               }
> -               inet->inet_saddr = fl4->saddr;
> -       }
> -
> -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> -
> -       if (prev_addr_hashbucket) {
> -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
>                 if (err) {
> -                       inet->inet_saddr = 0;
> -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
>                         ip_rt_put(rt);
>                         return err;
>                 }
> +       } else {
> +               sk_rcv_saddr_set(sk, inet->inet_saddr);
>         }
>
>         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 81b396e5cf79..2f3ca3190d26 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
>
>         if (!saddr) {
> -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> -               struct in6_addr prev_v6_rcv_saddr;
> -
> -               if (icsk->icsk_bind2_hash) {
> -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> -                                                                    sk, net, inet->inet_num);
> -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> -               }
>                 saddr = &fl6.saddr;
> -               sk->sk_v6_rcv_saddr = *saddr;
>
> -               if (prev_addr_hashbucket) {
> -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> -                       if (err) {
> -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> -                               goto failure;
> -                       }
> -               }
> +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> +               if (err)
> +                       goto failure;
>         }
>
>         /* set the source address */
> --
> 2.30.2
>
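The sketch below illustrates why the unlink-then-update order also lets callers drop the prev_addr_hashbucket bookkeeping. If the helper unlinks before changing the address, it can find the old bucket from the socket itself. This is a minimal single-threaded toy with hypothetical toy_ names. toy_hash() is an arbitrary hash over (port, address) that loosely plays the role of inet_bhashfn_portaddr(), and the comments mark where the patch takes head2->lock.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 8

struct toy_sock {
	uint32_t saddr;
	uint16_t port;
	struct toy_sock *next; /* chain within one bucket */
};

struct toy_table {
	struct toy_sock *buckets[TOY_NBUCKETS];
};

/* Arbitrary toy hash over (port, address). */
static size_t toy_hash(uint16_t port, uint32_t saddr)
{
	return (size_t)((port * 31u) ^ saddr) % TOY_NBUCKETS;
}

static void toy_link(struct toy_table *t, struct toy_sock *sk)
{
	size_t h = toy_hash(sk->port, sk->saddr);

	sk->next = t->buckets[h];
	t->buckets[h] = sk;
}

/* Finds the bucket from the socket's CURRENT address. */
static void toy_unlink(struct toy_table *t, struct toy_sock *sk)
{
	struct toy_sock **pp = &t->buckets[toy_hash(sk->port, sk->saddr)];

	while (*pp && *pp != sk)
		pp = &(*pp)->next;
	if (*pp) {
		*pp = sk->next;
		sk->next = NULL;
	}
}

/* Same order as the patch: unlink with the old address, change saddr,
 * then hash again with the new address. */
static void toy_update_saddr(struct toy_table *t, struct toy_sock *sk, uint32_t saddr)
{
	toy_unlink(t, sk);  /* kernel: under the old head2->lock */
	sk->saddr = saddr;
	toy_link(t, sk);    /* kernel: under the new head2->lock */
}

static int toy_in_bucket(const struct toy_table *t, const struct toy_sock *sk, size_t h)
{
	for (const struct toy_sock *p = t->buckets[h]; p; p = p->next)
		if (p == sk)
			return 1;
	return 0;
}
```

Swapping the first two statements of toy_update_saddr() would make toy_unlink() hash with the new address. It would then miss the socket and leave it chained in the old bucket. The old callers avoided that situation by computing the previous bucket before changing saddr.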

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  2022-11-17 21:32     ` Joanne Koong
@ 2022-11-18  0:06     ` Kuniyuki Iwashima
  -1 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-17 23:59 UTC (permalink / raw)
  To: dccp

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Thu, 17 Nov 2022 13:32:18 -0800
> On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > another thread iterating over the bhash2 bucket might see an inconsistent

Sorry this should be just bhash       ^^^ here.

> > address.
> >
> > Let's update saddr after unlinking sk from the old bhash2 bucket.
> 
> I'm not sure whether this patch is necessary and I'm curious to hear
> your thoughts. There's no adverse effect that comes from updating the
> sk's saddr before calling inet_bhash2_update_saddr() in the current
> code. Another thread can be iterating over the bhash2 bucket, but it
> has no effect whether they see this new address or not (eg when they
> are iterating through the bucket they are trying to check for bind
> conflicts on another socket, and the sk having the new address doesn't
> affect this). What are your thoughts?

You are right; it seems I was confused.

I was thinking that a lockless change of saddr could result in a data race;
another process iterating over bhash might see a corrupted address.

So, we need to acquire the bhash lock before updating saddr, and then
the related code should be in inet_bhash2_update_saddr().

But I seem to have forgotten to add the lock part... :p
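One way to read the missing lock part follows from how bhash is keyed. bhash is hashed by port only, so changing saddr never moves the socket between bhash buckets. bhash readers do look at saddr, though, so the store can be done while holding the same bucket lock those readers take. The sketch below models that idea, with a C11 atomic_flag spinlock standing in for spin_lock_bh(). All toy_ names are hypothetical. The conflict rule (a wildcard address conflicts with everything) is deliberately simplified and ignores SO_REUSEADDR/SO_REUSEPORT.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal spinlock standing in for the bhash bucket lock. */
struct toy_spinlock {
	atomic_flag locked;
};

static void toy_lock(struct toy_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
		; /* spin */
}

static void toy_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

struct toy_sock {
	uint32_t saddr;         /* 0 models INADDR_ANY */
	struct toy_sock *next;
};

/* Port-only bucket: saddr changes never move a socket out of it. */
struct toy_port_bucket {
	struct toy_spinlock lock;
	struct toy_sock *head;
};

static void toy_bucket_init(struct toy_port_bucket *b)
{
	atomic_flag_clear(&b->lock.locked);
	b->head = NULL;
}

/* Reader: a simplified bind() conflict check walking the bucket under its lock. */
static bool toy_addr_in_use(struct toy_port_bucket *b, uint32_t saddr)
{
	bool found = false;

	toy_lock(&b->lock);
	for (struct toy_sock *p = b->head; p; p = p->next) {
		if (p->saddr == 0 || saddr == 0 || p->saddr == saddr) {
			found = true;
			break;
		}
	}
	toy_unlock(&b->lock);
	return found;
}

/* Writer: store saddr while holding the same lock the readers take. */
static void toy_set_saddr(struct toy_port_bucket *b, struct toy_sock *sk, uint32_t saddr)
{
	toy_lock(&b->lock);
	sk->saddr = saddr;
	toy_unlock(&b->lock);
}
```

For IPv6 the address is 16 bytes and is copied with plain stores. Holding the lock is therefore what guarantees a concurrent reader sees either the whole old address or the whole new one.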


> > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/inet_hashtables.h |  2 +-
> >  net/dccp/ipv4.c               | 22 ++++------------------
> >  net/dccp/ipv6.c               | 23 ++++-------------------
> >  net/ipv4/af_inet.c            | 11 +----------
> >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> >  7 files changed, 45 insertions(+), 83 deletions(-)
> >
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index 3af1e927247d..ba06e8b52264 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> >   * rcv_saddr field should already have been updated when this is called.
> >   */
> > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> >
> >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > index 40640c26680e..95e376e3b911 100644
> > --- a/net/dccp/ipv4.c
> > +++ b/net/dccp/ipv4.c
> > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >  {
> >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> >         struct inet_sock *inet = inet_sk(sk);
> >         struct dccp_sock *dp = dccp_sk(sk);
> >         __be16 orig_sport, orig_dport;
> > +       __be32 daddr, nexthop;
> >         struct flowi4 *fl4;
> >         struct rtable *rt;
> >         int err;
> > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >                 daddr = fl4->daddr;
> >
> >         if (inet->inet_saddr == 0) {
> > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket =
> > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > -                                                     sock_net(sk),
> > -                                                     inet->inet_num);
> > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > -               }
> > -               inet->inet_saddr = fl4->saddr;
> > -       }
> > -
> > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > -
> > -       if (prev_addr_hashbucket) {
> > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> >                 if (err) {
> > -                       inet->inet_saddr = 0;
> > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> >                         ip_rt_put(rt);
> >                         return err;
> >                 }
> > +       } else {
> > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> >         }
> >
> >         inet->inet_dport = usin->sin_port;
> > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > index 626166cb6d7e..94c101ed57a9 100644
> > --- a/net/dccp/ipv6.c
> > +++ b/net/dccp/ipv6.c
> > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >         }
> >
> >         if (saddr == NULL) {
> > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > -               struct in6_addr prev_v6_rcv_saddr;
> > -
> > -               if (icsk->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > -                                                                    sk, sock_net(sk),
> > -                                                                    inet->inet_num);
> > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > -               }
> > -
> >                 saddr = &fl6.saddr;
> > -               sk->sk_v6_rcv_saddr = *saddr;
> > -
> > -               if (prev_addr_hashbucket) {
> > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > -                       if (err) {
> > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > -                               goto failure;
> > -                       }
> > -               }
> > +
> > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > +               if (err)
> > +                       goto failure;
> >         }
> >
> >         /* set the source address */
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 4728087c42a5..0da679411330 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> >
> >  static int inet_sk_reselect_saddr(struct sock *sk)
> >  {
> > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> >         struct inet_sock *inet = inet_sk(sk);
> >         __be32 old_saddr = inet->inet_saddr;
> >         __be32 daddr = inet->inet_daddr;
> > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> >                 return 0;
> >         }
> >
> > -       prev_addr_hashbucket =
> > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > -                                     sock_net(sk), inet->inet_num);
> > -
> > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > -
> > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> >         if (err) {
> > -               inet->inet_saddr = old_saddr;
> > -               inet->inet_rcv_saddr = old_saddr;
> >                 ip_rt_put(rt);
> >                 return err;
> >         }
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index d745f962745e..dcb6bc918966 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> >  }
> >
> > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       if (family == AF_INET6) {
> > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > +       } else
> > +#endif
> > +       {
> > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > +       }
> > +}
> > +
> > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> >  {
> >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> >         struct inet_bind2_bucket *tb2, *new_tb2;
> > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> >         int port = inet_sk(sk)->inet_num;
> >         struct net *net = sock_net(sk);
> >
> > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > +               /* Not bind()ed before. */
> > +               inet_update_saddr(sk, saddr, family);
> > +               return 0;
> > +       }
> > +
> >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> >          * allocation fails.
> > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> >         if (!new_tb2)
> >                 return -ENOMEM;
> >
> > +       /* Unlink first not to show the wrong address for other threads. */
> >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> >
> > -       spin_lock_bh(&prev_saddr->lock);
> > +       spin_lock_bh(&head2->lock);
> >         __sk_del_bind2_node(sk);
> >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > -       spin_unlock_bh(&prev_saddr->lock);
> > +       spin_unlock_bh(&head2->lock);
> > +
> > +       inet_update_saddr(sk, saddr, family);
> > +
> > +       /* Update bhash2 bucket. */
> > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> >
> >         spin_lock_bh(&head2->lock);
> >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 6a3a732b584d..23dd7e9df2d5 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> >  /* This will initiate an outgoing connection. */
> >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >  {
> > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> >         struct inet_timewait_death_row *tcp_death_row;
> > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> >         struct inet_sock *inet = inet_sk(sk);
> >         struct tcp_sock *tp = tcp_sk(sk);
> >         struct ip_options_rcu *inet_opt;
> >         struct net *net = sock_net(sk);
> >         __be16 orig_sport, orig_dport;
> > +       __be32 daddr, nexthop;
> >         struct flowi4 *fl4;
> >         struct rtable *rt;
> >         int err;
> > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> >
> >         if (!inet->inet_saddr) {
> > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > -                                                                    sk, net, inet->inet_num);
> > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > -               }
> > -               inet->inet_saddr = fl4->saddr;
> > -       }
> > -
> > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > -
> > -       if (prev_addr_hashbucket) {
> > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> >                 if (err) {
> > -                       inet->inet_saddr = 0;
> > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> >                         ip_rt_put(rt);
> >                         return err;
> >                 }
> > +       } else {
> > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> >         }
> >
> >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 81b396e5cf79..2f3ca3190d26 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> >
> >         if (!saddr) {
> > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > -               struct in6_addr prev_v6_rcv_saddr;
> > -
> > -               if (icsk->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > -                                                                    sk, net, inet->inet_num);
> > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > -               }
> >                 saddr = &fl6.saddr;
> > -               sk->sk_v6_rcv_saddr = *saddr;
> >
> > -               if (prev_addr_hashbucket) {
> > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > -                       if (err) {
> > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > -                               goto failure;
> > -                       }
> > -               }
> > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > +               if (err)
> > +                       goto failure;
> >         }
> >
> >         /* set the source address */
> > --
> > 2.30.2

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
@ 2022-11-18  0:06     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-18  0:06 UTC
  To: joannelkoong
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, kuniyu,
	martin.lau, mathew.j.martineau, netdev, pabeni, pengfei.xu,
	stephen, william.xuanziyang, yoshfuji

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Thu, 17 Nov 2022 13:32:18 -0800
> On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > another thread iterating over the bhash2 bucket might see an inconsistent

Sorry this should be just bhash       ^^^ here.

> > address.
> >
> > Let's update saddr after unlinking sk from the old bhash2 bucket.
> 
> I'm not sure whether this patch is necessary and I'm curious to hear
> your thoughts. There's no adverse effect that comes from updating the
> sk's saddr before calling inet_bhash2_update_saddr() in the current
> code. Another thread can be iterating over the bhash2 bucket, but it
> has no effect whether they see this new address or not (eg when they
> are iterating through the bucket they are trying to check for bind
> conflicts on another socket, and the sk having the new address doesn't
> affect this). What are your thoughts?

You are right, it seems I was confused.

I was thinking that a lockless change of saddr could result in a data race;
another process iterating over bhash might see a corrupted address.

So, we need to acquire the bhash lock before updating saddr, and the
related code should then live in inet_bhash2_update_saddr().

But I seem to have forgotten to add the lock part... :p
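
For illustration, the ordering that the patch moves into
inet_bhash2_update_saddr() can be modelled in userspace. The three steps
are: unlink from the bucket hashed on the old address, store the new
address, and relink into the bucket hashed on the new address. This is a
hypothetical sketch, not kernel code. struct sock_model, struct bucket,
update_saddr() and the toy hash are made-up names. pthread mutexes stand
in for the bucket spinlocks, and the kernel's pre-allocation of the new
tb2 bucket (to survive -ENOMEM) is omitted. As in the patch, the saddr
store itself is not covered by any bucket lock, which is the gap
mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16

struct sock_model {
	uint32_t saddr;           /* models inet_saddr / sk_rcv_saddr */
	uint16_t port;            /* models inet_num */
	struct sock_model *next;  /* bucket chain link */
};

struct bucket {
	pthread_mutex_t lock;     /* models head2->lock (a spinlock in the kernel) */
	struct sock_model *head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].head = NULL;
	}
}

/* Toy (port, saddr) hash; the kernel uses inet_bhashfn_portaddr(). */
static struct bucket *bucket_for(uint16_t port, uint32_t saddr)
{
	return &table[(port ^ saddr ^ (saddr >> 16)) % NBUCKETS];
}

static void link_sk(struct sock_model *sk)
{
	struct bucket *b = bucket_for(sk->port, sk->saddr);

	pthread_mutex_lock(&b->lock);
	sk->next = b->head;
	b->head = sk;
	pthread_mutex_unlock(&b->lock);
}

/* Finds the bucket from sk's *current* saddr, so it must run before saddr changes. */
static void unlink_sk(struct sock_model *sk)
{
	struct bucket *b = bucket_for(sk->port, sk->saddr);

	pthread_mutex_lock(&b->lock);
	for (struct sock_model **pp = &b->head; *pp; pp = &(*pp)->next) {
		if (*pp == sk) {
			*pp = sk->next;
			break;
		}
	}
	sk->next = NULL;
	pthread_mutex_unlock(&b->lock);
}

static int bucket_contains(uint16_t port, uint32_t saddr, const struct sock_model *sk)
{
	struct bucket *b = bucket_for(port, saddr);
	int found = 0;

	pthread_mutex_lock(&b->lock);
	for (const struct sock_model *p = b->head; p; p = p->next)
		if (p == sk)
			found = 1;
	pthread_mutex_unlock(&b->lock);
	return found;
}

/* Same order as the v2 patch: unlink (old bucket), store saddr, relink (new bucket).
 * As in the patch, the store itself is not covered by any bucket lock. */
static void update_saddr(struct sock_model *sk, uint32_t new_saddr)
{
	assert(bucket_contains(sk->port, sk->saddr, sk));
	unlink_sk(sk);
	sk->saddr = new_saddr;
	link_sk(sk);
}
```

Because unlink_sk() derives the bucket from the socket's current saddr,
the unlink has to happen before the store. If saddr were updated first,
the old bucket could no longer be found from the socket alone. That is
why the pre-patch callers had to compute prev_addr_hashbucket themselves
before touching saddr.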


> > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/inet_hashtables.h |  2 +-
> >  net/dccp/ipv4.c               | 22 ++++------------------
> >  net/dccp/ipv6.c               | 23 ++++-------------------
> >  net/ipv4/af_inet.c            | 11 +----------
> >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> >  7 files changed, 45 insertions(+), 83 deletions(-)
> >
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index 3af1e927247d..ba06e8b52264 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> >   * rcv_saddr field should already have been updated when this is called.
> >   */
> > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> >
> >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > index 40640c26680e..95e376e3b911 100644
> > --- a/net/dccp/ipv4.c
> > +++ b/net/dccp/ipv4.c
> > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >  {
> >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> >         struct inet_sock *inet = inet_sk(sk);
> >         struct dccp_sock *dp = dccp_sk(sk);
> >         __be16 orig_sport, orig_dport;
> > +       __be32 daddr, nexthop;
> >         struct flowi4 *fl4;
> >         struct rtable *rt;
> >         int err;
> > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >                 daddr = fl4->daddr;
> >
> >         if (inet->inet_saddr == 0) {
> > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket =
> > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > -                                                     sock_net(sk),
> > -                                                     inet->inet_num);
> > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > -               }
> > -               inet->inet_saddr = fl4->saddr;
> > -       }
> > -
> > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > -
> > -       if (prev_addr_hashbucket) {
> > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> >                 if (err) {
> > -                       inet->inet_saddr = 0;
> > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> >                         ip_rt_put(rt);
> >                         return err;
> >                 }
> > +       } else {
> > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> >         }
> >
> >         inet->inet_dport = usin->sin_port;
> > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > index 626166cb6d7e..94c101ed57a9 100644
> > --- a/net/dccp/ipv6.c
> > +++ b/net/dccp/ipv6.c
> > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >         }
> >
> >         if (saddr == NULL) {
> > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > -               struct in6_addr prev_v6_rcv_saddr;
> > -
> > -               if (icsk->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > -                                                                    sk, sock_net(sk),
> > -                                                                    inet->inet_num);
> > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > -               }
> > -
> >                 saddr = &fl6.saddr;
> > -               sk->sk_v6_rcv_saddr = *saddr;
> > -
> > -               if (prev_addr_hashbucket) {
> > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > -                       if (err) {
> > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > -                               goto failure;
> > -                       }
> > -               }
> > +
> > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > +               if (err)
> > +                       goto failure;
> >         }
> >
> >         /* set the source address */
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 4728087c42a5..0da679411330 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> >
> >  static int inet_sk_reselect_saddr(struct sock *sk)
> >  {
> > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> >         struct inet_sock *inet = inet_sk(sk);
> >         __be32 old_saddr = inet->inet_saddr;
> >         __be32 daddr = inet->inet_daddr;
> > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> >                 return 0;
> >         }
> >
> > -       prev_addr_hashbucket =
> > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > -                                     sock_net(sk), inet->inet_num);
> > -
> > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > -
> > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> >         if (err) {
> > -               inet->inet_saddr = old_saddr;
> > -               inet->inet_rcv_saddr = old_saddr;
> >                 ip_rt_put(rt);
> >                 return err;
> >         }
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index d745f962745e..dcb6bc918966 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> >  }
> >
> > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       if (family == AF_INET6) {
> > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > +       } else
> > +#endif
> > +       {
> > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > +       }
> > +}
> > +
> > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> >  {
> >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> >         struct inet_bind2_bucket *tb2, *new_tb2;
> > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> >         int port = inet_sk(sk)->inet_num;
> >         struct net *net = sock_net(sk);
> >
> > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > +               /* Not bind()ed before. */
> > +               inet_update_saddr(sk, saddr, family);
> > +               return 0;
> > +       }
> > +
> >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> >          * allocation fails.
> > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> >         if (!new_tb2)
> >                 return -ENOMEM;
> >
> > +       /* Unlink first not to show the wrong address for other threads. */
> >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> >
> > -       spin_lock_bh(&prev_saddr->lock);
> > +       spin_lock_bh(&head2->lock);
> >         __sk_del_bind2_node(sk);
> >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > -       spin_unlock_bh(&prev_saddr->lock);
> > +       spin_unlock_bh(&head2->lock);
> > +
> > +       inet_update_saddr(sk, saddr, family);
> > +
> > +       /* Update bhash2 bucket. */
> > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> >
> >         spin_lock_bh(&head2->lock);
> >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 6a3a732b584d..23dd7e9df2d5 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> >  /* This will initiate an outgoing connection. */
> >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >  {
> > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> >         struct inet_timewait_death_row *tcp_death_row;
> > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> >         struct inet_sock *inet = inet_sk(sk);
> >         struct tcp_sock *tp = tcp_sk(sk);
> >         struct ip_options_rcu *inet_opt;
> >         struct net *net = sock_net(sk);
> >         __be16 orig_sport, orig_dport;
> > +       __be32 daddr, nexthop;
> >         struct flowi4 *fl4;
> >         struct rtable *rt;
> >         int err;
> > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> >
> >         if (!inet->inet_saddr) {
> > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > -                                                                    sk, net, inet->inet_num);
> > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > -               }
> > -               inet->inet_saddr = fl4->saddr;
> > -       }
> > -
> > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > -
> > -       if (prev_addr_hashbucket) {
> > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> >                 if (err) {
> > -                       inet->inet_saddr = 0;
> > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> >                         ip_rt_put(rt);
> >                         return err;
> >                 }
> > +       } else {
> > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> >         }
> >
> >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 81b396e5cf79..2f3ca3190d26 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> >
> >         if (!saddr) {
> > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > -               struct in6_addr prev_v6_rcv_saddr;
> > -
> > -               if (icsk->icsk_bind2_hash) {
> > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > -                                                                    sk, net, inet->inet_num);
> > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > -               }
> >                 saddr = &fl6.saddr;
> > -               sk->sk_v6_rcv_saddr = *saddr;
> >
> > -               if (prev_addr_hashbucket) {
> > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > -                       if (err) {
> > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > -                               goto failure;
> > -                       }
> > -               }
> > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > +               if (err)
> > +                       goto failure;
> >         }
> >
> >         /* set the source address */
> > --
> > 2.30.2

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-18  0:55       ` Joanne Koong
  -1 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-18  0:55 UTC
  To: Kuniyuki Iwashima
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, martin.lau,
	mathew.j.martineau, netdev, pabeni, pengfei.xu, stephen,
	william.xuanziyang, yoshfuji

On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Joanne Koong <joannelkoong@gmail.com>
> Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > another thread iterating over the bhash2 bucket might see an inconsistent
>
> Sorry this should be just bhash       ^^^ here.
>
> > > address.
> > >
> > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> >
> > I'm not sure whether this patch is necessary and I'm curious to hear
> > your thoughts. There's no adverse effect that comes from updating the
> > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > code. Another thread can be iterating over the bhash2 bucket, but it
> > has no effect whether they see this new address or not (eg when they
> > are iterating through the bucket they are trying to check for bind
> > conflicts on another socket, and the sk having the new address doesn't
> > affect this). What are your thoughts?
>
> You are right, it seems I was confused.
>
> I was thinking that a lockless change of saddr could result in a data race;
> another process iterating over bhash might see a corrupted address.
>
> So, we need to acquire the bhash lock before updating saddr, and the
> related code should then live in inet_bhash2_update_saddr().
>
> But I seem to have forgotten to add the lock part... :p

No worries! :) Is acquiring the bhash lock necessary before updating
saddr? I think the worst-case scenario (which would only happen very
rarely) is this: another process is iterating over bhash and reads the
address at the exact moment it is being updated in this function, so it
sees a corrupted address; that corrupted address happens to match the
other process's socket address, so that process rejects the bind
request.

That doesn't seem like a big deal in the rare event that it happens. In
my opinion, it's not worth making the common case slower by grabbing the
bhash lock just to handle it.

What are your thoughts?
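
To make that worst case concrete, here is a deterministic userspace
simulation. It is not kernel code; struct addr6, torn_snapshot() and
addr6_equal() are hypothetical names. It models an IPv6 address as four
32-bit words, like the s6_addr32 view of struct in6_addr, and interleaves
a writer's word-by-word update with a reader's snapshot. A 16-byte
sk_v6_rcv_saddr assignment is generally not a single atomic store, so a
concurrent reader can, in principle, observe a mix of old and new words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an IPv6 address as four 32-bit words
 * (compare the s6_addr32 view of struct in6_addr). */
struct addr6 {
	uint32_t w[4];
};

/*
 * Deterministically interleave a writer and a reader: the writer stores
 * words [0, split) of new_addr, the reader snapshots all four words, and
 * only then does the writer store words [split, 4). Returns the reader's view.
 */
static struct addr6 torn_snapshot(struct addr6 *shared,
				  const struct addr6 *new_addr, int split)
{
	struct addr6 snap;

	assert(split >= 0 && split <= 4);
	for (int i = 0; i < split; i++)
		shared->w[i] = new_addr->w[i];
	snap = *shared;			/* reader runs mid-update */
	for (int i = split; i < 4; i++)
		shared->w[i] = new_addr->w[i];
	return snap;
}

/* Equality check of the kind a bind-conflict test relies on. */
static bool addr6_equal(const struct addr6 *a, const struct addr6 *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Take an old address {0x20010db8, 0, 0, 1}, a new address
{0xfd000000, 0, 0, 2} and split = 1. The reader's snapshot is
{0xfd000000, 0, 0, 1}, which is neither the old nor the new address. If
another socket is bound to exactly that address, an equality-based
conflict check reports a clash that never existed. Once the writer
finishes, the match disappears. Holding the bucket lock around the store
would rule this out, at the cost of locking on every connect(), which is
the trade-off weighed above.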

>
>
> > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  include/net/inet_hashtables.h |  2 +-
> > >  net/dccp/ipv4.c               | 22 ++++------------------
> > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > >  net/ipv4/af_inet.c            | 11 +----------
> > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > >
> > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > index 3af1e927247d..ba06e8b52264 100644
> > > --- a/include/net/inet_hashtables.h
> > > +++ b/include/net/inet_hashtables.h
> > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > >   * rcv_saddr field should already have been updated when this is called.
> > >   */
> > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > >
> > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index 40640c26680e..95e376e3b911 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  {
> > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > >         struct inet_sock *inet = inet_sk(sk);
> > >         struct dccp_sock *dp = dccp_sk(sk);
> > >         __be16 orig_sport, orig_dport;
> > > +       __be32 daddr, nexthop;
> > >         struct flowi4 *fl4;
> > >         struct rtable *rt;
> > >         int err;
> > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >                 daddr = fl4->daddr;
> > >
> > >         if (inet->inet_saddr == 0) {
> > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket =
> > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > -                                                     sock_net(sk),
> > > -                                                     inet->inet_num);
> > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > -               }
> > > -               inet->inet_saddr = fl4->saddr;
> > > -       }
> > > -
> > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > -
> > > -       if (prev_addr_hashbucket) {
> > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > >                 if (err) {
> > > -                       inet->inet_saddr = 0;
> > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > >                         ip_rt_put(rt);
> > >                         return err;
> > >                 }
> > > +       } else {
> > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > >         }
> > >
> > >         inet->inet_dport = usin->sin_port;
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index 626166cb6d7e..94c101ed57a9 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >         }
> > >
> > >         if (saddr == NULL) {
> > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > -               struct in6_addr prev_v6_rcv_saddr;
> > > -
> > > -               if (icsk->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > -                                                                    sk, sock_net(sk),
> > > -                                                                    inet->inet_num);
> > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > -               }
> > > -
> > >                 saddr = &fl6.saddr;
> > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > -
> > > -               if (prev_addr_hashbucket) {
> > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > -                       if (err) {
> > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > -                               goto failure;
> > > -                       }
> > > -               }
> > > +
> > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > +               if (err)
> > > +                       goto failure;
> > >         }
> > >
> > >         /* set the source address */
> > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > index 4728087c42a5..0da679411330 100644
> > > --- a/net/ipv4/af_inet.c
> > > +++ b/net/ipv4/af_inet.c
> > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > >
> > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > >  {
> > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > >         struct inet_sock *inet = inet_sk(sk);
> > >         __be32 old_saddr = inet->inet_saddr;
> > >         __be32 daddr = inet->inet_daddr;
> > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > >                 return 0;
> > >         }
> > >
> > > -       prev_addr_hashbucket =
> > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > -                                     sock_net(sk), inet->inet_num);
> > > -
> > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > -
> > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > >         if (err) {
> > > -               inet->inet_saddr = old_saddr;
> > > -               inet->inet_rcv_saddr = old_saddr;
> > >                 ip_rt_put(rt);
> > >                 return err;
> > >         }
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index d745f962745e..dcb6bc918966 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > >  }
> > >
> > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > +{
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > +       if (family == AF_INET6) {
> > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > +       } else
> > > +#endif
> > > +       {
> > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > +       }
> > > +}
> > > +
> > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  {
> > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > >         int port = inet_sk(sk)->inet_num;
> > >         struct net *net = sock_net(sk);
> > >
> > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > +               /* Not bind()ed before. */
> > > +               inet_update_saddr(sk, saddr, family);
> > > +               return 0;
> > > +       }
> > > +
> > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > >          * allocation fails.
> > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > >         if (!new_tb2)
> > >                 return -ENOMEM;
> > >
> > > +       /* Unlink first not to show the wrong address for other threads. */
> > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > >
> > > -       spin_lock_bh(&prev_saddr->lock);
> > > +       spin_lock_bh(&head2->lock);
> > >         __sk_del_bind2_node(sk);
> > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > -       spin_unlock_bh(&prev_saddr->lock);
> > > +       spin_unlock_bh(&head2->lock);
> > > +
> > > +       inet_update_saddr(sk, saddr, family);
> > > +
> > > +       /* Update bhash2 bucket. */
> > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > >
> > >         spin_lock_bh(&head2->lock);
> > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > >  /* This will initiate an outgoing connection. */
> > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  {
> > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > >         struct inet_timewait_death_row *tcp_death_row;
> > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > >         struct inet_sock *inet = inet_sk(sk);
> > >         struct tcp_sock *tp = tcp_sk(sk);
> > >         struct ip_options_rcu *inet_opt;
> > >         struct net *net = sock_net(sk);
> > >         __be16 orig_sport, orig_dport;
> > > +       __be32 daddr, nexthop;
> > >         struct flowi4 *fl4;
> > >         struct rtable *rt;
> > >         int err;
> > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > >
> > >         if (!inet->inet_saddr) {
> > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > -                                                                    sk, net, inet->inet_num);
> > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > -               }
> > > -               inet->inet_saddr = fl4->saddr;
> > > -       }
> > > -
> > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > -
> > > -       if (prev_addr_hashbucket) {
> > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > >                 if (err) {
> > > -                       inet->inet_saddr = 0;
> > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > >                         ip_rt_put(rt);
> > >                         return err;
> > >                 }
> > > +       } else {
> > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > >         }
> > >
> > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index 81b396e5cf79..2f3ca3190d26 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > >
> > >         if (!saddr) {
> > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > -               struct in6_addr prev_v6_rcv_saddr;
> > > -
> > > -               if (icsk->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > -                                                                    sk, net, inet->inet_num);
> > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > -               }
> > >                 saddr = &fl6.saddr;
> > > -               sk->sk_v6_rcv_saddr = *saddr;
> > >
> > > -               if (prev_addr_hashbucket) {
> > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > -                       if (err) {
> > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > -                               goto failure;
> > > -                       }
> > > -               }
> > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > +               if (err)
> > > +                       goto failure;
> > >         }
> > >
> > >         /* set the source address */
> > > --
> > > 2.30.2
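
The reordering in this patch can be modelled outside the kernel. After the change, inet_bhash2_update_saddr() unlinks the socket from the bucket hashed by its old address, then changes saddr, then links it into the bucket hashed by the new address. The sketch below is a minimal userspace illustration under stated assumptions:
- model_sock, model_bucket, bucket_of(), link_sock() and update_saddr() are hypothetical stand-ins, not kernel APIs.
- A C11 atomic_flag spinlock plays the role of spin_lock_bh().
- The real function's preallocation of the new tb2 bucket and its bucket destruction are left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of the bhash2 re-hash order; not kernel code. */
#define NBUCKETS 16

struct model_sock {
	uint32_t saddr;          /* stands in for sk_rcv_saddr */
	uint16_t port;           /* stands in for inet_num */
	int bound;               /* stands in for icsk_bind2_hash != NULL */
	struct model_sock *next; /* bucket chain link */
};

struct model_bucket {
	atomic_flag lock;        /* stands in for the bucket spinlock */
	struct model_sock *head;
};

static struct model_bucket table[NBUCKETS];

static void bucket_lock(struct model_bucket *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;	/* spin */
}

static void bucket_unlock(struct model_bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static void table_init(void)
{
	for (int i = 0; i < NBUCKETS; i++) {
		atomic_flag_clear(&table[i].lock);
		table[i].head = NULL;
	}
}

/* The bucket depends on (port, saddr), like a bhash2 lookup. */
static struct model_bucket *bucket_of(const struct model_sock *sk)
{
	return &table[(sk->port ^ sk->saddr) % NBUCKETS];
}

static void link_sock(struct model_sock *sk)
{
	struct model_bucket *b = bucket_of(sk);

	bucket_lock(b);
	sk->next = b->head;
	b->head = sk;
	bucket_unlock(b);
	sk->bound = 1;
}

static void update_saddr(struct model_sock *sk, uint32_t new_saddr)
{
	struct model_bucket *b;
	struct model_sock **pp;

	if (!sk->bound) {	/* never bind()ed: nothing to re-hash */
		sk->saddr = new_saddr;
		return;
	}

	/* 1. Unlink from the bucket chosen by the OLD address. */
	b = bucket_of(sk);
	bucket_lock(b);
	for (pp = &b->head; *pp; pp = &(*pp)->next) {
		if (*pp == sk) {
			*pp = sk->next;
			break;
		}
	}
	bucket_unlock(b);

	/* 2. Only now change the address. */
	sk->saddr = new_saddr;

	/* 3. Link into the bucket chosen by the NEW address. */
	link_sock(sk);
}
```

In this model, a thread that walks a bucket under its lock only ever finds sk carrying an address that hashes to that bucket. The trade-off is a short window between steps 1 and 3 in which sk is in neither bucket.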

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
@ 2022-11-18  0:55       ` Joanne Koong
  0 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-18  0:55 UTC (permalink / raw)
  To: dccp

On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Joanne Koong <joannelkoong@gmail.com>
> Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > another thread iterating over the bhash2 bucket might see an inconsistent
>
> Sorry this should be just bhash       ^^^ here.
>
> > > address.
> > >
> > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> >
> > I'm not sure whether this patch is necessary and I'm curious to hear
> > your thoughts. There's no adverse effect that comes from updating the
> > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > code. Another thread can be iterating over the bhash2 bucket, but it
> > has no effect whether they see this new address or not (eg when they
> > are iterating through the bucket they are trying to check for bind
> > conflicts on another socket, and the sk having the new address doesn't
> > affect this). What are your thoughts?
>
> You are right, it seems I was confused.
>
> I was thinking that lockless change of saddr could result in data race;
> another process iterating over bhash might see a corrupted address.
>
> So, we need to acquire the bhash lock before updating saddr, and then
> related code should be in inet_bhash2_update_saddr().
>
> But I seem to have forgotten to add the lock part... :p

No worries! :) Is acquiring the bhash lock necessary before updating
saddr? I think the worst-case scenario (which would happen only very
rarely) is this: another process is iterating over bhash and reads the
address at the exact moment it is being updated in this function, so it
sees a corrupted address; that corrupted address happens to match the
other process's socket address, so the other process rejects its bind
request.

Even in that rare event, it doesn't seem like a big deal. In my
opinion, it's not worth making the common case slower by grabbing the
bhash lock to solve it.

What are your thoughts?
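
The worst case described above can be made concrete with a hypothetical 16-byte address. Suppose it changes from 2001:db8::1 to 2001:db8:1::2 while another process reads it without the lock. The sketch below is an illustration only, and its assumptions are these:
- struct addr6, torn_read() and bind_conflict() are made-up names.
- The split at byte 8 is just one possible interleaving.
- The conflict rule (same port, and either equal addresses or one of them is the wildcard) ignores SO_REUSEADDR, SO_REUSEPORT, bound devices and the rest of the kernel's real checks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte address, laid out like struct in6_addr. */
struct addr6 {
	uint8_t b[16];
};

static bool addr6_is_any(const struct addr6 *a)
{
	static const struct addr6 zero;	/* :: (all zeros) */

	return memcmp(a, &zero, sizeof(*a)) == 0;
}

/* Simplified conflict rule: same port, and same address or either is ::. */
static bool bind_conflict(uint16_t port_a, const struct addr6 *a,
			  uint16_t port_b, const struct addr6 *b)
{
	if (port_a != port_b)
		return false;
	return addr6_is_any(a) || addr6_is_any(b) ||
	       memcmp(a, b, sizeof(*a)) == 0;
}

/* What a lockless reader sees if it copies the first 8 bytes before the
 * writer runs and the last 8 bytes after it: half old, half new. */
static struct addr6 torn_read(const struct addr6 *old_addr,
			      const struct addr6 *new_addr)
{
	struct addr6 seen;

	memcpy(seen.b, old_addr->b, 8);
	memcpy(seen.b + 8, new_addr->b + 8, 8);
	return seen;
}
```

With old = 2001:db8::1 and new = 2001:db8:1::2, the torn value is 2001:db8::2. A third socket binding [2001:db8::2] to the same port would then be rejected, even though it conflicts with neither the old nor the new address. Hitting this requires the reader to land exactly inside the update, which is why it is rare.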

>
>
> > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  include/net/inet_hashtables.h |  2 +-
> > >  net/dccp/ipv4.c               | 22 ++++------------------
> > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > >  net/ipv4/af_inet.c            | 11 +----------
> > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > >
> > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > index 3af1e927247d..ba06e8b52264 100644
> > > --- a/include/net/inet_hashtables.h
> > > +++ b/include/net/inet_hashtables.h
> > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > >   * rcv_saddr field should already have been updated when this is called.
> > >   */
> > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > >
> > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > index 40640c26680e..95e376e3b911 100644
> > > --- a/net/dccp/ipv4.c
> > > +++ b/net/dccp/ipv4.c
> > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  {
> > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > >         struct inet_sock *inet = inet_sk(sk);
> > >         struct dccp_sock *dp = dccp_sk(sk);
> > >         __be16 orig_sport, orig_dport;
> > > +       __be32 daddr, nexthop;
> > >         struct flowi4 *fl4;
> > >         struct rtable *rt;
> > >         int err;
> > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >                 daddr = fl4->daddr;
> > >
> > >         if (inet->inet_saddr == 0) {
> > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket =
> > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > -                                                     sock_net(sk),
> > > -                                                     inet->inet_num);
> > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > -               }
> > > -               inet->inet_saddr = fl4->saddr;
> > > -       }
> > > -
> > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > -
> > > -       if (prev_addr_hashbucket) {
> > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > >                 if (err) {
> > > -                       inet->inet_saddr = 0;
> > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > >                         ip_rt_put(rt);
> > >                         return err;
> > >                 }
> > > +       } else {
> > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > >         }
> > >
> > >         inet->inet_dport = usin->sin_port;
> > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > index 626166cb6d7e..94c101ed57a9 100644
> > > --- a/net/dccp/ipv6.c
> > > +++ b/net/dccp/ipv6.c
> > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >         }
> > >
> > >         if (saddr == NULL) {
> > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > -               struct in6_addr prev_v6_rcv_saddr;
> > > -
> > > -               if (icsk->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > -                                                                    sk, sock_net(sk),
> > > -                                                                    inet->inet_num);
> > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > -               }
> > > -
> > >                 saddr = &fl6.saddr;
> > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > -
> > > -               if (prev_addr_hashbucket) {
> > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > -                       if (err) {
> > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > -                               goto failure;
> > > -                       }
> > > -               }
> > > +
> > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > +               if (err)
> > > +                       goto failure;
> > >         }
> > >
> > >         /* set the source address */
> > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > index 4728087c42a5..0da679411330 100644
> > > --- a/net/ipv4/af_inet.c
> > > +++ b/net/ipv4/af_inet.c
> > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > >
> > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > >  {
> > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > >         struct inet_sock *inet = inet_sk(sk);
> > >         __be32 old_saddr = inet->inet_saddr;
> > >         __be32 daddr = inet->inet_daddr;
> > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > >                 return 0;
> > >         }
> > >
> > > -       prev_addr_hashbucket =
> > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > -                                     sock_net(sk), inet->inet_num);
> > > -
> > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > -
> > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > >         if (err) {
> > > -               inet->inet_saddr = old_saddr;
> > > -               inet->inet_rcv_saddr = old_saddr;
> > >                 ip_rt_put(rt);
> > >                 return err;
> > >         }
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index d745f962745e..dcb6bc918966 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > >  }
> > >
> > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > +{
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > +       if (family == AF_INET6) {
> > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > +       } else
> > > +#endif
> > > +       {
> > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > +       }
> > > +}
> > > +
> > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > >  {
> > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > >         int port = inet_sk(sk)->inet_num;
> > >         struct net *net = sock_net(sk);
> > >
> > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > +               /* Not bind()ed before. */
> > > +               inet_update_saddr(sk, saddr, family);
> > > +               return 0;
> > > +       }
> > > +
> > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > >          * allocation fails.
> > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > >         if (!new_tb2)
> > >                 return -ENOMEM;
> > >
> > > +       /* Unlink first not to show the wrong address for other threads. */
> > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > >
> > > -       spin_lock_bh(&prev_saddr->lock);
> > > +       spin_lock_bh(&head2->lock);
> > >         __sk_del_bind2_node(sk);
> > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > -       spin_unlock_bh(&prev_saddr->lock);
> > > +       spin_unlock_bh(&head2->lock);
> > > +
> > > +       inet_update_saddr(sk, saddr, family);
> > > +
> > > +       /* Update bhash2 bucket. */
> > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > >
> > >         spin_lock_bh(&head2->lock);
> > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > >  /* This will initiate an outgoing connection. */
> > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >  {
> > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > >         struct inet_timewait_death_row *tcp_death_row;
> > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > >         struct inet_sock *inet = inet_sk(sk);
> > >         struct tcp_sock *tp = tcp_sk(sk);
> > >         struct ip_options_rcu *inet_opt;
> > >         struct net *net = sock_net(sk);
> > >         __be16 orig_sport, orig_dport;
> > > +       __be32 daddr, nexthop;
> > >         struct flowi4 *fl4;
> > >         struct rtable *rt;
> > >         int err;
> > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > >
> > >         if (!inet->inet_saddr) {
> > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > -                                                                    sk, net, inet->inet_num);
> > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > -               }
> > > -               inet->inet_saddr = fl4->saddr;
> > > -       }
> > > -
> > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > -
> > > -       if (prev_addr_hashbucket) {
> > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > >                 if (err) {
> > > -                       inet->inet_saddr = 0;
> > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > >                         ip_rt_put(rt);
> > >                         return err;
> > >                 }
> > > +       } else {
> > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > >         }
> > >
> > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index 81b396e5cf79..2f3ca3190d26 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > >
> > >         if (!saddr) {
> > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > -               struct in6_addr prev_v6_rcv_saddr;
> > > -
> > > -               if (icsk->icsk_bind2_hash) {
> > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > -                                                                    sk, net, inet->inet_num);
> > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > -               }
> > >                 saddr = &fl6.saddr;
> > > -               sk->sk_v6_rcv_saddr = *saddr;
> > >
> > > -               if (prev_addr_hashbucket) {
> > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > -                       if (err) {
> > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > -                               goto failure;
> > > -                       }
> > > -               }
> > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > +               if (err)
> > > +                       goto failure;
> > >         }
> > >
> > >         /* set the source address */
> > > --
> > > 2.30.2

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-18  1:08         ` Kuniyuki Iwashima
  -1 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-18  1:08 UTC (permalink / raw)
  To: joannelkoong
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, kuniyu,
	martin.lau, mathew.j.martineau, netdev, pabeni, pengfei.xu,
	stephen, william.xuanziyang, yoshfuji

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Thu, 17 Nov 2022 16:55:59 -0800
> On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Joanne Koong <joannelkoong@gmail.com>
> > Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > > another thread iterating over the bhash2 bucket might see an inconsistent
> >
> > Sorry this should be just bhash       ^^^ here.
> >
> > > > address.
> > > >
> > > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> > >
> > > I'm not sure whether this patch is necessary and I'm curious to hear
> > > your thoughts. There's no adverse effect that comes from updating the
> > > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > > code. Another thread can be iterating over the bhash2 bucket, but it
> > > has no effect whether they see this new address or not (eg when they
> > > are iterating through the bucket they are trying to check for bind
> > > conflicts on another socket, and the sk having the new address doesn't
> > > affect this). What are your thoughts?
> >
> > You are right, it seems I was confused.
> >
> > I was thinking that lockless change of saddr could result in data race;
> > another process iterating over bhash might see a corrupted address.
> >
> > So, we need to acquire the bhash lock before updating saddr, and then
> > related code should be in inet_bhash2_update_saddr().
> >
> > But I seem to have forgotten to add the lock part... :p
> 
> No worries! :) Is acquiring the bhash lock necessary before updating
> saddr? I think the worst-case scenario (which would happen only very
> rarely) is this: another process is iterating over bhash and reads the
> address at the exact moment it is being updated in this function, so it
> sees a corrupted address; that corrupted address happens to match the
> other process's socket address, so the other process rejects its bind
> request.
> 
> Even in that rare event, it doesn't seem like a big deal. In my
> opinion, it's not worth making the common case slower by grabbing the
> bhash lock to solve it.
> 
> What are your thoughts?

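In that sense, I think inet_bhash2_update_saddr() is not the common case.

For the IPv4 case, we need not acquire the lock; adding READ_ONCE()
and WRITE_ONCE() would be enough.  But we cannot do so for an IPv6
address: struct in6_addr is 16 bytes, while READ_ONCE()/WRITE_ONCE()
only support accesses of up to 8 bytes, so a lockless update could be
seen half-written.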

Also, I think netdev code often fixes such data races reported by
KCSAN.
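
The IPv4/IPv6 asymmetry above has a rough userspace analogy, offered as a sketch only. C11 relaxed atomics stand in for READ_ONCE()/WRITE_ONCE() on the 4-byte IPv4 address. A spinlock guards the 16-byte IPv6 address, which is too wide for a single tear-free access. v4_set(), v6_set() and the other names are hypothetical. In the thread, the lock being proposed is the bhash bucket lock, not a dedicated one like this.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* IPv4: 4 bytes fit one tear-free access; relaxed atomics play the role
 * of READ_ONCE()/WRITE_ONCE() here. */
static _Atomic uint32_t v4_saddr;

static void v4_set(uint32_t addr)
{
	atomic_store_explicit(&v4_saddr, addr, memory_order_relaxed);
}

static uint32_t v4_get(void)
{
	return atomic_load_explicit(&v4_saddr, memory_order_relaxed);
}

/* IPv6: 16 bytes; a struct copy may compile to several stores, so the
 * reader must take the same lock as the writer to avoid a half-updated value. */
struct v6_addr {
	uint8_t s6[16];
};

static struct v6_addr v6_saddr;
static atomic_flag v6_lock = ATOMIC_FLAG_INIT;

static void v6_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&v6_lock, memory_order_acquire))
		;	/* spin */
}

static void v6_lock_release(void)
{
	atomic_flag_clear_explicit(&v6_lock, memory_order_release);
}

static void v6_set(const struct v6_addr *addr)
{
	v6_lock_acquire();
	v6_saddr = *addr;
	v6_lock_release();
}

static void v6_get(struct v6_addr *out)
{
	v6_lock_acquire();
	*out = v6_saddr;
	v6_lock_release();
}
```

The IPv4 path stays lockless with annotated accesses, while every IPv6 reader and writer pays for the lock. That is the cost being weighed against how often inet_bhash2_update_saddr() actually runs.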


> > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > >  include/net/inet_hashtables.h |  2 +-
> > > >  net/dccp/ipv4.c               | 22 ++++------------------
> > > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > > >  net/ipv4/af_inet.c            | 11 +----------
> > > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > > >
> > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > index 3af1e927247d..ba06e8b52264 100644
> > > > --- a/include/net/inet_hashtables.h
> > > > +++ b/include/net/inet_hashtables.h
> > > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > > >   * rcv_saddr field should already have been updated when this is called.
> > > >   */
> > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > >
> > > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > index 40640c26680e..95e376e3b911 100644
> > > > --- a/net/dccp/ipv4.c
> > > > +++ b/net/dccp/ipv4.c
> > > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >  {
> > > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > >         struct inet_sock *inet = inet_sk(sk);
> > > >         struct dccp_sock *dp = dccp_sk(sk);
> > > >         __be16 orig_sport, orig_dport;
> > > > +       __be32 daddr, nexthop;
> > > >         struct flowi4 *fl4;
> > > >         struct rtable *rt;
> > > >         int err;
> > > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >                 daddr = fl4->daddr;
> > > >
> > > >         if (inet->inet_saddr == 0) {
> > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket =
> > > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > > -                                                     sock_net(sk),
> > > > -                                                     inet->inet_num);
> > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > -               }
> > > > -               inet->inet_saddr = fl4->saddr;
> > > > -       }
> > > > -
> > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > -
> > > > -       if (prev_addr_hashbucket) {
> > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > >                 if (err) {
> > > > -                       inet->inet_saddr = 0;
> > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > >                         ip_rt_put(rt);
> > > >                         return err;
> > > >                 }
> > > > +       } else {
> > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > >         }
> > > >
> > > >         inet->inet_dport = usin->sin_port;
> > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > index 626166cb6d7e..94c101ed57a9 100644
> > > > --- a/net/dccp/ipv6.c
> > > > +++ b/net/dccp/ipv6.c
> > > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >         }
> > > >
> > > >         if (saddr == NULL) {
> > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > -
> > > > -               if (icsk->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > > -                                                                    sk, sock_net(sk),
> > > > -                                                                    inet->inet_num);
> > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > -               }
> > > > -
> > > >                 saddr = &fl6.saddr;
> > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > -
> > > > -               if (prev_addr_hashbucket) {
> > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > -                       if (err) {
> > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > -                               goto failure;
> > > > -                       }
> > > > -               }
> > > > +
> > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > +               if (err)
> > > > +                       goto failure;
> > > >         }
> > > >
> > > >         /* set the source address */
> > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > index 4728087c42a5..0da679411330 100644
> > > > --- a/net/ipv4/af_inet.c
> > > > +++ b/net/ipv4/af_inet.c
> > > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > > >
> > > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > > >  {
> > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > > >         struct inet_sock *inet = inet_sk(sk);
> > > >         __be32 old_saddr = inet->inet_saddr;
> > > >         __be32 daddr = inet->inet_daddr;
> > > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > >                 return 0;
> > > >         }
> > > >
> > > > -       prev_addr_hashbucket =
> > > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > > -                                     sock_net(sk), inet->inet_num);
> > > > -
> > > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > > -
> > > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > > >         if (err) {
> > > > -               inet->inet_saddr = old_saddr;
> > > > -               inet->inet_rcv_saddr = old_saddr;
> > > >                 ip_rt_put(rt);
> > > >                 return err;
> > > >         }
> > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > index d745f962745e..dcb6bc918966 100644
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > > >  }
> > > >
> > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > > +{
> > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > +       if (family == AF_INET6) {
> > > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > > +       } else
> > > > +#endif
> > > > +       {
> > > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > > +       }
> > > > +}
> > > > +
> > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > >  {
> > > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > >         int port = inet_sk(sk)->inet_num;
> > > >         struct net *net = sock_net(sk);
> > > >
> > > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > > +               /* Not bind()ed before. */
> > > > +               inet_update_saddr(sk, saddr, family);
> > > > +               return 0;
> > > > +       }
> > > > +
> > > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > > >          * allocation fails.
> > > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > >         if (!new_tb2)
> > > >                 return -ENOMEM;
> > > >
> > > > +       /* Unlink first not to show the wrong address for other threads. */
> > > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > >
> > > > -       spin_lock_bh(&prev_saddr->lock);
> > > > +       spin_lock_bh(&head2->lock);
> > > >         __sk_del_bind2_node(sk);
> > > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > > -       spin_unlock_bh(&prev_saddr->lock);
> > > > +       spin_unlock_bh(&head2->lock);
> > > > +
> > > > +       inet_update_saddr(sk, saddr, family);
> > > > +
> > > > +       /* Update bhash2 bucket. */
> > > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > >
> > > >         spin_lock_bh(&head2->lock);
> > > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >  /* This will initiate an outgoing connection. */
> > > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >  {
> > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > >         struct inet_timewait_death_row *tcp_death_row;
> > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > >         struct inet_sock *inet = inet_sk(sk);
> > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > >         struct ip_options_rcu *inet_opt;
> > > >         struct net *net = sock_net(sk);
> > > >         __be16 orig_sport, orig_dport;
> > > > +       __be32 daddr, nexthop;
> > > >         struct flowi4 *fl4;
> > > >         struct rtable *rt;
> > > >         int err;
> > > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > >
> > > >         if (!inet->inet_saddr) {
> > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > -                                                                    sk, net, inet->inet_num);
> > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > -               }
> > > > -               inet->inet_saddr = fl4->saddr;
> > > > -       }
> > > > -
> > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > -
> > > > -       if (prev_addr_hashbucket) {
> > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > >                 if (err) {
> > > > -                       inet->inet_saddr = 0;
> > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > >                         ip_rt_put(rt);
> > > >                         return err;
> > > >                 }
> > > > +       } else {
> > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > >         }
> > > >
> > > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > index 81b396e5cf79..2f3ca3190d26 100644
> > > > --- a/net/ipv6/tcp_ipv6.c
> > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > >
> > > >         if (!saddr) {
> > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > -
> > > > -               if (icsk->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > -                                                                    sk, net, inet->inet_num);
> > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > -               }
> > > >                 saddr = &fl6.saddr;
> > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > >
> > > > -               if (prev_addr_hashbucket) {
> > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > -                       if (err) {
> > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > -                               goto failure;
> > > > -                       }
> > > > -               }
> > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > +               if (err)
> > > > +                       goto failure;
> > > >         }
> > > >
> > > >         /* set the source address */
> > > > --
> > > > 2.30.2
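For readers following the ordering argument, here is a minimal userspace sketch of the unlink, update, relink sequence that the reworked inet_bhash2_update_saddr() follows. Everything in it (toy_sock, toy_bucket, toy_update_saddr(), the 16-bucket table and its hash) is a hypothetical stand-in, not kernel API. The real function also preallocates the new bind2 bucket so that it can fail with -ENOMEM before touching the table; this toy omits allocation entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NBUCKETS 16

/* Hypothetical toy socket: port + address pick the bucket, like bhash2. */
struct toy_sock {
	uint16_t port;
	uint32_t addr;		/* stands in for the rcv_saddr */
	struct toy_sock *next;	/* bucket chain link */
	int hashed;		/* stands in for icsk_bind2_hash != NULL */
};

struct toy_bucket {
	pthread_mutex_t lock;
	struct toy_sock *head;
};

static struct toy_bucket table[NBUCKETS];

static void toy_table_init(void)
{
	for (int i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].head = NULL;
	}
}

/* The bucket is a function of the socket's *current* address. */
static struct toy_bucket *toy_bucket_of(const struct toy_sock *sk)
{
	return &table[(sk->port ^ sk->addr) % NBUCKETS];
}

static void toy_link(struct toy_sock *sk)
{
	struct toy_bucket *b = toy_bucket_of(sk);

	pthread_mutex_lock(&b->lock);
	sk->next = b->head;
	b->head = sk;
	sk->hashed = 1;
	pthread_mutex_unlock(&b->lock);
}

/* Caller holds b->lock; sk must be chained on b. */
static void toy_unlink_locked(struct toy_bucket *b, struct toy_sock *sk)
{
	struct toy_sock **pp = &b->head;

	while (*pp && *pp != sk)
		pp = &(*pp)->next;
	assert(*pp == sk);	/* fires if sk is not on the bucket we computed */
	*pp = sk->next;
	sk->next = NULL;
	sk->hashed = 0;
}

/* Same order as the patch: unlink from the bucket the old address hashes
 * to, change the address, then link into the new address's bucket. */
static void toy_update_saddr(struct toy_sock *sk, uint32_t new_addr)
{
	struct toy_bucket *b;

	if (!sk->hashed) {
		/* Not bound yet: nothing to move, just record the address. */
		sk->addr = new_addr;
		return;
	}

	b = toy_bucket_of(sk);	/* computed from the old address */
	pthread_mutex_lock(&b->lock);
	toy_unlink_locked(b, sk);
	pthread_mutex_unlock(&b->lock);

	sk->addr = new_addr;	/* sk is on no chain at this point */
	toy_link(sk);		/* bucket computed from the new address */
}
```

In this model the assert() in toy_unlink_locked() shows why the order matters: if the address were changed first, toy_bucket_of() would compute the new bucket and the socket would not be found there.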

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
@ 2022-11-18  1:08         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-18  1:08 UTC (permalink / raw)
  To: dccp

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Thu, 17 Nov 2022 16:55:59 -0800
> On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Joanne Koong <joannelkoong@gmail.com>
> > Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > > another thread iterating over the bhash2 bucket might see an inconsistent
> >
> > Sorry this should be just bhash       ^^^ here.
> >
> > > > address.
> > > >
> > > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> > >
> > > I'm not sure whether this patch is necessary and I'm curious to hear
> > > your thoughts. There's no adverse effect that comes from updating the
> > > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > > code. Another thread can be iterating over the bhash2 bucket, but it
> > > makes no difference whether they see this new address or not (e.g. when
> > > they are iterating through the bucket, they are trying to check for bind
> > > conflicts on another socket, and the sk having the new address doesn't
> > > affect this). What are your thoughts?
> >
> > You are right, it seems I was confused.
> >
> > I was thinking that a lockless change of saddr could result in a data race;
> > another process iterating over bhash might see a corrupted address.
> >
> > So, we need to acquire the bhash lock before updating saddr, and the
> > related code should then live in inet_bhash2_update_saddr().
> >
> > But I seem to have forgotten to add the lock part... :p
> 
> No worries! :) Is acquiring the bhash lock necessary before updating
> saddr? I think the worst-case scenario (which would only happen very
> rarely) is that another process iterating over bhash reads the address
> at the exact moment it is being updated in this function, sees a
> corrupted address, and that corrupted address happens to match its own
> socket's address, causing it to reject the bind request.
> 
> It doesn't seem like a big deal in the rare event that it happens. In
> my opinion, it's not worth solving by making the common case slower
> with the bhash lock.
> 
> What are your thoughts?

In that sense, inet_bhash2_update_saddr() is not the common case, I think.

For the IPv4 case, we need not acquire the lock.  Adding READ_ONCE()
and WRITE_ONCE() would be enough, but we cannot do so for the 16-byte
IPv6 address.

Also, I think netdev code often fixes such data races reported by
KCSAN.
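To illustrate the IPv4/IPv6 difference, here is a minimal userspace sketch that assumes C11 relaxed atomics as a stand-in for WRITE_ONCE()/READ_ONCE(); they are not the kernel macros. As far as I know, the kernel's READ_ONCE()/WRITE_ONCE() reject types wider than 8 bytes at build time, so a 32-bit IPv4 address can be published with a single access while a 16-byte IPv6 address cannot, and needs a lock to rule out torn reads. All names (sketch_sock, set_v4(), get_v6(), v6_equal()) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Same size as struct in6_addr; hypothetical userspace stand-in. */
struct v6addr {
	uint8_t s6[16];
};

static_assert(sizeof(uint32_t) <= sizeof(long long),
	      "an IPv4 address fits in one READ_ONCE()-sized access");
static_assert(sizeof(struct v6addr) > sizeof(long long),
	      "an IPv6 address does not");

struct sketch_sock {
	_Atomic uint32_t v4_saddr;	/* one relaxed atomic access suffices */
	struct v6addr v6_saddr;		/* protected by v6_lock */
	pthread_mutex_t v6_lock;
};

static void sketch_sock_init(struct sketch_sock *sk)
{
	atomic_init(&sk->v4_saddr, 0);
	memset(&sk->v6_saddr, 0, sizeof(sk->v6_saddr));
	pthread_mutex_init(&sk->v6_lock, NULL);
}

/* IPv4: analogue of WRITE_ONCE()/READ_ONCE(); no torn values possible. */
static void set_v4(struct sketch_sock *sk, uint32_t addr)
{
	atomic_store_explicit(&sk->v4_saddr, addr, memory_order_relaxed);
}

static uint32_t get_v4(struct sketch_sock *sk)
{
	return atomic_load_explicit(&sk->v4_saddr, memory_order_relaxed);
}

/* IPv6: a 16-byte copy is not guaranteed to be a single access, so the
 * writer and every reader take the same lock. */
static void set_v6(struct sketch_sock *sk, const struct v6addr *addr)
{
	pthread_mutex_lock(&sk->v6_lock);
	sk->v6_saddr = *addr;
	pthread_mutex_unlock(&sk->v6_lock);
}

static void get_v6(struct sketch_sock *sk, struct v6addr *out)
{
	pthread_mutex_lock(&sk->v6_lock);
	*out = sk->v6_saddr;
	pthread_mutex_unlock(&sk->v6_lock);
}

/* The kind of comparison a bind-conflict check performs. */
static int v6_equal(const struct v6addr *a, const struct v6addr *b)
{
	return memcmp(a->s6, b->s6, sizeof(a->s6)) == 0;
}
```

The lock here is exactly the cost being debated in this thread: every reader of the IPv6 address has to take it too.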


> > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > >  include/net/inet_hashtables.h |  2 +-
> > > >  net/dccp/ipv4.c               | 22 ++++------------------
> > > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > > >  net/ipv4/af_inet.c            | 11 +----------
> > > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > > >
> > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > index 3af1e927247d..ba06e8b52264 100644
> > > > --- a/include/net/inet_hashtables.h
> > > > +++ b/include/net/inet_hashtables.h
> > > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > > >   * rcv_saddr field should already have been updated when this is called.
> > > >   */
> > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > >
> > > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > index 40640c26680e..95e376e3b911 100644
> > > > --- a/net/dccp/ipv4.c
> > > > +++ b/net/dccp/ipv4.c
> > > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >  {
> > > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > >         struct inet_sock *inet = inet_sk(sk);
> > > >         struct dccp_sock *dp = dccp_sk(sk);
> > > >         __be16 orig_sport, orig_dport;
> > > > +       __be32 daddr, nexthop;
> > > >         struct flowi4 *fl4;
> > > >         struct rtable *rt;
> > > >         int err;
> > > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >                 daddr = fl4->daddr;
> > > >
> > > >         if (inet->inet_saddr == 0) {
> > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket =
> > > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > > -                                                     sock_net(sk),
> > > > -                                                     inet->inet_num);
> > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > -               }
> > > > -               inet->inet_saddr = fl4->saddr;
> > > > -       }
> > > > -
> > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > -
> > > > -       if (prev_addr_hashbucket) {
> > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > >                 if (err) {
> > > > -                       inet->inet_saddr = 0;
> > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > >                         ip_rt_put(rt);
> > > >                         return err;
> > > >                 }
> > > > +       } else {
> > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > >         }
> > > >
> > > >         inet->inet_dport = usin->sin_port;
> > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > index 626166cb6d7e..94c101ed57a9 100644
> > > > --- a/net/dccp/ipv6.c
> > > > +++ b/net/dccp/ipv6.c
> > > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >         }
> > > >
> > > >         if (saddr == NULL) {
> > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > -
> > > > -               if (icsk->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > > -                                                                    sk, sock_net(sk),
> > > > -                                                                    inet->inet_num);
> > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > -               }
> > > > -
> > > >                 saddr = &fl6.saddr;
> > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > -
> > > > -               if (prev_addr_hashbucket) {
> > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > -                       if (err) {
> > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > -                               goto failure;
> > > > -                       }
> > > > -               }
> > > > +
> > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > +               if (err)
> > > > +                       goto failure;
> > > >         }
> > > >
> > > >         /* set the source address */
> > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > index 4728087c42a5..0da679411330 100644
> > > > --- a/net/ipv4/af_inet.c
> > > > +++ b/net/ipv4/af_inet.c
> > > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > > >
> > > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > > >  {
> > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > > >         struct inet_sock *inet = inet_sk(sk);
> > > >         __be32 old_saddr = inet->inet_saddr;
> > > >         __be32 daddr = inet->inet_daddr;
> > > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > >                 return 0;
> > > >         }
> > > >
> > > > -       prev_addr_hashbucket =
> > > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > > -                                     sock_net(sk), inet->inet_num);
> > > > -
> > > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > > -
> > > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > > >         if (err) {
> > > > -               inet->inet_saddr = old_saddr;
> > > > -               inet->inet_rcv_saddr = old_saddr;
> > > >                 ip_rt_put(rt);
> > > >                 return err;
> > > >         }
> > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > index d745f962745e..dcb6bc918966 100644
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > > >  }
> > > >
> > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > > +{
> > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > +       if (family == AF_INET6) {
> > > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > > +       } else
> > > > +#endif
> > > > +       {
> > > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > > +       }
> > > > +}
> > > > +
> > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > >  {
> > > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > >         int port = inet_sk(sk)->inet_num;
> > > >         struct net *net = sock_net(sk);
> > > >
> > > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > > +               /* Not bind()ed before. */
> > > > +               inet_update_saddr(sk, saddr, family);
> > > > +               return 0;
> > > > +       }
> > > > +
> > > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > > >          * allocation fails.
> > > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > >         if (!new_tb2)
> > > >                 return -ENOMEM;
> > > >
> > > > +       /* Unlink first not to show the wrong address for other threads. */
> > > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > >
> > > > -       spin_lock_bh(&prev_saddr->lock);
> > > > +       spin_lock_bh(&head2->lock);
> > > >         __sk_del_bind2_node(sk);
> > > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > > -       spin_unlock_bh(&prev_saddr->lock);
> > > > +       spin_unlock_bh(&head2->lock);
> > > > +
> > > > +       inet_update_saddr(sk, saddr, family);
> > > > +
> > > > +       /* Update bhash2 bucket. */
> > > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > >
> > > >         spin_lock_bh(&head2->lock);
> > > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >  /* This will initiate an outgoing connection. */
> > > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >  {
> > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > >         struct inet_timewait_death_row *tcp_death_row;
> > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > >         struct inet_sock *inet = inet_sk(sk);
> > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > >         struct ip_options_rcu *inet_opt;
> > > >         struct net *net = sock_net(sk);
> > > >         __be16 orig_sport, orig_dport;
> > > > +       __be32 daddr, nexthop;
> > > >         struct flowi4 *fl4;
> > > >         struct rtable *rt;
> > > >         int err;
> > > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > >
> > > >         if (!inet->inet_saddr) {
> > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > -                                                                    sk, net, inet->inet_num);
> > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > -               }
> > > > -               inet->inet_saddr = fl4->saddr;
> > > > -       }
> > > > -
> > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > -
> > > > -       if (prev_addr_hashbucket) {
> > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > >                 if (err) {
> > > > -                       inet->inet_saddr = 0;
> > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > >                         ip_rt_put(rt);
> > > >                         return err;
> > > >                 }
> > > > +       } else {
> > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > >         }
> > > >
> > > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > index 81b396e5cf79..2f3ca3190d26 100644
> > > > --- a/net/ipv6/tcp_ipv6.c
> > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > >
> > > >         if (!saddr) {
> > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > -
> > > > -               if (icsk->icsk_bind2_hash) {
> > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > -                                                                    sk, net, inet->inet_num);
> > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > -               }
> > > >                 saddr = &fl6.saddr;
> > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > >
> > > > -               if (prev_addr_hashbucket) {
> > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > -                       if (err) {
> > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > -                               goto failure;
> > > > -                       }
> > > > -               }
> > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > +               if (err)
> > > > +                       goto failure;
> > > >         }
> > > >
> > > >         /* set the source address */
> > > > --
> > > > 2.30.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  2022-11-18  1:08         ` Kuniyuki Iwashima
@ 2022-11-18 19:02     ` Joanne Koong
  -1 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-18 18:56 UTC (permalink / raw)
  To: dccp

On Thu, Nov 17, 2022 at 5:08 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Joanne Koong <joannelkoong@gmail.com>
> Date:   Thu, 17 Nov 2022 16:55:59 -0800
> > On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From:   Joanne Koong <joannelkoong@gmail.com>
> > > Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > > > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > > > another thread iterating over the bhash2 bucket might see an inconsistent
> > >
> > > Sorry this should be just bhash       ^^^ here.
> > >
> > > > > address.
> > > > >
> > > > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> > > >
> > > > I'm not sure whether this patch is necessary and I'm curious to hear
> > > > your thoughts. There's no adverse effect that comes from updating the
> > > > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > > > code. Another thread can be iterating over the bhash2 bucket, but it
> > > > makes no difference whether they see this new address or not (e.g. when
> > > > they are iterating through the bucket, they are trying to check for bind
> > > > conflicts on another socket, and the sk having the new address doesn't
> > > > affect this). What are your thoughts?
> > >
> > > You are right, it seems I was confused.
> > >
> > > I was thinking that a lockless change of saddr could result in a data race;
> > > another process iterating over bhash might see a corrupted address.
> > >
> > > So, we need to acquire the bhash lock before updating saddr, and the
> > > related code should then live in inet_bhash2_update_saddr().
> > >
> > > But I seem to have forgotten to add the lock part... :p
> >
> > No worries! :) Is acquiring the bhash lock necessary before updating
> > saddr? I think the worst-case scenario (which would only happen very
> > rarely) is that another process iterating over bhash reads the address
> > at the exact moment it is being updated in this function, sees a
> > corrupted address, and that corrupted address happens to match its own
> > socket's address, causing it to reject the bind request.
> >
> > It doesn't seem like a big deal in the rare event that it happens. In
> > my opinion, it's not worth solving by making the common case slower
> > with the bhash lock.
> >
> > What are your thoughts?
>
> In that sense, inet_bhash2_update_saddr() is not the common case, I think.
>
> For the IPv4 case, we need not acquire the lock.  Adding READ_ONCE()
> and WRITE_ONCE() would be enough, but we cannot do so for the 16-byte
> IPv6 address.
>
> Also, I think netdev code often fixes such data races reported by
> KCSAN.

I'll leave the final decision on this up to you :)

My line of reasoning is that:

1) This case will be hit *very* rarely: a lot of things would need to
align. Not only would the read and write have to occur at the same
time, but the address would also have to get corrupted into the exact
address of the other socket.

2) There's no pernicious effect from this scenario; the worst that can
happen is that the other socket's bind request fails and it needs to
retry.

3) Grabbing the bhash lock opens the door to unpleasant cases that
would happen far more often than this one. In particular, I'm thinking
of the case where another socket is binding to the same port and can't
use bhash2 (e.g. it's binding to INADDR_ANY or IPV6_ADDR_MAPPED). That
socket will grab the bhash lock and go through every socket bound to
this port to check for a bind conflict, which can take a very long time
if there are many sockets. While that is happening, this connect() call
will be blocked waiting for the bhash lock to be released.
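Point 3 can be modelled with a single port-wide lock. The sketch below is hypothetical (port_bucket, scan_for_conflicts_locked() and update_saddr_try() are not kernel functions): a wildcard bind() holds the lock while walking every socket on the port, and a connect()-time saddr update that took the same lock would have to wait for the whole walk. It uses pthread_mutex_trylock() so that the blocking shows up as EBUSY instead of an actual wait.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of one bhash bucket: a single lock per port that
 * both the bind() conflict walk and (if added) connect() would take. */
struct port_bucket {
	pthread_mutex_t lock;
	size_t nr_socks;	/* sockets already bound to this port */
};

/* A wildcard bind() cannot use bhash2, so it checks every socket on the
 * port; the caller must hold pb->lock. Returns the number checked. */
static size_t scan_for_conflicts_locked(const struct port_bucket *pb)
{
	size_t checked = 0;

	for (size_t i = 0; i < pb->nr_socks; i++)
		checked++;	/* stand-in for a per-socket conflict check */
	return checked;
}

/* connect()-side saddr update that would need the same lock. trylock
 * reports EBUSY where a real caller would wait until the walk ends. */
static int update_saddr_try(struct port_bucket *pb)
{
	int rc = pthread_mutex_trylock(&pb->lock);

	assert(rc == 0 || rc == EBUSY);
	if (rc == 0) {
		/* ... the saddr update would happen here ... */
		pthread_mutex_unlock(&pb->lock);
	}
	return rc;
}
```

In the kernel the bhash bucket lock is a spinlock taken with spin_lock_bh(), so the waiter would spin rather than sleep, which makes a long conflict walk more costly for the blocked connect().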

>
>
> > > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > ---
> > > > >  include/net/inet_hashtables.h |  2 +-
> > > > >  net/dccp/ipv4.c               | 22 ++++------------------
> > > > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > > > >  net/ipv4/af_inet.c            | 11 +----------
> > > > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > > > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > > > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > > > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > > > >
> > > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > > index 3af1e927247d..ba06e8b52264 100644
> > > > > --- a/include/net/inet_hashtables.h
> > > > > +++ b/include/net/inet_hashtables.h
> > > > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > > > >   * rcv_saddr field should already have been updated when this is called.
> > > > >   */
> > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > > >
> > > > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > > > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > > index 40640c26680e..95e376e3b911 100644
> > > > > --- a/net/dccp/ipv4.c
> > > > > +++ b/net/dccp/ipv4.c
> > > > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > > > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >  {
> > > > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > >         struct dccp_sock *dp = dccp_sk(sk);
> > > > >         __be16 orig_sport, orig_dport;
> > > > > +       __be32 daddr, nexthop;
> > > > >         struct flowi4 *fl4;
> > > > >         struct rtable *rt;
> > > > >         int err;
> > > > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >                 daddr = fl4->daddr;
> > > > >
> > > > >         if (inet->inet_saddr == 0) {
> > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket =
> > > > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > > > -                                                     sock_net(sk),
> > > > > -                                                     inet->inet_num);
> > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > -               }
> > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > -       }
> > > > > -
> > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > -
> > > > > -       if (prev_addr_hashbucket) {
> > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > >                 if (err) {
> > > > > -                       inet->inet_saddr = 0;
> > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > >                         ip_rt_put(rt);
> > > > >                         return err;
> > > > >                 }
> > > > > +       } else {
> > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > >         }
> > > > >
> > > > >         inet->inet_dport = usin->sin_port;
> > > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > > index 626166cb6d7e..94c101ed57a9 100644
> > > > > --- a/net/dccp/ipv6.c
> > > > > +++ b/net/dccp/ipv6.c
> > > > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > >         }
> > > > >
> > > > >         if (saddr == NULL) {
> > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > -
> > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > > > -                                                                    sk, sock_net(sk),
> > > > > -                                                                    inet->inet_num);
> > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > -               }
> > > > > -
> > > > >                 saddr = &fl6.saddr;
> > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > > -
> > > > > -               if (prev_addr_hashbucket) {
> > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > -                       if (err) {
> > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > -                               goto failure;
> > > > > -                       }
> > > > > -               }
> > > > > +
> > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > +               if (err)
> > > > > +                       goto failure;
> > > > >         }
> > > > >
> > > > >         /* set the source address */
> > > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > > index 4728087c42a5..0da679411330 100644
> > > > > --- a/net/ipv4/af_inet.c
> > > > > +++ b/net/ipv4/af_inet.c
> > > > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > > > >
> > > > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > > > >  {
> > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > >         __be32 old_saddr = inet->inet_saddr;
> > > > >         __be32 daddr = inet->inet_daddr;
> > > > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > > >                 return 0;
> > > > >         }
> > > > >
> > > > > -       prev_addr_hashbucket =
> > > > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > > > -                                     sock_net(sk), inet->inet_num);
> > > > > -
> > > > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > > > -
> > > > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > > > >         if (err) {
> > > > > -               inet->inet_saddr = old_saddr;
> > > > > -               inet->inet_rcv_saddr = old_saddr;
> > > > >                 ip_rt_put(rt);
> > > > >                 return err;
> > > > >         }
> > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > index d745f962745e..dcb6bc918966 100644
> > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > > > >  }
> > > > >
> > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > > > +{
> > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > +       if (family == AF_INET6) {
> > > > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > > > +       } else
> > > > > +#endif
> > > > > +       {
> > > > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > > >  {
> > > > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > > > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > >         int port = inet_sk(sk)->inet_num;
> > > > >         struct net *net = sock_net(sk);
> > > > >
> > > > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > > > +               /* Not bind()ed before. */
> > > > > +               inet_update_saddr(sk, saddr, family);
> > > > > +               return 0;
> > > > > +       }
> > > > > +
> > > > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > > > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > > > >          * allocation fails.
> > > > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > >         if (!new_tb2)
> > > > >                 return -ENOMEM;
> > > > >
> > > > > +       /* Unlink first not to show the wrong address for other threads. */
> > > > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > >
> > > > > -       spin_lock_bh(&prev_saddr->lock);
> > > > > +       spin_lock_bh(&head2->lock);
> > > > >         __sk_del_bind2_node(sk);
> > > > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > > > -       spin_unlock_bh(&prev_saddr->lock);
> > > > > +       spin_unlock_bh(&head2->lock);
> > > > > +
> > > > > +       inet_update_saddr(sk, saddr, family);
> > > > > +
> > > > > +       /* Update bhash2 bucket. */
> > > > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > >
> > > > >         spin_lock_bh(&head2->lock);
> > > > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > >  /* This will initiate an outgoing connection. */
> > > > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >  {
> > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > >         struct inet_timewait_death_row *tcp_death_row;
> > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > > >         struct ip_options_rcu *inet_opt;
> > > > >         struct net *net = sock_net(sk);
> > > > >         __be16 orig_sport, orig_dport;
> > > > > +       __be32 daddr, nexthop;
> > > > >         struct flowi4 *fl4;
> > > > >         struct rtable *rt;
> > > > >         int err;
> > > > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > >
> > > > >         if (!inet->inet_saddr) {
> > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > -                                                                    sk, net, inet->inet_num);
> > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > -               }
> > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > -       }
> > > > > -
> > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > -
> > > > > -       if (prev_addr_hashbucket) {
> > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > >                 if (err) {
> > > > > -                       inet->inet_saddr = 0;
> > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > >                         ip_rt_put(rt);
> > > > >                         return err;
> > > > >                 }
> > > > > +       } else {
> > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > >         }
> > > > >
> > > > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > > index 81b396e5cf79..2f3ca3190d26 100644
> > > > > --- a/net/ipv6/tcp_ipv6.c
> > > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > >
> > > > >         if (!saddr) {
> > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > -
> > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > -                                                                    sk, net, inet->inet_num);
> > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > -               }
> > > > >                 saddr = &fl6.saddr;
> > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > >
> > > > > -               if (prev_addr_hashbucket) {
> > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > -                       if (err) {
> > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > -                               goto failure;
> > > > > -                       }
> > > > > -               }
> > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > +               if (err)
> > > > > +                       goto failure;
> > > > >         }
> > > > >
> > > > >         /* set the source address */
> > > > > --
> > > > > 2.30.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
@ 2022-11-18 19:02     ` Joanne Koong
  0 siblings, 0 replies; 36+ messages in thread
From: Joanne Koong @ 2022-11-18 19:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, martin.lau,
	mathew.j.martineau, netdev, pabeni, pengfei.xu, stephen,
	william.xuanziyang, yoshfuji

On Thu, Nov 17, 2022 at 5:08 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Joanne Koong <joannelkoong@gmail.com>
> Date:   Thu, 17 Nov 2022 16:55:59 -0800
> > On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From:   Joanne Koong <joannelkoong@gmail.com>
> > > Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > > > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > > > another thread iterating over the bhash2 bucket might see an inconsistent
> > >
> > > Sorry this should be just bhash       ^^^ here.
> > >
> > > > > address.
> > > > >
> > > > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> > > >
> > > > I'm not sure whether this patch is necessary and I'm curious to hear
> > > > your thoughts. There's no adverse effect that comes from updating the
> > > > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > > > code. Another thread can be iterating over the bhash2 bucket, but it
> > > > has no effect whether they see this new address or not (eg when they
> > > > are iterating through the bucket they are trying to check for bind
> > > > conflicts on another socket, and the sk having the new address doesn't
> > > > affect this). What are your thoughts?
> > >
> > > You are right, it seems I was confused.
> > >
> > > I was thinking that a lockless change of saddr could result in a data race;
> > > another process iterating over bhash might see a corrupted address.
> > >
> > > So, we need to acquire the bhash lock before updating saddr, and then
> > > related code should be in inet_bhash2_update_saddr().
> > >
> > > But I seem to have forgotten to add the lock part... :p
> >
> > No worries! :) Is acquiring the bhash lock necessary before updating
> > saddr? I think the worst case scenario (which would only happen very
> > rarely) is that there is another process iterating over bhash, that
> > process tries to access the address at the exact time the address is
> > being updated in this function, causing the other process to see the
> > corrupted address, that corrupted address matches that other process's
> > socket address, thus causing that other process to reject the bind
> > request.
> >
> > It doesn't seem like that is a big deal, in the rare event where that
> > would happen. In my opinion, it's not worth solving for by making the
> > common case slower by grabbing the bhash lock.
> >
> > What are your thoughts?
>
> In that sense, inet_bhash2_update_saddr() is not the common case, I think.
>
> For the IPv4 case, we need not acquire the lock.  Adding READ_ONCE()
> and WRITE_ONCE() would be enough, but we cannot do so for an IPv6 address.
>
> Also, I think netdev code often fixes such data races reported by
> KCSAN.

I'll leave the final decision on this up to you :)

My line of reasoning is that:

1) This case will be run into *very* rarely - a lot of things would
need to align, not only that the read and write occur at the same
time, but that the address gets corrupted to the exact address of the
other socket

2) There's no pernicious effect from this scenario; the worst thing
that happens is that the other socket's bind request fails and it'll
need to retry

3) Grabbing the bhash lock opens the door to unpleasant cases that
would happen a lot more commonly than this one. In particular, the
case I'm thinking of is where another socket is binding to the same
port and can't use bhash2 (e.g. it's binding on INADDR_ANY or
IPV6_ADDR_MAPPED); that socket will grab the bhash lock and go through
every socket bound to this port to check for a bind conflict (which can
take a very long time if there are many sockets). While that is
happening, this connect call will be blocked waiting for the bhash lock
to be released.

>
>
> > > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > ---
> > > > >  include/net/inet_hashtables.h |  2 +-
> > > > >  net/dccp/ipv4.c               | 22 ++++------------------
> > > > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > > > >  net/ipv4/af_inet.c            | 11 +----------
> > > > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > > > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > > > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > > > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > > > >
> > > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > > index 3af1e927247d..ba06e8b52264 100644
> > > > > --- a/include/net/inet_hashtables.h
> > > > > +++ b/include/net/inet_hashtables.h
> > > > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > > > >   * rcv_saddr field should already have been updated when this is called.
> > > > >   */
> > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > > >
> > > > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > > > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > > index 40640c26680e..95e376e3b911 100644
> > > > > --- a/net/dccp/ipv4.c
> > > > > +++ b/net/dccp/ipv4.c
> > > > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > > > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >  {
> > > > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > >         struct dccp_sock *dp = dccp_sk(sk);
> > > > >         __be16 orig_sport, orig_dport;
> > > > > +       __be32 daddr, nexthop;
> > > > >         struct flowi4 *fl4;
> > > > >         struct rtable *rt;
> > > > >         int err;
> > > > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >                 daddr = fl4->daddr;
> > > > >
> > > > >         if (inet->inet_saddr == 0) {
> > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket =
> > > > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > > > -                                                     sock_net(sk),
> > > > > -                                                     inet->inet_num);
> > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > -               }
> > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > -       }
> > > > > -
> > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > -
> > > > > -       if (prev_addr_hashbucket) {
> > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > >                 if (err) {
> > > > > -                       inet->inet_saddr = 0;
> > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > >                         ip_rt_put(rt);
> > > > >                         return err;
> > > > >                 }
> > > > > +       } else {
> > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > >         }
> > > > >
> > > > >         inet->inet_dport = usin->sin_port;
> > > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > > index 626166cb6d7e..94c101ed57a9 100644
> > > > > --- a/net/dccp/ipv6.c
> > > > > +++ b/net/dccp/ipv6.c
> > > > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > >         }
> > > > >
> > > > >         if (saddr == NULL) {
> > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > -
> > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > > > -                                                                    sk, sock_net(sk),
> > > > > -                                                                    inet->inet_num);
> > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > -               }
> > > > > -
> > > > >                 saddr = &fl6.saddr;
> > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > > -
> > > > > -               if (prev_addr_hashbucket) {
> > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > -                       if (err) {
> > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > -                               goto failure;
> > > > > -                       }
> > > > > -               }
> > > > > +
> > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > +               if (err)
> > > > > +                       goto failure;
> > > > >         }
> > > > >
> > > > >         /* set the source address */
> > > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > > index 4728087c42a5..0da679411330 100644
> > > > > --- a/net/ipv4/af_inet.c
> > > > > +++ b/net/ipv4/af_inet.c
> > > > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > > > >
> > > > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > > > >  {
> > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > >         __be32 old_saddr = inet->inet_saddr;
> > > > >         __be32 daddr = inet->inet_daddr;
> > > > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > > >                 return 0;
> > > > >         }
> > > > >
> > > > > -       prev_addr_hashbucket =
> > > > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > > > -                                     sock_net(sk), inet->inet_num);
> > > > > -
> > > > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > > > -
> > > > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > > > >         if (err) {
> > > > > -               inet->inet_saddr = old_saddr;
> > > > > -               inet->inet_rcv_saddr = old_saddr;
> > > > >                 ip_rt_put(rt);
> > > > >                 return err;
> > > > >         }
> > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > index d745f962745e..dcb6bc918966 100644
> > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > > > >  }
> > > > >
> > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > > > +{
> > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > +       if (family == AF_INET6) {
> > > > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > > > +       } else
> > > > > +#endif
> > > > > +       {
> > > > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > > >  {
> > > > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > > > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > >         int port = inet_sk(sk)->inet_num;
> > > > >         struct net *net = sock_net(sk);
> > > > >
> > > > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > > > +               /* Not bind()ed before. */
> > > > > +               inet_update_saddr(sk, saddr, family);
> > > > > +               return 0;
> > > > > +       }
> > > > > +
> > > > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > > > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > > > >          * allocation fails.
> > > > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > >         if (!new_tb2)
> > > > >                 return -ENOMEM;
> > > > >
> > > > > +       /* Unlink first not to show the wrong address for other threads. */
> > > > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > >
> > > > > -       spin_lock_bh(&prev_saddr->lock);
> > > > > +       spin_lock_bh(&head2->lock);
> > > > >         __sk_del_bind2_node(sk);
> > > > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > > > -       spin_unlock_bh(&prev_saddr->lock);
> > > > > +       spin_unlock_bh(&head2->lock);
> > > > > +
> > > > > +       inet_update_saddr(sk, saddr, family);
> > > > > +
> > > > > +       /* Update bhash2 bucket. */
> > > > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > >
> > > > >         spin_lock_bh(&head2->lock);
> > > > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > >  /* This will initiate an outgoing connection. */
> > > > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >  {
> > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > >         struct inet_timewait_death_row *tcp_death_row;
> > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > > >         struct ip_options_rcu *inet_opt;
> > > > >         struct net *net = sock_net(sk);
> > > > >         __be16 orig_sport, orig_dport;
> > > > > +       __be32 daddr, nexthop;
> > > > >         struct flowi4 *fl4;
> > > > >         struct rtable *rt;
> > > > >         int err;
> > > > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > >
> > > > >         if (!inet->inet_saddr) {
> > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > -                                                                    sk, net, inet->inet_num);
> > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > -               }
> > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > -       }
> > > > > -
> > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > -
> > > > > -       if (prev_addr_hashbucket) {
> > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > >                 if (err) {
> > > > > -                       inet->inet_saddr = 0;
> > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > >                         ip_rt_put(rt);
> > > > >                         return err;
> > > > >                 }
> > > > > +       } else {
> > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > >         }
> > > > >
> > > > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > > index 81b396e5cf79..2f3ca3190d26 100644
> > > > > --- a/net/ipv6/tcp_ipv6.c
> > > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > >
> > > > >         if (!saddr) {
> > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > -
> > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > -                                                                    sk, net, inet->inet_num);
> > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > -               }
> > > > >                 saddr = &fl6.saddr;
> > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > >
> > > > > -               if (prev_addr_hashbucket) {
> > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > -                       if (err) {
> > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > -                               goto failure;
> > > > > -                       }
> > > > > -               }
> > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > +               if (err)
> > > > > +                       goto failure;
> > > > >         }
> > > > >
> > > > >         /* set the source address */
> > > > > --
> > > > > 2.30.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
  2022-11-16 22:28   ` Kuniyuki Iwashima
@ 2022-11-18 19:58       ` Kuniyuki Iwashima
  -1 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-18 19:58 UTC (permalink / raw)
  To: joannelkoong
  Cc: acme, davem, dccp, dsahern, edumazet, kuba, kuni1840, kuniyu,
	martin.lau, mathew.j.martineau, netdev, pabeni, pengfei.xu,
	stephen, william.xuanziyang, yoshfuji

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Fri, 18 Nov 2022 11:02:10 -0800
> On Thu, Nov 17, 2022 at 5:08 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Joanne Koong <joannelkoong@gmail.com>
> > Date:   Thu, 17 Nov 2022 16:55:59 -0800
> > > On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From:   Joanne Koong <joannelkoong@gmail.com>
> > > > Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > > > > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > > > > another thread iterating over the bhash2 bucket might see an inconsistent
> > > >
> > > > Sorry this should be just bhash       ^^^ here.
> > > >
> > > > > > address.
> > > > > >
> > > > > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> > > > >
> > > > > I'm not sure whether this patch is necessary and I'm curious to hear
> > > > > your thoughts. There's no adverse effect that comes from updating the
> > > > > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > > > > code. Another thread can be iterating over the bhash2 bucket, but it
> > > > > has no effect whether they see this new address or not (eg when they
> > > > > are iterating through the bucket they are trying to check for bind
> > > > > conflicts on another socket, and the sk having the new address doesn't
> > > > > affect this). What are your thoughts?
> > > >
> > > > You are right, it seems I was confused.
> > > >
> > > > I was thinking that a lockless change of saddr could result in a data race;
> > > > another process iterating over bhash might see a corrupted address.
> > > >
> > > > So, we need to acquire the bhash lock before updating saddr, and then
> > > > related code should be in inet_bhash2_update_saddr().
> > > >
> > > > But I seem to have forgotten to add the lock part... :p
> > >
> > > No worries! :) Is acquiring the bhash lock necessary before updating
> > > saddr? I think the worst case scenario (which would only happen very
> > > rarely) is that there is another process iterating over bhash, that
> > > process tries to access the address at the exact time the address is
> > > being updated in this function, causing the other process to see the
> > > corrupted address, that corrupted address matches that other process's
> > > socket address, thus causing that other process to reject the bind
> > > request.
> > >
> > > It doesn't seem like that is a big deal, in the rare event where that
> > > would happen. In my opinion, it's not worth solving for by making the
> > > common case slower by grabbing the bhash lock.
> > >
> > > What are your thoughts?
> >
> > In that sense, inet_bhash2_update_saddr() is not the common case, I think.
> >
> > For the IPv4 case, we need not acquire the lock.  Adding READ_ONCE()
> > and WRITE_ONCE() would be enough, but we cannot do so for an IPv6
> > address, which is 128 bits and cannot be read or written in one access.
> >
> > Also, I think netdev code often fixes such data races reported by
> > KCSAN.
> 
> I'll leave the final decision on this up to you :)

Thank you.

I understand what you are saying, but I still think this is needed.
But the final decision will be made by maintainers, so let's see what
they do :)

This series is now marked as Changes Requested in patchwork, so I'll
respin v3 with the bhash lock.
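
The IPv4/IPv6 distinction in the exchange above comes down to access width: a 32-bit IPv4 address can be stored and loaded with one marked access, while a 128-bit IPv6 address cannot, so a lockless reader can observe part of the old address and part of the new one. Below is a minimal user-space sketch, assuming GCC or Clang. The `READ_ONCE()`/`WRITE_ONCE()` macros are simplified stand-ins for the kernel's, and `struct v6addr`, `v6_store_words()` and the other helpers are hypothetical. The torn read is simulated deterministically by stopping the writer after two words rather than by racing real threads; the sketch shows the problem only, not the locking fix.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified user-space stand-ins for the kernel's READ_ONCE()/WRITE_ONCE()
 * (GCC/Clang, because of __typeof__). The volatile cast makes the compiler
 * emit exactly one load or store of the object; for a naturally aligned
 * 32-bit value that is, in practice, a single access on mainstream CPUs. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* IPv4: the whole source address fits in one 32-bit word. */
static void v4_set_saddr(uint32_t *saddr, uint32_t new_saddr)
{
	WRITE_ONCE(*saddr, new_saddr);
}

static uint32_t v4_get_saddr(uint32_t *saddr)
{
	return READ_ONCE(*saddr);
}

/* IPv6: 128 bits, modelled here as four 32-bit words (hypothetical type). */
struct v6addr {
	uint32_t w[4];
};

/* Simulates a writer that has stored only the first nwords words of the
 * new address at the moment a lockless reader looks at it. */
static void v6_store_words(struct v6addr *dst, const struct v6addr *src,
			   int nwords)
{
	assert(nwords >= 0 && nwords <= 4);
	for (int i = 0; i < nwords; i++)
		WRITE_ONCE(dst->w[i], src->w[i]);
}

/* Lockless reader: every word is read once, but the four reads are not
 * atomic as a group, so the result can mix old and new words. */
static struct v6addr v6_snapshot(struct v6addr *a)
{
	struct v6addr copy;

	for (int i = 0; i < 4; i++)
		copy.w[i] = READ_ONCE(a->w[i]);
	return copy;
}

static int v6_equal(const struct v6addr *a, const struct v6addr *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Storing two of the four words and then taking a snapshot yields an address equal to neither the old nor the new one, which is the "corrupted address" case discussed in the thread; the IPv4 helpers have no such intermediate state. Avoiding it for IPv6 requires the reader and the writer to serialise on a common lock.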


> My line of reasoning is that:
> 
> 1) This case will be run into *very* rarely - a lot of things would
> need to align, not only that the read and write occur at the same
> time, but that the address gets corrupted to the exact address of the
> other socket
> 
> 2) There's no pernicious effect from this scenario; the worst thing
> that happens is that the other socket's bind request fails and it'll
> need to retry

I think this is the same kind of issue as the one syzbot reported, in
that bhash and bhash2 become inconsistent.  The only difference is how
long the inconsistency lasts.  The inconsistency is the problem, not the
fact that WARN_ON() fired; WARN_ON() only caught it.


> 3) Grabbing the bhash lock opens the door to unpleasant cases that
> would happen a lot more commonly than this one. In particular, the
> case I'm thinking of is where another socket is binding to the same
> port and can't use bhash2 (eg they're binding on INADDR_ANY or
> IPV6_ADDR_MAPPED); this socket will grab the bhash lock, go through
> every socket binded to this port to check for a bind conflict (can
> take a very long time if there are many sockets), while that is
> happening this connect call will be blocked waiting for the bhash lock
> to be released.

I think bind(INADDR_ANY) + connect() is not a common case and should
rarely affect common cases.  And if it does affect the common case, it
is what it is: slowing down the common case is not a good reason to
leave a bug in the uncommon path.  It means we need other
improvements/optimisations.  Fast code that stays fast only by leaving
bugs unfixed is not fast code.

Thank you again for discussion!


> > > > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > ---
> > > > > >  include/net/inet_hashtables.h |  2 +-
> > > > > >  net/dccp/ipv4.c               | 22 ++++------------------
> > > > > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > > > > >  net/ipv4/af_inet.c            | 11 +----------
> > > > > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > > > > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > > > > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > > > > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > > > > >
> > > > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > > > index 3af1e927247d..ba06e8b52264 100644
> > > > > > --- a/include/net/inet_hashtables.h
> > > > > > +++ b/include/net/inet_hashtables.h
> > > > > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > > > > >   * rcv_saddr field should already have been updated when this is called.
> > > > > >   */
> > > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > > > >
> > > > > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > > > > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > > > index 40640c26680e..95e376e3b911 100644
> > > > > > --- a/net/dccp/ipv4.c
> > > > > > +++ b/net/dccp/ipv4.c
> > > > > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > > > > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >  {
> > > > > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > > >         struct dccp_sock *dp = dccp_sk(sk);
> > > > > >         __be16 orig_sport, orig_dport;
> > > > > > +       __be32 daddr, nexthop;
> > > > > >         struct flowi4 *fl4;
> > > > > >         struct rtable *rt;
> > > > > >         int err;
> > > > > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >                 daddr = fl4->daddr;
> > > > > >
> > > > > >         if (inet->inet_saddr == 0) {
> > > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket =
> > > > > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > > > > -                                                     sock_net(sk),
> > > > > > -                                                     inet->inet_num);
> > > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > > -               }
> > > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > > -       }
> > > > > > -
> > > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > > -
> > > > > > -       if (prev_addr_hashbucket) {
> > > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > > >                 if (err) {
> > > > > > -                       inet->inet_saddr = 0;
> > > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > > >                         ip_rt_put(rt);
> > > > > >                         return err;
> > > > > >                 }
> > > > > > +       } else {
> > > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > >         }
> > > > > >
> > > > > >         inet->inet_dport = usin->sin_port;
> > > > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > > > index 626166cb6d7e..94c101ed57a9 100644
> > > > > > --- a/net/dccp/ipv6.c
> > > > > > +++ b/net/dccp/ipv6.c
> > > > > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > > >         }
> > > > > >
> > > > > >         if (saddr == NULL) {
> > > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > > -
> > > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > > > > -                                                                    sk, sock_net(sk),
> > > > > > -                                                                    inet->inet_num);
> > > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > > -               }
> > > > > > -
> > > > > >                 saddr = &fl6.saddr;
> > > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > > > -
> > > > > > -               if (prev_addr_hashbucket) {
> > > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > -                       if (err) {
> > > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > > -                               goto failure;
> > > > > > -                       }
> > > > > > -               }
> > > > > > +
> > > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > > +               if (err)
> > > > > > +                       goto failure;
> > > > > >         }
> > > > > >
> > > > > >         /* set the source address */
> > > > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > > > index 4728087c42a5..0da679411330 100644
> > > > > > --- a/net/ipv4/af_inet.c
> > > > > > +++ b/net/ipv4/af_inet.c
> > > > > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > > > > >
> > > > > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > > > > >  {
> > > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > > >         __be32 old_saddr = inet->inet_saddr;
> > > > > >         __be32 daddr = inet->inet_daddr;
> > > > > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > > > >                 return 0;
> > > > > >         }
> > > > > >
> > > > > > -       prev_addr_hashbucket =
> > > > > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > > > > -                                     sock_net(sk), inet->inet_num);
> > > > > > -
> > > > > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > > > > -
> > > > > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > > > > >         if (err) {
> > > > > > -               inet->inet_saddr = old_saddr;
> > > > > > -               inet->inet_rcv_saddr = old_saddr;
> > > > > >                 ip_rt_put(rt);
> > > > > >                 return err;
> > > > > >         }
> > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > index d745f962745e..dcb6bc918966 100644
> > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > > > > >  }
> > > > > >
> > > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > > > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > > > > +{
> > > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > > +       if (family == AF_INET6) {
> > > > > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > > > > +       } else
> > > > > > +#endif
> > > > > > +       {
> > > > > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > > > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > > > > +       }
> > > > > > +}
> > > > > > +
> > > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > > > >  {
> > > > > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > > > > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > > > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > > >         int port = inet_sk(sk)->inet_num;
> > > > > >         struct net *net = sock_net(sk);
> > > > > >
> > > > > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > > > > +               /* Not bind()ed before. */
> > > > > > +               inet_update_saddr(sk, saddr, family);
> > > > > > +               return 0;
> > > > > > +       }
> > > > > > +
> > > > > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > > > > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > > > > >          * allocation fails.
> > > > > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > > >         if (!new_tb2)
> > > > > >                 return -ENOMEM;
> > > > > >
> > > > > > +       /* Unlink first not to show the wrong address for other threads. */
> > > > > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > > >
> > > > > > -       spin_lock_bh(&prev_saddr->lock);
> > > > > > +       spin_lock_bh(&head2->lock);
> > > > > >         __sk_del_bind2_node(sk);
> > > > > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > > > > -       spin_unlock_bh(&prev_saddr->lock);
> > > > > > +       spin_unlock_bh(&head2->lock);
> > > > > > +
> > > > > > +       inet_update_saddr(sk, saddr, family);
> > > > > > +
> > > > > > +       /* Update bhash2 bucket. */
> > > > > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > > >
> > > > > >         spin_lock_bh(&head2->lock);
> > > > > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > > >  /* This will initiate an outgoing connection. */
> > > > > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >  {
> > > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > > >         struct inet_timewait_death_row *tcp_death_row;
> > > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > > > >         struct ip_options_rcu *inet_opt;
> > > > > >         struct net *net = sock_net(sk);
> > > > > >         __be16 orig_sport, orig_dport;
> > > > > > +       __be32 daddr, nexthop;
> > > > > >         struct flowi4 *fl4;
> > > > > >         struct rtable *rt;
> > > > > >         int err;
> > > > > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > > >
> > > > > >         if (!inet->inet_saddr) {
> > > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > > -                                                                    sk, net, inet->inet_num);
> > > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > > -               }
> > > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > > -       }
> > > > > > -
> > > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > > -
> > > > > > -       if (prev_addr_hashbucket) {
> > > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > > >                 if (err) {
> > > > > > -                       inet->inet_saddr = 0;
> > > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > > >                         ip_rt_put(rt);
> > > > > >                         return err;
> > > > > >                 }
> > > > > > +       } else {
> > > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > >         }
> > > > > >
> > > > > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > > > index 81b396e5cf79..2f3ca3190d26 100644
> > > > > > --- a/net/ipv6/tcp_ipv6.c
> > > > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > > >
> > > > > >         if (!saddr) {
> > > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > > -
> > > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > > -                                                                    sk, net, inet->inet_num);
> > > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > > -               }
> > > > > >                 saddr = &fl6.saddr;
> > > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > > >
> > > > > > -               if (prev_addr_hashbucket) {
> > > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > -                       if (err) {
> > > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > > -                               goto failure;
> > > > > > -                       }
> > > > > > -               }
> > > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > > +               if (err)
> > > > > > +                       goto failure;
> > > > > >         }
> > > > > >
> > > > > >         /* set the source address */
> > > > > > --
> > > > > > 2.30.2
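
After this patch, inet_bhash2_update_saddr() unlinks the socket from the bhash2 bucket selected by its old address, then updates saddr, then links it into the bucket selected by the new address, taking the relevant bucket lock for each list operation. That ordering can be modelled in user space as below. This is a hypothetical toy, not kernel code: `struct toy_sock`, `struct toy_table` and the `toy_*()` helpers are invented for illustration, a pthread mutex stands in for the bucket spinlock, the hash is arbitrary, and the bind2 bucket pre-allocation and the bhash lock discussed in the thread are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 16	/* hypothetical table size */

/* Hypothetical socket: only the fields this ordering needs. */
struct toy_sock {
	uint32_t saddr;		/* IPv4 source address, host byte order */
	uint16_t port;		/* bound local port */
	struct toy_sock *next;	/* chain within one bucket */
};

struct toy_bucket {
	pthread_mutex_t lock;	/* stands in for the bucket spinlock */
	struct toy_sock *head;
};

struct toy_table {
	struct toy_bucket b[TOY_NBUCKETS];
};

/* Hypothetical (port, addr) hash, standing in for inet_bhashfn_portaddr(). */
static struct toy_bucket *toy_bucket_of(struct toy_table *t, uint16_t port,
					uint32_t addr)
{
	return &t->b[(port ^ addr ^ (addr >> 16)) % TOY_NBUCKETS];
}

static void toy_table_init(struct toy_table *t)
{
	for (size_t i = 0; i < TOY_NBUCKETS; i++) {
		pthread_mutex_init(&t->b[i].lock, NULL);
		t->b[i].head = NULL;
	}
}

/* Link sk into the bucket selected by its current saddr. */
static void toy_link(struct toy_table *t, struct toy_sock *sk)
{
	struct toy_bucket *head = toy_bucket_of(t, sk->port, sk->saddr);

	pthread_mutex_lock(&head->lock);
	sk->next = head->head;
	head->head = sk;
	pthread_mutex_unlock(&head->lock);
}

/* Unlink sk from the bucket selected by its *current* saddr. */
static void toy_unlink(struct toy_table *t, struct toy_sock *sk)
{
	struct toy_bucket *head = toy_bucket_of(t, sk->port, sk->saddr);
	struct toy_sock **pp;
	int found = 0;

	pthread_mutex_lock(&head->lock);
	for (pp = &head->head; *pp; pp = &(*pp)->next) {
		if (*pp == sk) {
			*pp = sk->next;
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&head->lock);
	/* Fires if saddr was changed before unlinking: wrong bucket searched. */
	assert(found);
	(void)found;
	sk->next = NULL;
}

/* Unlink under the old bucket's lock, change the address while sk is in
 * no bucket (so no bucket walker can see it), then relink under the lock
 * of the bucket the new address selects. */
static void toy_update_saddr(struct toy_table *t, struct toy_sock *sk,
			     uint32_t new_saddr)
{
	toy_unlink(t, sk);
	sk->saddr = new_saddr;
	toy_link(t, sk);
}
```

Because toy_unlink() derives the old bucket from the socket's current address, it has to run before saddr changes; computing the old bucket from the socket itself, before the update, is also what lets the new signature drop the prev_saddr bucket argument. In this toy, linking a socket with port 8080 and saddr 0 and then calling toy_update_saddr() with 0x0a000001 leaves the socket at the head of the new bucket and the old bucket empty.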

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket
@ 2022-11-18 19:58       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 36+ messages in thread
From: Kuniyuki Iwashima @ 2022-11-18 19:58 UTC (permalink / raw)
  To: dccp

From:   Joanne Koong <joannelkoong@gmail.com>
Date:   Fri, 18 Nov 2022 11:02:10 -0800
> On Thu, Nov 17, 2022 at 5:08 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Joanne Koong <joannelkoong@gmail.com>
> > Date:   Thu, 17 Nov 2022 16:55:59 -0800
> > > On Thu, Nov 17, 2022 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From:   Joanne Koong <joannelkoong@gmail.com>
> > > > Date:   Thu, 17 Nov 2022 13:32:18 -0800
> > > > > On Wed, Nov 16, 2022 at 2:29 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > Currently, we update saddr before calling inet_bhash2_update_saddr(), so
> > > > > > another thread iterating over the bhash2 bucket might see an inconsistent
> > > >
> > > > Sorry this should be just bhash       ^^^ here.
> > > >
> > > > > > address.
> > > > > >
> > > > > > Let's update saddr after unlinking sk from the old bhash2 bucket.
> > > > >
> > > > > I'm not sure whether this patch is necessary and I'm curious to hear
> > > > > your thoughts. There's no adverse effect that comes from updating the
> > > > > sk's saddr before calling inet_bhash2_update_saddr() in the current
> > > > > code. Another thread can be iterating over the bhash2 bucket, but it
> > > > > has no effect whether they see this new address or not (eg when they
> > > > > are iterating through the bucket they are trying to check for bind
> > > > > conflicts on another socket, and the sk having the new address doesn't
> > > > > affect this). What are your thoughts?
> > > >
> > > > You are right, it seems I was confused.
> > > >
> > > > I was thinking that a lockless change of saddr could result in a data race;
> > > > another process iterating over bhash might see a corrupted address.
> > > >
> > > > So, we need to acquire the bhash lock before updating saddr, and then
> > > > related code should be in inet_bhash2_update_saddr().
> > > >
> > > > But I seem to have forgotten to add the lock part... :p
> > >
> > > No worries! :) Is acquiring the bhash lock necessary before updating
> > > saddr? I think the worst case scenario (which would only happen very
> > > rarely) is that there is another process iterating over bhash, that
> > > process tries to access the address at the exact time the address is
> > > being updated in this function, causing the other process to see the
> > > corrupted address, that corrupted address matches that other process's
> > > socket address, thus causing that other process to reject the bind
> > > request.
> > >
> > > It doesn't seem like that is a big deal, in the rare event where that
> > > would happen. In my opinion, it's not worth solving for by making the
> > > common case slower by grabbing the bhash lock.
> > >
> > > What are your thoughts?
> >
> > In that sense, inet_bhash2_update_saddr() is not the common case, I think.
> >
> > For the IPv4 case, we need not acquire the lock.  Adding READ_ONCE()
> > and WRITE_ONCE() would be enough, but we cannot do so for an IPv6
> > address, which is 128 bits and cannot be read or written in one access.
> >
> > Also, I think netdev code often fixes such data races reported by
> > KCSAN.
> 
> I'll leave the final decision on this up to you :)

Thank you.

I understand what you are saying, but I still think this is needed.
But the final decision will be made by maintainers, so let's see what
they do :)

This series is now marked as Changes Requested in patchwork, so I'll
respin v3 with the bhash lock.


> My line of reasoning is that:
> 
> 1) This case will be run into *very* rarely - a lot of things would
> need to align, not only that the read and write occur at the same
> time, but that the address gets corrupted to the exact address of the
> other socket
> 
> 2) There's no pernicious effect from this scenario; the worst thing
> that happens is that the other socket's bind request fails and it'll
> need to retry

I think this is the same kind of issue as the one syzbot reported, in
that bhash and bhash2 become inconsistent.  The only difference is how
long the inconsistency lasts.  The inconsistency is the problem, not the
fact that WARN_ON() fired; WARN_ON() only caught it.


> 3) Grabbing the bhash lock opens the door to unpleasant cases that
> would happen a lot more commonly than this one. In particular, the
> case I'm thinking of is where another socket is binding to the same
> port and can't use bhash2 (eg they're binding on INADDR_ANY or
> IPV6_ADDR_MAPPED); this socket will grab the bhash lock, go through
> every socket bound to this port to check for a bind conflict (can
> take a very long time if there are many sockets), while that is
> happening this connect call will be blocked waiting for the bhash lock
> to be released.

I think bind(INADDR_ANY) + connect() is not a common case and should
rarely affect common cases.  And if it does affect the common case, it
is what it is: slowing down the common case is not a good reason to
leave a bug in the uncommon path.  It means we need other
improvements/optimisations.  Fast code that stays fast only by leaving
bugs unfixed is not fast code.

Thank you again for discussion!


> > > > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > ---
> > > > > >  include/net/inet_hashtables.h |  2 +-
> > > > > >  net/dccp/ipv4.c               | 22 ++++------------------
> > > > > >  net/dccp/ipv6.c               | 23 ++++-------------------
> > > > > >  net/ipv4/af_inet.c            | 11 +----------
> > > > > >  net/ipv4/inet_hashtables.c    | 31 ++++++++++++++++++++++++++++---
> > > > > >  net/ipv4/tcp_ipv4.c           | 20 ++++----------------
> > > > > >  net/ipv6/tcp_ipv6.c           | 19 +++----------------
> > > > > >  7 files changed, 45 insertions(+), 83 deletions(-)
> > > > > >
> > > > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > > > > > index 3af1e927247d..ba06e8b52264 100644
> > > > > > --- a/include/net/inet_hashtables.h
> > > > > > +++ b/include/net/inet_hashtables.h
> > > > > > @@ -281,7 +281,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > > >   * sk_v6_rcv_saddr (ipv6) changes after it has been binded. The socket's
> > > > > >   * rcv_saddr field should already have been updated when this is called.
> > > > > >   */
> > > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk);
> > > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family);
> > > > > >
> > > > > >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > > > > >                     struct inet_bind2_bucket *tb2, unsigned short port);
> > > > > > diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> > > > > > index 40640c26680e..95e376e3b911 100644
> > > > > > --- a/net/dccp/ipv4.c
> > > > > > +++ b/net/dccp/ipv4.c
> > > > > > @@ -45,11 +45,10 @@ static unsigned int dccp_v4_pernet_id __read_mostly;
> > > > > >  int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >  {
> > > > > >         const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > > >         struct dccp_sock *dp = dccp_sk(sk);
> > > > > >         __be16 orig_sport, orig_dport;
> > > > > > +       __be32 daddr, nexthop;
> > > > > >         struct flowi4 *fl4;
> > > > > >         struct rtable *rt;
> > > > > >         int err;
> > > > > > @@ -91,26 +90,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >                 daddr = fl4->daddr;
> > > > > >
> > > > > >         if (inet->inet_saddr == 0) {
> > > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket =
> > > > > > -                               inet_bhashfn_portaddr(&dccp_hashinfo, sk,
> > > > > > -                                                     sock_net(sk),
> > > > > > -                                                     inet->inet_num);
> > > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > > -               }
> > > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > > -       }
> > > > > > -
> > > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > > -
> > > > > > -       if (prev_addr_hashbucket) {
> > > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > > >                 if (err) {
> > > > > > -                       inet->inet_saddr = 0;
> > > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > > >                         ip_rt_put(rt);
> > > > > >                         return err;
> > > > > >                 }
> > > > > > +       } else {
> > > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > >         }
> > > > > >
> > > > > >         inet->inet_dport = usin->sin_port;
> > > > > > diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> > > > > > index 626166cb6d7e..94c101ed57a9 100644
> > > > > > --- a/net/dccp/ipv6.c
> > > > > > +++ b/net/dccp/ipv6.c
> > > > > > @@ -934,26 +934,11 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > > >         }
> > > > > >
> > > > > >         if (saddr == NULL) {
> > > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > > -
> > > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(&dccp_hashinfo,
> > > > > > -                                                                    sk, sock_net(sk),
> > > > > > -                                                                    inet->inet_num);
> > > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > > -               }
> > > > > > -
> > > > > >                 saddr = &fl6.saddr;
> > > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > > > -
> > > > > > -               if (prev_addr_hashbucket) {
> > > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > -                       if (err) {
> > > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > > -                               goto failure;
> > > > > > -                       }
> > > > > > -               }
> > > > > > +
> > > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > > +               if (err)
> > > > > > +                       goto failure;
> > > > > >         }
> > > > > >
> > > > > >         /* set the source address */
> > > > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > > > index 4728087c42a5..0da679411330 100644
> > > > > > --- a/net/ipv4/af_inet.c
> > > > > > +++ b/net/ipv4/af_inet.c
> > > > > > @@ -1230,7 +1230,6 @@ EXPORT_SYMBOL(inet_unregister_protosw);
> > > > > >
> > > > > >  static int inet_sk_reselect_saddr(struct sock *sk)
> > > > > >  {
> > > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket;
> > > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > > >         __be32 old_saddr = inet->inet_saddr;
> > > > > >         __be32 daddr = inet->inet_daddr;
> > > > > > @@ -1260,16 +1259,8 @@ static int inet_sk_reselect_saddr(struct sock *sk)
> > > > > >                 return 0;
> > > > > >         }
> > > > > >
> > > > > > -       prev_addr_hashbucket =
> > > > > > -               inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
> > > > > > -                                     sock_net(sk), inet->inet_num);
> > > > > > -
> > > > > > -       inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
> > > > > > -
> > > > > > -       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > +       err = inet_bhash2_update_saddr(sk, &new_saddr, AF_INET);
> > > > > >         if (err) {
> > > > > > -               inet->inet_saddr = old_saddr;
> > > > > > -               inet->inet_rcv_saddr = old_saddr;
> > > > > >                 ip_rt_put(rt);
> > > > > >                 return err;
> > > > > >         }
> > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > index d745f962745e..dcb6bc918966 100644
> > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > @@ -858,7 +858,20 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
> > > > > >         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > > > > >  }
> > > > > >
> > > > > > -int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
> > > > > > +static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> > > > > > +{
> > > > > > +#if IS_ENABLED(CONFIG_IPV6)
> > > > > > +       if (family == AF_INET6) {
> > > > > > +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> > > > > > +       } else
> > > > > > +#endif
> > > > > > +       {
> > > > > > +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> > > > > > +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> > > > > > +       }
> > > > > > +}
> > > > > > +
> > > > > > +int inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family)
> > > > > >  {
> > > > > >         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> > > > > >         struct inet_bind2_bucket *tb2, *new_tb2;
> > > > > > @@ -867,6 +880,12 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > > >         int port = inet_sk(sk)->inet_num;
> > > > > >         struct net *net = sock_net(sk);
> > > > > >
> > > > > > +       if (!inet_csk(sk)->icsk_bind2_hash) {
> > > > > > +               /* Not bind()ed before. */
> > > > > > +               inet_update_saddr(sk, saddr, family);
> > > > > > +               return 0;
> > > > > > +       }
> > > > > > +
> > > > > >         /* Allocate a bind2 bucket ahead of time to avoid permanently putting
> > > > > >          * the bhash2 table in an inconsistent state if a new tb2 bucket
> > > > > >          * allocation fails.
> > > > > > @@ -875,12 +894,18 @@ int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct soc
> > > > > >         if (!new_tb2)
> > > > > >                 return -ENOMEM;
> > > > > >
> > > > > > +       /* Unlink first not to show the wrong address for other threads. */
> > > > > >         head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > > >
> > > > > > -       spin_lock_bh(&prev_saddr->lock);
> > > > > > +       spin_lock_bh(&head2->lock);
> > > > > >         __sk_del_bind2_node(sk);
> > > > > >         inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
> > > > > > -       spin_unlock_bh(&prev_saddr->lock);
> > > > > > +       spin_unlock_bh(&head2->lock);
> > > > > > +
> > > > > > +       inet_update_saddr(sk, saddr, family);
> > > > > > +
> > > > > > +       /* Update bhash2 bucket. */
> > > > > > +       head2 = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > > > > >
> > > > > >         spin_lock_bh(&head2->lock);
> > > > > >         tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
> > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > > index 6a3a732b584d..23dd7e9df2d5 100644
> > > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > > @@ -199,15 +199,14 @@ static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > > >  /* This will initiate an outgoing connection. */
> > > > > >  int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >  {
> > > > > > -       struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > >         struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
> > > > > >         struct inet_timewait_death_row *tcp_death_row;
> > > > > > -       __be32 daddr, nexthop, prev_sk_rcv_saddr;
> > > > > >         struct inet_sock *inet = inet_sk(sk);
> > > > > >         struct tcp_sock *tp = tcp_sk(sk);
> > > > > >         struct ip_options_rcu *inet_opt;
> > > > > >         struct net *net = sock_net(sk);
> > > > > >         __be16 orig_sport, orig_dport;
> > > > > > +       __be32 daddr, nexthop;
> > > > > >         struct flowi4 *fl4;
> > > > > >         struct rtable *rt;
> > > > > >         int err;
> > > > > > @@ -251,24 +250,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> > > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > > >
> > > > > >         if (!inet->inet_saddr) {
> > > > > > -               if (inet_csk(sk)->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > > -                                                                    sk, net, inet->inet_num);
> > > > > > -                       prev_sk_rcv_saddr = sk->sk_rcv_saddr;
> > > > > > -               }
> > > > > > -               inet->inet_saddr = fl4->saddr;
> > > > > > -       }
> > > > > > -
> > > > > > -       sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > > -
> > > > > > -       if (prev_addr_hashbucket) {
> > > > > > -               err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > +               err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
> > > > > >                 if (err) {
> > > > > > -                       inet->inet_saddr = 0;
> > > > > > -                       sk_rcv_saddr_set(sk, prev_sk_rcv_saddr);
> > > > > >                         ip_rt_put(rt);
> > > > > >                         return err;
> > > > > >                 }
> > > > > > +       } else {
> > > > > > +               sk_rcv_saddr_set(sk, inet->inet_saddr);
> > > > > >         }
> > > > > >
> > > > > >         if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
> > > > > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > > > > index 81b396e5cf79..2f3ca3190d26 100644
> > > > > > --- a/net/ipv6/tcp_ipv6.c
> > > > > > +++ b/net/ipv6/tcp_ipv6.c
> > > > > > @@ -292,24 +292,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> > > > > >         tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
> > > > > >
> > > > > >         if (!saddr) {
> > > > > > -               struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
> > > > > > -               struct in6_addr prev_v6_rcv_saddr;
> > > > > > -
> > > > > > -               if (icsk->icsk_bind2_hash) {
> > > > > > -                       prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
> > > > > > -                                                                    sk, net, inet->inet_num);
> > > > > > -                       prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > > > > > -               }
> > > > > >                 saddr = &fl6.saddr;
> > > > > > -               sk->sk_v6_rcv_saddr = *saddr;
> > > > > >
> > > > > > -               if (prev_addr_hashbucket) {
> > > > > > -                       err = inet_bhash2_update_saddr(prev_addr_hashbucket, sk);
> > > > > > -                       if (err) {
> > > > > > -                               sk->sk_v6_rcv_saddr = prev_v6_rcv_saddr;
> > > > > > -                               goto failure;
> > > > > > -                       }
> > > > > > -               }
> > > > > > +               err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> > > > > > +               if (err)
> > > > > > +                       goto failure;
> > > > > >         }
> > > > > >
> > > > > >         /* set the source address */
> > > > > > --
> > > > > > 2.30.2
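
The hunks above change inet_bhash2_update_saddr() to take the new address and perform the whole move itself. It returns early if the socket was never bind()ed, preallocates the new bucket, and unlinks from the old bhash2 chain while saddr is still unchanged. Only then does it write saddr and relink into the chain for the new address. For illustration only, here is a minimal userspace sketch of that ordering, assuming a toy table keyed by (port, saddr). Every name in it (struct bucket, struct head, struct sock_model, init_table(), hashfn(), link_sk(), bind_sk(), update_saddr()) is hypothetical and not kernel API. pthread mutexes stand in for spin_lock_bh(); build with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 16

struct bucket {                    /* hypothetical stand-in for inet_bind2_bucket */
	unsigned short port;
	unsigned int saddr;
	int refs;                      /* sockets linked to this bucket */
	struct bucket *next;
};

struct head {                      /* hypothetical stand-in for inet_bind_hashbucket */
	pthread_mutex_t lock;
	struct bucket *chain;
};

struct sock_model {                /* hypothetical socket */
	unsigned short port;
	unsigned int saddr;
	struct bucket *tb2;            /* NULL until bound, like icsk_bind2_hash */
};

static struct head table[NBUCKETS];

static void init_table(void)
{
	for (int i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

/* The chain depends on saddr, so changing saddr moves the socket. */
static struct head *hashfn(unsigned short port, unsigned int saddr)
{
	return &table[(port ^ saddr) & (NBUCKETS - 1)];
}

/* Link sk into the chain for its current saddr; may consume *spare. */
static void link_sk(struct sock_model *sk, struct bucket **spare)
{
	struct head *h = hashfn(sk->port, sk->saddr);
	struct bucket *b;

	pthread_mutex_lock(&h->lock);
	for (b = h->chain; b; b = b->next)
		if (b->port == sk->port && b->saddr == sk->saddr)
			break;
	if (!b) {                      /* no match: use the preallocated bucket */
		b = *spare;
		*spare = NULL;
		b->port = sk->port;
		b->saddr = sk->saddr;
		b->next = h->chain;
		h->chain = b;
	}
	b->refs++;
	sk->tb2 = b;
	pthread_mutex_unlock(&h->lock);
}

static int bind_sk(struct sock_model *sk, unsigned short port, unsigned int saddr)
{
	struct bucket *spare = calloc(1, sizeof(*spare));
	if (!spare)
		return -1;
	sk->port = port;
	sk->saddr = saddr;
	link_sk(sk, &spare);
	free(spare);                   /* free(NULL) is a no-op */
	return 0;
}

static int update_saddr(struct sock_model *sk, unsigned int new_saddr)
{
	struct bucket *spare, **pp;
	struct head *h;
	if (!sk->tb2) {                /* never bound: nothing to rehash */
		sk->saddr = new_saddr;
		return 0;
	}
	/* Preallocate so an allocation failure leaves the table untouched. */
	spare = calloc(1, sizeof(*spare));
	if (!spare)
		return -1;
	/* Unlink while sk->saddr still selects the OLD chain. */
	h = hashfn(sk->port, sk->saddr);
	pthread_mutex_lock(&h->lock);
	for (pp = &h->chain; *pp != sk->tb2; pp = &(*pp)->next)
		assert(*pp != NULL);       /* bucket must be on this chain */
	if (--sk->tb2->refs == 0) {    /* last user: destroy the old bucket */
		*pp = sk->tb2->next;
		free(sk->tb2);
	}
	sk->tb2 = NULL;
	pthread_mutex_unlock(&h->lock);
	sk->saddr = new_saddr;         /* publish the new address only now */
	link_sk(sk, &spare);           /* relink into the NEW chain */
	free(spare);
	return 0;
}
```

Because hashfn() depends on saddr, writing the new address before unlinking would make the unlink step walk the new chain instead of the one that holds the bucket. In this toy model, the assert() in update_saddr() catches that mistake. The diff's own comment gives the concurrency reason for the same ordering: other threads must not see the new address on the old bucket.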

Thread overview: 36+ messages
2022-11-16 22:28 [PATCH v2 net 0/4] dccp/tcp: Fix bhash2 issues related to WARN_ON() in inet_csk_get_port() Kuniyuki Iwashima
2022-11-16 22:28 ` [PATCH v2 net 1/4] dccp/tcp: Reset saddr on failure after inet6?_hash_connect() Kuniyuki Iwashima
2022-11-17  0:11   ` Joanne Koong
2022-11-17  0:20     ` Kuniyuki Iwashima
2022-11-17  0:43       ` Joanne Koong
2022-11-16 22:28 ` [PATCH v2 net 2/4] dccp/tcp: Remove NULL check for prev_saddr in inet_bhash2_update_saddr() Kuniyuki Iwashima
2022-11-17  0:07   ` Joanne Koong
2022-11-16 22:28 ` [PATCH v2 net 3/4] dccp/tcp: Don't update saddr before unlinking sk from the old bucket Kuniyuki Iwashima
2022-11-17 21:32   ` Joanne Koong
2022-11-17 23:59   ` Kuniyuki Iwashima
2022-11-18  0:06     ` Kuniyuki Iwashima
2022-11-18  0:55     ` Joanne Koong
2022-11-18  1:08       ` Kuniyuki Iwashima
2022-11-18 18:56   ` Joanne Koong
2022-11-18 19:02     ` Joanne Koong
2022-11-18 19:58     ` Kuniyuki Iwashima
2022-11-16 22:28 ` [PATCH v2 net 4/4] dccp/tcp: Fixup bhash2 bucket when connect() fails Kuniyuki Iwashima
2022-11-17  2:23   ` Pengfei Xu
2022-11-17  3:20     ` Kuniyuki Iwashima
2022-11-17  4:56   ` Pengfei Xu
2022-11-17  5:02     ` Pengfei Xu
