All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression.
@ 2022-12-26 13:27 Kuniyuki Iwashima
  2022-12-26 13:27 ` [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2 Kuniyuki Iwashima
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Kuniyuki Iwashima @ 2022-12-26 13:27 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Jiri Slaby, Joanne Koong, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We forgot to add twsk to bhash2.  Therefore TIME_WAIT sockets cannot
prevent bind() to the same local address and port.


Changes:
  v1:
    * Patch 1:
      * Add tw_bind2_node in inet_timewait_sock instead of
        moving sk_bind2_node from struct sock to struct
	sock_common.

  RFC: https://lore.kernel.org/netdev/20221221151258.25748-1-kuniyu@amazon.com/


Kuniyuki Iwashima (2):
  tcp: Add TIME_WAIT sockets in bhash2.
  tcp: Add selftest for bind() and TIME_WAIT.

 include/net/inet_hashtables.h               |  4 +
 include/net/inet_timewait_sock.h            |  5 ++
 net/ipv4/inet_connection_sock.c             | 26 +++++-
 net/ipv4/inet_hashtables.c                  |  8 +-
 net/ipv4/inet_timewait_sock.c               | 31 ++++++-
 tools/testing/selftests/net/.gitignore      |  1 +
 tools/testing/selftests/net/bind_timewait.c | 92 +++++++++++++++++++++
 7 files changed, 158 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/net/bind_timewait.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2.
  2022-12-26 13:27 [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression Kuniyuki Iwashima
@ 2022-12-26 13:27 ` Kuniyuki Iwashima
  2022-12-27 18:26   ` Joanne Koong
  2022-12-26 13:27 ` [PATCH v1 net 2/2] tcp: Add selftest for bind() and TIME_WAIT Kuniyuki Iwashima
  2022-12-30  7:40 ` [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression patchwork-bot+netdevbpf
  2 siblings, 1 reply; 6+ messages in thread
From: Kuniyuki Iwashima @ 2022-12-26 13:27 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Jiri Slaby, Joanne Koong, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Jiri Slaby reported regression of bind() with a simple repro. [0]

The repro creates a TIME_WAIT socket and tries to bind() a new socket
with the same local address and port.  Before commit 28044fc1d495 ("net:
Add a bhash2 table hashed by port and address"), the bind() failed with
-EADDRINUSE, but now it succeeds.

The cited commit should have put TIME_WAIT sockets into bhash2; otherwise,
inet_bhash2_conflict() misses TIME_WAIT sockets when validating bind()
requests if the address is not a wildcard one.

The straight option is to move sk_bind2_node from struct sock to struct
sock_common to add twsk to bhash2 as implemented as RFC. [1]  However, the
binary layout change in the struct sock could affect performances moving
hot fields on different cachelines.

To avoid that, we add another TIME_WAIT list in inet_bind2_bucket and check
it while validating bind().

[0]: https://lore.kernel.org/netdev/6b971a4e-c7d8-411e-1f92-fda29b5b2fb9@kernel.org/
[1]: https://lore.kernel.org/netdev/20221221151258.25748-2-kuniyu@amazon.com/

Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet_hashtables.h    |  4 ++++
 include/net/inet_timewait_sock.h |  5 +++++
 net/ipv4/inet_connection_sock.c  | 26 ++++++++++++++++++++++----
 net/ipv4/inet_hashtables.c       |  8 +++++---
 net/ipv4/inet_timewait_sock.c    | 31 +++++++++++++++++++++++++++++--
 5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 69174093078f..99bd823e97f6 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -108,6 +108,10 @@ struct inet_bind2_bucket {
 	struct hlist_node	node;
 	/* List of sockets hashed to this bucket */
 	struct hlist_head	owners;
+	/* bhash has twsk in owners, but bhash2 has twsk in
+	 * deathrow not to add a member in struct sock_common.
+	 */
+	struct hlist_head	deathrow;
 };
 
 static inline struct net *ib_net(const struct inet_bind_bucket *ib)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 5b47545f22d3..4a8e578405cb 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -73,9 +73,14 @@ struct inet_timewait_sock {
 	u32			tw_priority;
 	struct timer_list	tw_timer;
 	struct inet_bind_bucket	*tw_tb;
+	struct inet_bind2_bucket	*tw_tb2;
+	struct hlist_node		tw_bind2_node;
 };
 #define tw_tclass tw_tos
 
+#define twsk_for_each_bound_bhash2(__tw, list) \
+	hlist_for_each_entry(__tw, list, tw_bind2_node)
+
 static inline struct inet_timewait_sock *inet_twsk(const struct sock *sk)
 {
 	return (struct inet_timewait_sock *)sk;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b366ab9148f2..848ffc3e0239 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -173,22 +173,40 @@ static bool inet_bind_conflict(const struct sock *sk, struct sock *sk2,
 	return false;
 }
 
+static bool __inet_bhash2_conflict(const struct sock *sk, struct sock *sk2,
+				   kuid_t sk_uid, bool relax,
+				   bool reuseport_cb_ok, bool reuseport_ok)
+{
+	if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
+		return false;
+
+	return inet_bind_conflict(sk, sk2, sk_uid, relax,
+				  reuseport_cb_ok, reuseport_ok);
+}
+
 static bool inet_bhash2_conflict(const struct sock *sk,
 				 const struct inet_bind2_bucket *tb2,
 				 kuid_t sk_uid,
 				 bool relax, bool reuseport_cb_ok,
 				 bool reuseport_ok)
 {
+	struct inet_timewait_sock *tw2;
 	struct sock *sk2;
 
 	sk_for_each_bound_bhash2(sk2, &tb2->owners) {
-		if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
-			continue;
+		if (__inet_bhash2_conflict(sk, sk2, sk_uid, relax,
+					   reuseport_cb_ok, reuseport_ok))
+			return true;
+	}
 
-		if (inet_bind_conflict(sk, sk2, sk_uid, relax,
-				       reuseport_cb_ok, reuseport_ok))
+	twsk_for_each_bound_bhash2(tw2, &tb2->deathrow) {
+		sk2 = (struct sock *)tw2;
+
+		if (__inet_bhash2_conflict(sk, sk2, sk_uid, relax,
+					   reuseport_cb_ok, reuseport_ok))
 			return true;
 	}
+
 	return false;
 }
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d039b4e732a3..24a38b56fab9 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -116,6 +116,7 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb,
 #endif
 		tb->rcv_saddr = sk->sk_rcv_saddr;
 	INIT_HLIST_HEAD(&tb->owners);
+	INIT_HLIST_HEAD(&tb->deathrow);
 	hlist_add_head(&tb->node, &head->chain);
 }
 
@@ -137,7 +138,7 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
 /* Caller must hold hashbucket lock for this tb with local BH disabled */
 void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
 {
-	if (hlist_empty(&tb->owners)) {
+	if (hlist_empty(&tb->owners) && hlist_empty(&tb->deathrow)) {
 		__hlist_del(&tb->node);
 		kmem_cache_free(cachep, tb);
 	}
@@ -1103,15 +1104,16 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	/* Head lock still held and bh's disabled */
 	inet_bind_hash(sk, tb, tb2, port);
 
-	spin_unlock(&head2->lock);
-
 	if (sk_unhashed(sk)) {
 		inet_sk(sk)->inet_sport = htons(port);
 		inet_ehash_nolisten(sk, (struct sock *)tw, NULL);
 	}
 	if (tw)
 		inet_twsk_bind_unhash(tw, hinfo);
+
+	spin_unlock(&head2->lock);
 	spin_unlock(&head->lock);
+
 	if (tw)
 		inet_twsk_deschedule_put(tw);
 	local_bh_enable();
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 66fc940f9521..1d77d992e6e7 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -29,6 +29,7 @@
 void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 			  struct inet_hashinfo *hashinfo)
 {
+	struct inet_bind2_bucket *tb2 = tw->tw_tb2;
 	struct inet_bind_bucket *tb = tw->tw_tb;
 
 	if (!tb)
@@ -37,6 +38,11 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 	__hlist_del(&tw->tw_bind_node);
 	tw->tw_tb = NULL;
 	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+
+	__hlist_del(&tw->tw_bind2_node);
+	tw->tw_tb2 = NULL;
+	inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
+
 	__sock_put((struct sock *)tw);
 }
 
@@ -45,7 +51,7 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 {
 	struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
-	struct inet_bind_hashbucket *bhead;
+	struct inet_bind_hashbucket *bhead, *bhead2;
 
 	spin_lock(lock);
 	sk_nulls_del_node_init_rcu((struct sock *)tw);
@@ -54,9 +60,13 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	/* Disassociate with bind bucket. */
 	bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
 			hashinfo->bhash_size)];
+	bhead2 = inet_bhashfn_portaddr(hashinfo, (struct sock *)tw,
+				       twsk_net(tw), tw->tw_num);
 
 	spin_lock(&bhead->lock);
+	spin_lock(&bhead2->lock);
 	inet_twsk_bind_unhash(tw, hashinfo);
+	spin_unlock(&bhead2->lock);
 	spin_unlock(&bhead->lock);
 
 	refcount_dec(&tw->tw_dr->tw_refcount);
@@ -93,6 +103,12 @@ static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
 	hlist_add_head(&tw->tw_bind_node, list);
 }
 
+static void inet_twsk_add_bind2_node(struct inet_timewait_sock *tw,
+				     struct hlist_head *list)
+{
+	hlist_add_head(&tw->tw_bind2_node, list);
+}
+
 /*
  * Enter the time wait state. This is called with locally disabled BH.
  * Essentially we whip up a timewait bucket, copy the relevant info into it
@@ -105,17 +121,28 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_ehash_bucket *ehead = inet_ehash_bucket(hashinfo, sk->sk_hash);
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
-	struct inet_bind_hashbucket *bhead;
+	struct inet_bind_hashbucket *bhead, *bhead2;
+
 	/* Step 1: Put TW into bind hash. Original socket stays there too.
 	   Note, that any socket with inet->num != 0 MUST be bound in
 	   binding cache, even if it is closed.
 	 */
 	bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), inet->inet_num,
 			hashinfo->bhash_size)];
+	bhead2 = inet_bhashfn_portaddr(hashinfo, sk, twsk_net(tw), inet->inet_num);
+
 	spin_lock(&bhead->lock);
+	spin_lock(&bhead2->lock);
+
 	tw->tw_tb = icsk->icsk_bind_hash;
 	WARN_ON(!icsk->icsk_bind_hash);
 	inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
+
+	tw->tw_tb2 = icsk->icsk_bind2_hash;
+	WARN_ON(!icsk->icsk_bind2_hash);
+	inet_twsk_add_bind2_node(tw, &tw->tw_tb2->deathrow);
+
+	spin_unlock(&bhead2->lock);
 	spin_unlock(&bhead->lock);
 
 	spin_lock(lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v1 net 2/2] tcp: Add selftest for bind() and TIME_WAIT.
  2022-12-26 13:27 [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression Kuniyuki Iwashima
  2022-12-26 13:27 ` [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2 Kuniyuki Iwashima
@ 2022-12-26 13:27 ` Kuniyuki Iwashima
  2022-12-27 18:08   ` Joanne Koong
  2022-12-30  7:40 ` [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression patchwork-bot+netdevbpf
  2 siblings, 1 reply; 6+ messages in thread
From: Kuniyuki Iwashima @ 2022-12-26 13:27 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Jiri Slaby, Joanne Koong, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

bhash2 split the bind() validation logic into wildcard and non-wildcard
cases.  Let's add a test to catch future regression.

Before the previous patch:

  # ./bind_timewait
  TAP version 13
  1..2
  # Starting 2 tests from 3 test cases.
  #  RUN           bind_timewait.localhost.1 ...
  # bind_timewait.c:87:1:Expected ret (0) == -1 (-1)
  # 1: Test terminated by assertion
  #          FAIL  bind_timewait.localhost.1
  not ok 1 bind_timewait.localhost.1
  #  RUN           bind_timewait.addrany.1 ...
  #            OK  bind_timewait.addrany.1
  ok 2 bind_timewait.addrany.1
  # FAILED: 1 / 2 tests passed.
  # Totals: pass:1 fail:1 xfail:0 xpass:0 skip:0 error:0

After:

  # ./bind_timewait
  TAP version 13
  1..2
  # Starting 2 tests from 3 test cases.
  #  RUN           bind_timewait.localhost.1 ...
  #            OK  bind_timewait.localhost.1
  ok 1 bind_timewait.localhost.1
  #  RUN           bind_timewait.addrany.1 ...
  #            OK  bind_timewait.addrany.1
  ok 2 bind_timewait.addrany.1
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 tools/testing/selftests/net/.gitignore      |  1 +
 tools/testing/selftests/net/bind_timewait.c | 92 +++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 tools/testing/selftests/net/bind_timewait.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 9cc84114741d..a6911cae368c 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 bind_bhash
+bind_timewait
 csum
 cmsg_sender
 diag_uid
diff --git a/tools/testing/selftests/net/bind_timewait.c b/tools/testing/selftests/net/bind_timewait.c
new file mode 100644
index 000000000000..cb9fdf51ea59
--- /dev/null
+++ b/tools/testing/selftests/net/bind_timewait.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include <sys/socket.h>
+#include <netinet/in.h>
+
+#include "../kselftest_harness.h"
+
+FIXTURE(bind_timewait)
+{
+	struct sockaddr_in addr;
+	socklen_t addrlen;
+};
+
+FIXTURE_VARIANT(bind_timewait)
+{
+	__u32 addr_const;
+};
+
+FIXTURE_VARIANT_ADD(bind_timewait, localhost)
+{
+	.addr_const = INADDR_LOOPBACK
+};
+
+FIXTURE_VARIANT_ADD(bind_timewait, addrany)
+{
+	.addr_const = INADDR_ANY
+};
+
+FIXTURE_SETUP(bind_timewait)
+{
+	self->addr.sin_family = AF_INET;
+	self->addr.sin_port = 0;
+	self->addr.sin_addr.s_addr = htonl(variant->addr_const);
+	self->addrlen = sizeof(self->addr);
+}
+
+FIXTURE_TEARDOWN(bind_timewait)
+{
+}
+
+void create_timewait_socket(struct __test_metadata *_metadata,
+			    FIXTURE_DATA(bind_timewait) *self)
+{
+	int server_fd, client_fd, child_fd, ret;
+	struct sockaddr_in addr;
+	socklen_t addrlen;
+
+	server_fd = socket(AF_INET, SOCK_STREAM, 0);
+	ASSERT_GT(server_fd, 0);
+
+	ret = bind(server_fd, (struct sockaddr *)&self->addr, self->addrlen);
+	ASSERT_EQ(ret, 0);
+
+	ret = listen(server_fd, 1);
+	ASSERT_EQ(ret, 0);
+
+	ret = getsockname(server_fd, (struct sockaddr *)&self->addr, &self->addrlen);
+	ASSERT_EQ(ret, 0);
+
+	client_fd = socket(AF_INET, SOCK_STREAM, 0);
+	ASSERT_GT(client_fd, 0);
+
+	ret = connect(client_fd, (struct sockaddr *)&self->addr, self->addrlen);
+	ASSERT_EQ(ret, 0);
+
+	addrlen = sizeof(addr);
+	child_fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+	ASSERT_GT(child_fd, 0);
+
+	close(child_fd);
+	close(client_fd);
+	close(server_fd);
+}
+
+TEST_F(bind_timewait, 1)
+{
+	int fd, ret;
+
+	create_timewait_socket(_metadata, self);
+
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	ASSERT_GT(fd, 0);
+
+	ret = bind(fd, (struct sockaddr *)&self->addr, self->addrlen);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EADDRINUSE);
+
+	close(fd);
+}
+
+TEST_HARNESS_MAIN
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net 2/2] tcp: Add selftest for bind() and TIME_WAIT.
  2022-12-26 13:27 ` [PATCH v1 net 2/2] tcp: Add selftest for bind() and TIME_WAIT Kuniyuki Iwashima
@ 2022-12-27 18:08   ` Joanne Koong
  0 siblings, 0 replies; 6+ messages in thread
From: Joanne Koong @ 2022-12-27 18:08 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jiri Slaby, Kuniyuki Iwashima, netdev

On Mon, Dec 26, 2022 at 5:29 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> bhash2 split the bind() validation logic into wildcard and non-wildcard
> cases.  Let's add a test to catch future regression.
>
> Before the previous patch:
>
>   # ./bind_timewait
>   TAP version 13
>   1..2
>   # Starting 2 tests from 3 test cases.
>   #  RUN           bind_timewait.localhost.1 ...
>   # bind_timewait.c:87:1:Expected ret (0) == -1 (-1)
>   # 1: Test terminated by assertion
>   #          FAIL  bind_timewait.localhost.1
>   not ok 1 bind_timewait.localhost.1
>   #  RUN           bind_timewait.addrany.1 ...
>   #            OK  bind_timewait.addrany.1
>   ok 2 bind_timewait.addrany.1
>   # FAILED: 1 / 2 tests passed.
>   # Totals: pass:1 fail:1 xfail:0 xpass:0 skip:0 error:0
>
> After:
>
>   # ./bind_timewait
>   TAP version 13
>   1..2
>   # Starting 2 tests from 3 test cases.
>   #  RUN           bind_timewait.localhost.1 ...
>   #            OK  bind_timewait.localhost.1
>   ok 1 bind_timewait.localhost.1
>   #  RUN           bind_timewait.addrany.1 ...
>   #            OK  bind_timewait.addrany.1
>   ok 2 bind_timewait.addrany.1
>   # PASSED: 2 / 2 tests passed.
>   # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Joanne Koong <joannelkoong@gmail.com>

> ---
>  tools/testing/selftests/net/.gitignore      |  1 +
>  tools/testing/selftests/net/bind_timewait.c | 92 +++++++++++++++++++++
>  2 files changed, 93 insertions(+)
>  create mode 100644 tools/testing/selftests/net/bind_timewait.c
>
> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
> index 9cc84114741d..a6911cae368c 100644
> --- a/tools/testing/selftests/net/.gitignore
> +++ b/tools/testing/selftests/net/.gitignore
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  bind_bhash
> +bind_timewait
>  csum
>  cmsg_sender
>  diag_uid
> diff --git a/tools/testing/selftests/net/bind_timewait.c b/tools/testing/selftests/net/bind_timewait.c
> new file mode 100644
> index 000000000000..cb9fdf51ea59
> --- /dev/null
> +++ b/tools/testing/selftests/net/bind_timewait.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright Amazon.com Inc. or its affiliates. */
> +
> +#include <sys/socket.h>
> +#include <netinet/in.h>
> +
> +#include "../kselftest_harness.h"
> +
> +FIXTURE(bind_timewait)
> +{
> +       struct sockaddr_in addr;
> +       socklen_t addrlen;
> +};
> +
> +FIXTURE_VARIANT(bind_timewait)
> +{
> +       __u32 addr_const;
> +};
> +
> +FIXTURE_VARIANT_ADD(bind_timewait, localhost)
> +{
> +       .addr_const = INADDR_LOOPBACK
> +};
> +
> +FIXTURE_VARIANT_ADD(bind_timewait, addrany)
> +{
> +       .addr_const = INADDR_ANY
> +};
> +
> +FIXTURE_SETUP(bind_timewait)
> +{
> +       self->addr.sin_family = AF_INET;
> +       self->addr.sin_port = 0;
> +       self->addr.sin_addr.s_addr = htonl(variant->addr_const);
> +       self->addrlen = sizeof(self->addr);
> +}
> +
> +FIXTURE_TEARDOWN(bind_timewait)
> +{
> +}
> +
> +void create_timewait_socket(struct __test_metadata *_metadata,
> +                           FIXTURE_DATA(bind_timewait) *self)
> +{
> +       int server_fd, client_fd, child_fd, ret;
> +       struct sockaddr_in addr;
> +       socklen_t addrlen;
> +
> +       server_fd = socket(AF_INET, SOCK_STREAM, 0);
> +       ASSERT_GT(server_fd, 0);
> +
> +       ret = bind(server_fd, (struct sockaddr *)&self->addr, self->addrlen);
> +       ASSERT_EQ(ret, 0);
> +
> +       ret = listen(server_fd, 1);
> +       ASSERT_EQ(ret, 0);
> +
> +       ret = getsockname(server_fd, (struct sockaddr *)&self->addr, &self->addrlen);
> +       ASSERT_EQ(ret, 0);
> +
> +       client_fd = socket(AF_INET, SOCK_STREAM, 0);
> +       ASSERT_GT(client_fd, 0);
> +
> +       ret = connect(client_fd, (struct sockaddr *)&self->addr, self->addrlen);
> +       ASSERT_EQ(ret, 0);
> +
> +       addrlen = sizeof(addr);
> +       child_fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
> +       ASSERT_GT(child_fd, 0);
> +
> +       close(child_fd);
> +       close(client_fd);
> +       close(server_fd);
> +}
> +
> +TEST_F(bind_timewait, 1)
> +{
> +       int fd, ret;
> +
> +       create_timewait_socket(_metadata, self);
> +
> +       fd = socket(AF_INET, SOCK_STREAM, 0);
> +       ASSERT_GT(fd, 0);
> +
> +       ret = bind(fd, (struct sockaddr *)&self->addr, self->addrlen);
> +       ASSERT_EQ(ret, -1);
> +       ASSERT_EQ(errno, EADDRINUSE);
> +
> +       close(fd);
> +}
> +
> +TEST_HARNESS_MAIN
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2.
  2022-12-26 13:27 ` [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2 Kuniyuki Iwashima
@ 2022-12-27 18:26   ` Joanne Koong
  0 siblings, 0 replies; 6+ messages in thread
From: Joanne Koong @ 2022-12-27 18:26 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jiri Slaby, Kuniyuki Iwashima, netdev

On Mon, Dec 26, 2022 at 5:28 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Jiri Slaby reported regression of bind() with a simple repro. [0]
>
> The repro creates a TIME_WAIT socket and tries to bind() a new socket
> with the same local address and port.  Before commit 28044fc1d495 ("net:
> Add a bhash2 table hashed by port and address"), the bind() failed with
> -EADDRINUSE, but now it succeeds.
>
> The cited commit should have put TIME_WAIT sockets into bhash2; otherwise,
> inet_bhash2_conflict() misses TIME_WAIT sockets when validating bind()
> requests if the address is not a wildcard one.
>
> The straight option is to move sk_bind2_node from struct sock to struct
> sock_common to add twsk to bhash2 as implemented as RFC. [1]  However, the
> binary layout change in the struct sock could affect performances moving
> hot fields on different cachelines.
>
> To avoid that, we add another TIME_WAIT list in inet_bind2_bucket and check
> it while validating bind().
>
> [0]: https://lore.kernel.org/netdev/6b971a4e-c7d8-411e-1f92-fda29b5b2fb9@kernel.org/
> [1]: https://lore.kernel.org/netdev/20221221151258.25748-2-kuniyu@amazon.com/
>
> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Joanne Koong <joannelkoong@gmail.com>

> ---
>  include/net/inet_hashtables.h    |  4 ++++
>  include/net/inet_timewait_sock.h |  5 +++++
>  net/ipv4/inet_connection_sock.c  | 26 ++++++++++++++++++++++----
>  net/ipv4/inet_hashtables.c       |  8 +++++---
>  net/ipv4/inet_timewait_sock.c    | 31 +++++++++++++++++++++++++++++--
>  5 files changed, 65 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 69174093078f..99bd823e97f6 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -108,6 +108,10 @@ struct inet_bind2_bucket {
>         struct hlist_node       node;
>         /* List of sockets hashed to this bucket */
>         struct hlist_head       owners;
> +       /* bhash has twsk in owners, but bhash2 has twsk in
> +        * deathrow not to add a member in struct sock_common.

nit: I think "but bhash2 has twsk in deathrow to avoid adding a member
to struct sock_common" would be clearer.

> +        */
> +       struct hlist_head       deathrow;
>  };
>
>  static inline struct net *ib_net(const struct inet_bind_bucket *ib)
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 5b47545f22d3..4a8e578405cb 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -73,9 +73,14 @@ struct inet_timewait_sock {
>         u32                     tw_priority;
>         struct timer_list       tw_timer;
>         struct inet_bind_bucket *tw_tb;
> +       struct inet_bind2_bucket        *tw_tb2;
> +       struct hlist_node               tw_bind2_node;
>  };
>  #define tw_tclass tw_tos
>
> +#define twsk_for_each_bound_bhash2(__tw, list) \
> +       hlist_for_each_entry(__tw, list, tw_bind2_node)
> +
>  static inline struct inet_timewait_sock *inet_twsk(const struct sock *sk)
>  {
>         return (struct inet_timewait_sock *)sk;
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index b366ab9148f2..848ffc3e0239 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -173,22 +173,40 @@ static bool inet_bind_conflict(const struct sock *sk, struct sock *sk2,
>         return false;
>  }
>
> +static bool __inet_bhash2_conflict(const struct sock *sk, struct sock *sk2,
> +                                  kuid_t sk_uid, bool relax,
> +                                  bool reuseport_cb_ok, bool reuseport_ok)
> +{
> +       if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
> +               return false;
> +
> +       return inet_bind_conflict(sk, sk2, sk_uid, relax,
> +                                 reuseport_cb_ok, reuseport_ok);
> +}
> +
>  static bool inet_bhash2_conflict(const struct sock *sk,
>                                  const struct inet_bind2_bucket *tb2,
>                                  kuid_t sk_uid,
>                                  bool relax, bool reuseport_cb_ok,
>                                  bool reuseport_ok)
>  {
> +       struct inet_timewait_sock *tw2;
>         struct sock *sk2;
>
>         sk_for_each_bound_bhash2(sk2, &tb2->owners) {
> -               if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
> -                       continue;
> +               if (__inet_bhash2_conflict(sk, sk2, sk_uid, relax,
> +                                          reuseport_cb_ok, reuseport_ok))
> +                       return true;
> +       }
>
> -               if (inet_bind_conflict(sk, sk2, sk_uid, relax,
> -                                      reuseport_cb_ok, reuseport_ok))
> +       twsk_for_each_bound_bhash2(tw2, &tb2->deathrow) {
> +               sk2 = (struct sock *)tw2;
> +
> +               if (__inet_bhash2_conflict(sk, sk2, sk_uid, relax,
> +                                          reuseport_cb_ok, reuseport_ok))
>                         return true;
>         }
> +
>         return false;
>  }
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index d039b4e732a3..24a38b56fab9 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -116,6 +116,7 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb,
>  #endif
>                 tb->rcv_saddr = sk->sk_rcv_saddr;
>         INIT_HLIST_HEAD(&tb->owners);
> +       INIT_HLIST_HEAD(&tb->deathrow);
>         hlist_add_head(&tb->node, &head->chain);
>  }
>
> @@ -137,7 +138,7 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
>  /* Caller must hold hashbucket lock for this tb with local BH disabled */
>  void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
>  {
> -       if (hlist_empty(&tb->owners)) {
> +       if (hlist_empty(&tb->owners) && hlist_empty(&tb->deathrow)) {
>                 __hlist_del(&tb->node);
>                 kmem_cache_free(cachep, tb);
>         }
> @@ -1103,15 +1104,16 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>         /* Head lock still held and bh's disabled */
>         inet_bind_hash(sk, tb, tb2, port);
>
> -       spin_unlock(&head2->lock);
> -
>         if (sk_unhashed(sk)) {
>                 inet_sk(sk)->inet_sport = htons(port);
>                 inet_ehash_nolisten(sk, (struct sock *)tw, NULL);
>         }
>         if (tw)
>                 inet_twsk_bind_unhash(tw, hinfo);
> +
> +       spin_unlock(&head2->lock);
>         spin_unlock(&head->lock);
> +
>         if (tw)
>                 inet_twsk_deschedule_put(tw);
>         local_bh_enable();
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 66fc940f9521..1d77d992e6e7 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -29,6 +29,7 @@
>  void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
>                           struct inet_hashinfo *hashinfo)
>  {
> +       struct inet_bind2_bucket *tb2 = tw->tw_tb2;
>         struct inet_bind_bucket *tb = tw->tw_tb;
>
>         if (!tb)
> @@ -37,6 +38,11 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
>         __hlist_del(&tw->tw_bind_node);
>         tw->tw_tb = NULL;
>         inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
> +
> +       __hlist_del(&tw->tw_bind2_node);
> +       tw->tw_tb2 = NULL;
> +       inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
> +
>         __sock_put((struct sock *)tw);
>  }
>
> @@ -45,7 +51,7 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
>  {
>         struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
>         spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> -       struct inet_bind_hashbucket *bhead;
> +       struct inet_bind_hashbucket *bhead, *bhead2;
>
>         spin_lock(lock);
>         sk_nulls_del_node_init_rcu((struct sock *)tw);
> @@ -54,9 +60,13 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
>         /* Disassociate with bind bucket. */
>         bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
>                         hashinfo->bhash_size)];
> +       bhead2 = inet_bhashfn_portaddr(hashinfo, (struct sock *)tw,
> +                                      twsk_net(tw), tw->tw_num);
>
>         spin_lock(&bhead->lock);
> +       spin_lock(&bhead2->lock);
>         inet_twsk_bind_unhash(tw, hashinfo);
> +       spin_unlock(&bhead2->lock);
>         spin_unlock(&bhead->lock);
>
>         refcount_dec(&tw->tw_dr->tw_refcount);
> @@ -93,6 +103,12 @@ static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
>         hlist_add_head(&tw->tw_bind_node, list);
>  }
>
> +static void inet_twsk_add_bind2_node(struct inet_timewait_sock *tw,
> +                                    struct hlist_head *list)
> +{
> +       hlist_add_head(&tw->tw_bind2_node, list);
> +}
> +
>  /*
>   * Enter the time wait state. This is called with locally disabled BH.
>   * Essentially we whip up a timewait bucket, copy the relevant info into it
> @@ -105,17 +121,28 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>         const struct inet_connection_sock *icsk = inet_csk(sk);
>         struct inet_ehash_bucket *ehead = inet_ehash_bucket(hashinfo, sk->sk_hash);
>         spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
> -       struct inet_bind_hashbucket *bhead;
> +       struct inet_bind_hashbucket *bhead, *bhead2;
> +
>         /* Step 1: Put TW into bind hash. Original socket stays there too.
>            Note, that any socket with inet->num != 0 MUST be bound in
>            binding cache, even if it is closed.
>          */
>         bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), inet->inet_num,
>                         hashinfo->bhash_size)];
> +       bhead2 = inet_bhashfn_portaddr(hashinfo, sk, twsk_net(tw), inet->inet_num);
> +
>         spin_lock(&bhead->lock);
> +       spin_lock(&bhead2->lock);
> +
>         tw->tw_tb = icsk->icsk_bind_hash;
>         WARN_ON(!icsk->icsk_bind_hash);
>         inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
> +
> +       tw->tw_tb2 = icsk->icsk_bind2_hash;
> +       WARN_ON(!icsk->icsk_bind2_hash);
> +       inet_twsk_add_bind2_node(tw, &tw->tw_tb2->deathrow);
> +
> +       spin_unlock(&bhead2->lock);
>         spin_unlock(&bhead->lock);
>
>         spin_lock(lock);
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression.
  2022-12-26 13:27 [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression Kuniyuki Iwashima
  2022-12-26 13:27 ` [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2 Kuniyuki Iwashima
  2022-12-26 13:27 ` [PATCH v1 net 2/2] tcp: Add selftest for bind() and TIME_WAIT Kuniyuki Iwashima
@ 2022-12-30  7:40 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-12-30  7:40 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, edumazet, kuba, pabeni, jirislaby, joannelkoong, kuni1840, netdev

Hello:

This series was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Mon, 26 Dec 2022 22:27:51 +0900 you wrote:
> We forgot to add twsk to bhash2.  Therefore TIME_WAIT sockets cannot
> prevent bind() to the same local address and port.
> 
> 
> Changes:
>   v1:
>     * Patch 1:
>       * Add tw_bind2_node in inet_timewait_sock instead of
>         moving sk_bind2_node from struct sock to struct
> 	sock_common.
> 
> [...]

Here is the summary with links:
  - [v1,net,1/2] tcp: Add TIME_WAIT sockets in bhash2.
    https://git.kernel.org/netdev/net/c/936a192f9740
  - [v1,net,2/2] tcp: Add selftest for bind() and TIME_WAIT.
    https://git.kernel.org/netdev/net/c/2c042e8e54ef

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-12-30  7:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-26 13:27 [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression Kuniyuki Iwashima
2022-12-26 13:27 ` [PATCH v1 net 1/2] tcp: Add TIME_WAIT sockets in bhash2 Kuniyuki Iwashima
2022-12-27 18:26   ` Joanne Koong
2022-12-26 13:27 ` [PATCH v1 net 2/2] tcp: Add selftest for bind() and TIME_WAIT Kuniyuki Iwashima
2022-12-27 18:08   ` Joanne Koong
2022-12-30  7:40 ` [PATCH v1 net 0/2] tcp: Fix bhash2 and TIME_WAIT regression patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.