All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 net-next 0/4] Improve bind(addr, 0) behaviour.
@ 2020-02-29 11:35 Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 1/4] tcp: Remove unnecessary conditions in inet_csk_bind_conflict() Kuniyuki Iwashima
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-02-29 11:35 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, edumazet
  Cc: kuniyu, kuni1840, netdev, osa-contribution-log

Currently we fail to bind sockets to ephemeral ports when all of the ports
are exhausted even if all sockets have SO_REUSEADDR enabled. In this case,
we still have a chance to connect to the different remote hosts.

The second and third patches fix the behaviour to fully utilize all space
of the local (addr, port) tuples.

---
Changes in v3:
  - Change the title and write more specific description of the 3rd patch.
  - Add a test in tools/testing/selftests/net/ as the 4th patch.

Changes in v2:
  - Change the description of the 2nd patch ('localhost' -> 'address').
  - Correct the description and the if statement of the 3rd patch.
  https://lore.kernel.org/netdev/20200226074631.67688-1-kuniyu@amazon.co.jp/

v1 with tests:
  https://lore.kernel.org/netdev/20200220152020.13056-1-kuniyu@amazon.co.jp/
---

Kuniyuki Iwashima (4):
  tcp: Remove unnecessary conditions in inet_csk_bind_conflict().
  tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports
    are exhausted.
  tcp: Forbid to bind more than one sockets haveing SO_REUSEADDR and
    SO_REUSEPORT per EUID.
  selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully
    utilized.

 net/ipv4/inet_connection_sock.c               |  36 ++--
 tools/testing/selftests/net/Makefile          |   2 +
 .../selftests/net/reuseaddr_ports_exhausted.c | 162 ++++++++++++++++++
 .../net/reuseaddr_ports_exhausted.sh          |  33 ++++
 4 files changed, 221 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseaddr_ports_exhausted.c
 create mode 100755 tools/testing/selftests/net/reuseaddr_ports_exhausted.sh

-- 
2.17.2 (Apple Git-113)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 net-next 1/4] tcp: Remove unnecessary conditions in inet_csk_bind_conflict().
  2020-02-29 11:35 [PATCH v3 net-next 0/4] Improve bind(addr, 0) behaviour Kuniyuki Iwashima
@ 2020-02-29 11:35 ` Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-02-29 11:35 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, edumazet
  Cc: kuniyu, kuni1840, netdev, osa-contribution-log

When we get an ephemeral port, the relax is false, so the SO_REUSEADDR
conditions may be evaluated twice. We do not need to check the conditions
again.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/ipv4/inet_connection_sock.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index a4db79b1b643..2e9549f49a82 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -146,17 +146,15 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
-			if ((!reuse || !sk2->sk_reuse ||
-			    sk2->sk_state == TCP_LISTEN) &&
-			    (!reuseport || !sk2->sk_reuseport ||
-			     rcu_access_pointer(sk->sk_reuseport_cb) ||
-			     (sk2->sk_state != TCP_TIME_WAIT &&
-			     !uid_eq(uid, sock_i_uid(sk2))))) {
-				if (inet_rcv_saddr_equal(sk, sk2, true))
-					break;
-			}
-			if (!relax && reuse && sk2->sk_reuse &&
+			if (reuse && sk2->sk_reuse &&
 			    sk2->sk_state != TCP_LISTEN) {
+				if (!relax &&
+				    inet_rcv_saddr_equal(sk, sk2, true))
+					break;
+			} else if (!reuseport || !sk2->sk_reuseport ||
+				   rcu_access_pointer(sk->sk_reuseport_cb) ||
+				   (sk2->sk_state != TCP_TIME_WAIT &&
+				    !uid_eq(uid, sock_i_uid(sk2)))) {
 				if (inet_rcv_saddr_equal(sk, sk2, true))
 					break;
 			}
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-02-29 11:35 [PATCH v3 net-next 0/4] Improve bind(addr, 0) behaviour Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 1/4] tcp: Remove unnecessary conditions in inet_csk_bind_conflict() Kuniyuki Iwashima
@ 2020-02-29 11:35 ` Kuniyuki Iwashima
  2020-03-02  3:42   ` Eric Dumazet
  2020-02-29 11:35 ` [PATCH v3 net-next 3/4] tcp: Forbid to automatically bind more than one sockets haveing SO_REUSEADDR and SO_REUSEPORT per EUID Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 4/4] selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully utilized Kuniyuki Iwashima
  3 siblings, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-02-29 11:35 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, edumazet
  Cc: kuniyu, kuni1840, netdev, osa-contribution-log

Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.

The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.

You can reproduce this issue by following instructions below.
  1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
  2. set SO_REUSEADDR to two sockets.
  3. bind two sockets to (address, 0) and the latter fails.

Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
it possible to connect to different remote (addr, port) tuples.

This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when all ephemeral ports are exhausted.

The only notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().

Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/ipv4/inet_connection_sock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 2e9549f49a82..cddeab240ea6 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -174,12 +174,14 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 	int port = 0;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
+	bool relax = false;
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
 	int l3mdev;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
+ports_exhausted:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -217,7 +219,7 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 		inet_bind_bucket_for_each(tb, &head->chain)
 			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
 			    tb->port == port) {
-				if (!inet_csk_bind_conflict(sk, tb, false, false))
+				if (!inet_csk_bind_conflict(sk, tb, relax, false))
 					goto success;
 				goto next_port;
 			}
@@ -237,6 +239,12 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 		attempt_half = 2;
 		goto other_half_scan;
 	}
+
+	if (!relax) {
+		/* We still have a chance to connect to different destinations */
+		relax = true;
+		goto ports_exhausted;
+	}
 	return NULL;
 success:
 	*port_ret = port;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 net-next 3/4] tcp: Forbid to automatically bind more than one sockets haveing SO_REUSEADDR and SO_REUSEPORT per EUID.
  2020-02-29 11:35 [PATCH v3 net-next 0/4] Improve bind(addr, 0) behaviour Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 1/4] tcp: Remove unnecessary conditions in inet_csk_bind_conflict() Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted Kuniyuki Iwashima
@ 2020-02-29 11:35 ` Kuniyuki Iwashima
  2020-02-29 11:35 ` [PATCH v3 net-next 4/4] selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully utilized Kuniyuki Iwashima
  3 siblings, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-02-29 11:35 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, edumazet
  Cc: kuniyu, kuni1840, netdev, osa-contribution-log

If there is no TCP_LISTEN socket on a ephemeral port, we can bind multiple
sockets having SO_REUSEADDR to the same port. Then if all sockets bound to
the port have also SO_REUSEPORT enabled and have the same EUID, all of them
can be listened. This is not safe.

Let's say, an application has root privilege and binds sockets to an
ephemeral port with both of SO_REUSEADDR and SO_REUSEPORT. When none of
sockets is not listened yet, other applications can use sudo, exhaust
ephemeral ports, and autobind sockets to the same ephemeral port, then they
may call listen and steal the port unintentionally.

To prevent this issue, when selecting a ephemeral port automatically, we
must not bind more than one sockets that have the same EUID and both of
SO_REUSEADDR and SO_REUSEPORT.

On the other hand, if the sockets have different EUIDs, the issue above does
not occur. After sockets with different EUIDs are bound to the same port and
one of them is listened, no more socket can be listened. This is because the
condition below in the inet_csk_bind_conflict() is evaluated true and
listen() for the second socket fails.

			} else if (!reuseport_ok ||
				   !reuseport || !sk2->sk_reuseport ||
				   rcu_access_pointer(sk->sk_reuseport_cb) ||
				   (sk2->sk_state != TCP_TIME_WAIT &&
				    !uid_eq(uid, sock_i_uid(sk2)))) {
				if (inet_rcv_saddr_equal(sk, sk2, true))
					break;
			}

Therefore, on the same port, we cannot do listen() for multiple sockets with
different EUIDs and any other listen syscalls fail, so the problem does not
happen. In this case, we can still call connect() for other sockets that
cannot be listened, so we have to succeed to call bind() in order to fully
utilize 4-tuples.

Summarizing the above, we should be able to autobind only one socket having
SO_REUSEADDR and SO_REUSEPORT per EUID. Moreover, this is realised by the
negation of the quoted condition above from reuseport to uid_eq.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 net/ipv4/inet_connection_sock.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index cddeab240ea6..d27ed5fe7147 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -131,7 +131,7 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 {
 	struct sock *sk2;
 	bool reuse = sk->sk_reuse;
-	bool reuseport = !!sk->sk_reuseport && reuseport_ok;
+	bool reuseport = !!sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
 
 	/*
@@ -148,10 +148,16 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
 			if (reuse && sk2->sk_reuse &&
 			    sk2->sk_state != TCP_LISTEN) {
-				if (!relax &&
+				if ((!relax ||
+				     (!reuseport_ok &&
+				      reuseport && sk2->sk_reuseport &&
+				      !rcu_access_pointer(sk->sk_reuseport_cb) &&
+				      (sk2->sk_state == TCP_TIME_WAIT ||
+				       uid_eq(uid, sock_i_uid(sk2))))) &&
 				    inet_rcv_saddr_equal(sk, sk2, true))
 					break;
-			} else if (!reuseport || !sk2->sk_reuseport ||
+			} else if (!reuseport_ok ||
+				   !reuseport || !sk2->sk_reuseport ||
 				   rcu_access_pointer(sk->sk_reuseport_cb) ||
 				   (sk2->sk_state != TCP_TIME_WAIT &&
 				    !uid_eq(uid, sock_i_uid(sk2)))) {
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 net-next 4/4] selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully utilized.
  2020-02-29 11:35 [PATCH v3 net-next 0/4] Improve bind(addr, 0) behaviour Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2020-02-29 11:35 ` [PATCH v3 net-next 3/4] tcp: Forbid to automatically bind more than one sockets haveing SO_REUSEADDR and SO_REUSEPORT per EUID Kuniyuki Iwashima
@ 2020-02-29 11:35 ` Kuniyuki Iwashima
  3 siblings, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-02-29 11:35 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, edumazet
  Cc: kuniyu, kuni1840, netdev, osa-contribution-log

This commit adds a test to check if we can fully utilize 4-tuples for
connect() when all ephemeral ports are exhausted.

The test program changes the local port range to use only one port and binds
two sockets with or without SO_REUSEADDR and SO_REUSEPORT, and with the same
EUID or with different EUIDs, then do listen().

We should be able to bind only one socket having both SO_REUSEADDR and
SO_REUSEPORT per EUID, which restriction is to prevent unintentional
listen().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 tools/testing/selftests/net/Makefile          |   2 +
 .../selftests/net/reuseaddr_ports_exhausted.c | 162 ++++++++++++++++++
 .../net/reuseaddr_ports_exhausted.sh          |  33 ++++
 3 files changed, 197 insertions(+)
 create mode 100644 tools/testing/selftests/net/reuseaddr_ports_exhausted.c
 create mode 100755 tools/testing/selftests/net/reuseaddr_ports_exhausted.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index b5694196430a..ded1aa394880 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -12,6 +12,7 @@ TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_a
 TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh ipv6_flowlabel.sh
 TEST_PROGS += tcp_fastopen_backup_key.sh fcnal-test.sh l2tp.sh traceroute.sh
 TEST_PROGS += fin_ack_lat.sh
+TEST_PROGS += reuseaddr_ports_exhausted.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket nettest
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
@@ -22,6 +23,7 @@ TEST_GEN_FILES += tcp_fastopen_backup_key
 TEST_GEN_FILES += fin_ack_lat
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
+TEST_GEN_FILES += reuseaddr_ports_exhausted
 
 KSFT_KHDR_INSTALL := 1
 include ../lib.mk
diff --git a/tools/testing/selftests/net/reuseaddr_ports_exhausted.c b/tools/testing/selftests/net/reuseaddr_ports_exhausted.c
new file mode 100644
index 000000000000..7b01b7c2ec10
--- /dev/null
+++ b/tools/testing/selftests/net/reuseaddr_ports_exhausted.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Check if we can fully utilize 4-tuples for connect().
+ *
+ * Rules to bind sockets to the same port when all ephemeral ports are
+ * exhausted.
+ *
+ *   1. if there are TCP_LISTEN sockets on the port, fail to bind.
+ *   2. if there are sockets without SO_REUSEADDR, fail to bind.
+ *   3. if SO_REUSEADDR is disabled, fail to bind.
+ *   4. if SO_REUSEADDR is enabled and SO_REUSEPORT is disabled,
+ *        succeed to bind.
+ *   5. if SO_REUSEADDR and SO_REUSEPORT are enabled and
+ *        there is no socket having the both options and the same EUID,
+ *        succeed to bind.
+ *   6. fail to bind.
+ *
+ * Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+ */
+#include <arpa/inet.h>
+#include <netinet/in.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include "../kselftest_harness.h"
+
+struct reuse_opts {
+	int reuseaddr[2];
+	int reuseport[2];
+};
+
+struct reuse_opts unreusable_opts[12] = {
+	{0, 0, 0, 0},
+	{0, 0, 0, 1},
+	{0, 0, 1, 0},
+	{0, 0, 1, 1},
+	{0, 1, 0, 0},
+	{0, 1, 0, 1},
+	{0, 1, 1, 0},
+	{0, 1, 1, 1},
+	{1, 0, 0, 0},
+	{1, 0, 0, 1},
+	{1, 0, 1, 0},
+	{1, 0, 1, 1},
+};
+
+struct reuse_opts reusable_opts[4] = {
+	{1, 1, 0, 0},
+	{1, 1, 0, 1},
+	{1, 1, 1, 0},
+	{1, 1, 1, 1},
+};
+
+int bind_port(struct __test_metadata *_metadata, int reuseaddr, int reuseport)
+{
+	struct sockaddr_in local_addr;
+	int len = sizeof(local_addr);
+	int fd, ret;
+
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	ASSERT_NE(-1, fd) TH_LOG("failed to open socket.");
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuseaddr, sizeof(int));
+	ASSERT_EQ(0, ret) TH_LOG("failed to setsockopt: SO_REUSEADDR.");
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &reuseport, sizeof(int));
+	ASSERT_EQ(0, ret) TH_LOG("failed to setsockopt: SO_REUSEPORT.");
+
+	local_addr.sin_family = AF_INET;
+	local_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
+	local_addr.sin_port = 0;
+
+	if (bind(fd, (struct sockaddr *)&local_addr, len) == -1) {
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+TEST(reuseaddr_ports_exhausted_unreusable)
+{
+	struct reuse_opts *opts;
+	int i, j, fd[2];
+
+	for (i = 0; i < 12; i++) {
+		opts = &unreusable_opts[i];
+
+		for (j = 0; j < 2; j++)
+			fd[j] = bind_port(_metadata, opts->reuseaddr[j], opts->reuseport[j]);
+
+		ASSERT_NE(-1, fd[0]) TH_LOG("failed to bind.");
+		EXPECT_EQ(-1, fd[1]) TH_LOG("should fail to bind.");
+
+		for (j = 0; j < 2; j++)
+			if (fd[j] != -1)
+				close(fd[j]);
+	}
+}
+
+TEST(reuseaddr_ports_exhausted_reusable_same_euid)
+{
+	struct reuse_opts *opts;
+	int i, j, fd[2];
+
+	for (i = 0; i < 4; i++) {
+		opts = &reusable_opts[i];
+
+		for (j = 0; j < 2; j++)
+			fd[j] = bind_port(_metadata, opts->reuseaddr[j], opts->reuseport[j]);
+
+		ASSERT_NE(-1, fd[0]) TH_LOG("failed to bind.");
+
+		if (opts->reuseport[0] && opts->reuseport[1]) {
+			EXPECT_EQ(-1, fd[1]) TH_LOG("should fail to bind because both sockets succeed to be listened.");
+		} else {
+			EXPECT_NE(-1, fd[1]) TH_LOG("should succeed to bind to connect to different destinations.");
+		}
+
+		for (j = 0; j < 2; j++)
+			if (fd[j] != -1)
+				close(fd[j]);
+	}
+}
+
+TEST(reuseaddr_ports_exhausted_reusable_different_euid)
+{
+	struct reuse_opts *opts;
+	int i, j, ret, fd[2];
+	uid_t euid[2] = {10, 20};
+
+	for (i = 0; i < 4; i++) {
+		opts = &reusable_opts[i];
+
+		for (j = 0; j < 2; j++) {
+			ret = seteuid(euid[j]);
+			ASSERT_EQ(0, ret) TH_LOG("failed to seteuid: %d.", euid[j]);
+
+			fd[j] = bind_port(_metadata, opts->reuseaddr[j], opts->reuseport[j]);
+
+			ret = seteuid(0);
+			ASSERT_EQ(0, ret) TH_LOG("failed to seteuid: 0.");
+		}
+
+		ASSERT_NE(-1, fd[0]) TH_LOG("failed to bind.");
+		EXPECT_NE(-1, fd[1]) TH_LOG("should succeed to bind because one socket can be bound in each euid.");
+
+		if (fd[1] != -1) {
+			ret = listen(fd[0], 5);
+			ASSERT_EQ(0, ret) TH_LOG("failed to listen.");
+
+			ret = listen(fd[1], 5);
+			EXPECT_EQ(-1, ret) TH_LOG("should fail to listen because only one uid reserves the port in TCP_LISTEN.");
+		}
+
+		for (j = 0; j < 2; j++)
+			if (fd[j] != -1)
+				close(fd[j]);
+	}
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/net/reuseaddr_ports_exhausted.sh b/tools/testing/selftests/net/reuseaddr_ports_exhausted.sh
new file mode 100755
index 000000000000..839269b7975a
--- /dev/null
+++ b/tools/testing/selftests/net/reuseaddr_ports_exhausted.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run tests when all ephemeral ports are exhausted.
+#
+# Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
+
+set +x
+set -e
+
+readonly NETNS="ns-$(mktemp -u XXXXXX)"
+
+setup() {
+	ip netns add "${NETNS}"
+	ip -netns "${NETNS}" link set lo up
+	ip netns exec "${NETNS}" \
+		sysctl -w net.ipv4.ip_local_port_range="32768 32768" \
+		> /dev/null 2>&1
+}
+
+cleanup() {
+	ip netns del "${NETNS}"
+}
+
+trap cleanup EXIT
+setup
+
+do_test() {
+	ip netns exec "${NETNS}" ./reuseaddr_ports_exhausted
+}
+
+do_test
+echo "tests done"
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-02-29 11:35 ` [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted Kuniyuki Iwashima
@ 2020-03-02  3:42   ` Eric Dumazet
  2020-03-02  4:31     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2020-03-02  3:42 UTC (permalink / raw)
  To: Kuniyuki Iwashima, davem, kuznet, yoshfuji, edumazet
  Cc: kuni1840, netdev, osa-contribution-log



On 2/29/20 3:35 AM, Kuniyuki Iwashima wrote:
> Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
> condition for bind_conflict") introduced a restriction to forbid to bind
> SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
> assign ports dispersedly so that we can connect to the same remote host.
> 
> The change results in accelerating port depletion so that we fail to bind
> sockets to the same local port even if we want to connect to the different
> remote hosts.
> 
> You can reproduce this issue by following instructions below.
>   1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
>   2. set SO_REUSEADDR to two sockets.
>   3. bind two sockets to (address, 0) and the latter fails.
> 
> Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
> fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
> it possible to connect to different remote (addr, port) tuples.
> 
> This patch allows us to bind SO_REUSEADDR enabled sockets to the same
> (addr, port) only when all ephemeral ports are exhausted.
> 
> The only notable thing is that if all sockets bound to the same port have
> both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
> ephemeral port and also do listen().
> 
> Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>

I am unsure about this, since this could double the time taken by this
function, which is already very time consuming.

We added years ago IP_BIND_ADDRESS_NO_PORT socket option, so that the kernel
has more choices at connect() time (instead of bind()) time to choose a source port.

This considerably lowers time taken to find an optimal source port, since
the kernel has full information (source address, destination address & port)

       IP_BIND_ADDRESS_NO_PORT (since Linux 4.2)
              Inform the kernel to not reserve an ephemeral port when using
              bind(2) with a port number of 0.  The port will later be auto‐
              matically chosen at connect(2) time, in a way that allows
              sharing a source port as long as the 4-tuple is unique.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-03-02  3:42   ` Eric Dumazet
@ 2020-03-02  4:31     ` Kuniyuki Iwashima
  2020-03-02  4:49       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-03-02  4:31 UTC (permalink / raw)
  To: eric.dumazet
  Cc: davem, edumazet, kuni1840, kuniyu, kuznet, netdev,
	osa-contribution-log, yoshfuji

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Sun, 1 Mar 2020 19:42:25 -0800
> On 2/29/20 3:35 AM, Kuniyuki Iwashima wrote:
> > Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
> > condition for bind_conflict") introduced a restriction to forbid to bind
> > SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
> > assign ports dispersedly so that we can connect to the same remote host.
> > 
> > The change results in accelerating port depletion so that we fail to bind
> > sockets to the same local port even if we want to connect to the different
> > remote hosts.
> > 
> > You can reproduce this issue by following instructions below.
> >   1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
> >   2. set SO_REUSEADDR to two sockets.
> >   3. bind two sockets to (address, 0) and the latter fails.
> > 
> > Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
> > fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
> > it possible to connect to different remote (addr, port) tuples.
> > 
> > This patch allows us to bind SO_REUSEADDR enabled sockets to the same
> > (addr, port) only when all ephemeral ports are exhausted.
> > 
> > The only notable thing is that if all sockets bound to the same port have
> > both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
> > ephemeral port and also do listen().
> > 
> > Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> 
> I am unsure about this, since this could double the time taken by this
> function, which is already very time consuming.

This patch doubles the time on choosing a port only when all ephemeral ports
are exhausted, and this fallback behaviour can eventually decreases the time
on waiting for ports to be released. We cannot know when the ports are
released, so we may not be able to reuse ports without this patch. This
patch gives more chace and raises the probability to succeed to bind().

> We added years ago IP_BIND_ADDRESS_NO_PORT socket option, so that the kernel
> has more choices at connect() time (instead of bind()) time to choose a source port.
>
> This considerably lowers time taken to find an optimal source port, since
> the kernel has full information (source address, destination address & port)

I also think this option is usefull, but it does not allow us to reuse
ports that is reserved by bind(). This is because connect() can reuse ports
only when their tb->fastresue and tb->fastreuseport is -1. So we still
cannot fully utilize 4-tuples.

https://lore.kernel.org/netdev/20200221100155.76241-1-kuniyu@amazon.co.jp/#t

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-03-02  4:31     ` Kuniyuki Iwashima
@ 2020-03-02  4:49       ` Eric Dumazet
  2020-03-03 17:53         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2020-03-02  4:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima, eric.dumazet
  Cc: davem, edumazet, kuni1840, kuznet, netdev, osa-contribution-log,
	yoshfuji



On 3/1/20 8:31 PM, Kuniyuki Iwashima wrote:
> From:   Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Sun, 1 Mar 2020 19:42:25 -0800
>> On 2/29/20 3:35 AM, Kuniyuki Iwashima wrote:
>>> Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
>>> condition for bind_conflict") introduced a restriction to forbid to bind
>>> SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
>>> assign ports dispersedly so that we can connect to the same remote host.
>>>
>>> The change results in accelerating port depletion so that we fail to bind
>>> sockets to the same local port even if we want to connect to the different
>>> remote hosts.
>>>
>>> You can reproduce this issue by following instructions below.
>>>   1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
>>>   2. set SO_REUSEADDR to two sockets.
>>>   3. bind two sockets to (address, 0) and the latter fails.
>>>
>>> Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
>>> fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
>>> it possible to connect to different remote (addr, port) tuples.
>>>
>>> This patch allows us to bind SO_REUSEADDR enabled sockets to the same
>>> (addr, port) only when all ephemeral ports are exhausted.
>>>
>>> The only notable thing is that if all sockets bound to the same port have
>>> both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
>>> ephemeral port and also do listen().
>>>
>>> Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")
>>>
>>> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
>>
>> I am unsure about this, since this could double the time taken by this
>> function, which is already very time consuming.
> 
> This patch doubles the time on choosing a port only when all ephemeral ports
> are exhausted, and this fallback behaviour can eventually decreases the time
> on waiting for ports to be released. We cannot know when the ports are
> released, so we may not be able to reuse ports without this patch. This
> patch gives more chace and raises the probability to succeed to bind().
> 
>> We added years ago IP_BIND_ADDRESS_NO_PORT socket option, so that the kernel
>> has more choices at connect() time (instead of bind()) time to choose a source port.
>>
>> This considerably lowers time taken to find an optimal source port, since
>> the kernel has full information (source address, destination address & port)
> 
> I also think this option is usefull, but it does not allow us to reuse
> ports that is reserved by bind(). This is because connect() can reuse ports
> only when their tb->fastresue and tb->fastreuseport is -1. So we still
> cannot fully utilize 4-tuples.

The thing is : We do not want to allow active connections to use a source port
that is used for passive connections.

Many supervisions use dump commands like "ss -t src :40000" to list all connections
for a 'server' listening on port 40000,
or use ethtool to direct all traffic for this port on a particular RSS queue.

Some firewall setups also would need to be changed, since suddenly the port could
be used by unrelated applications.

Note that sharing ports space has been a long standing problem, this is why
we use one netns per job, so that we no longer worry about applications being
hungry.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-03-02  4:49       ` Eric Dumazet
@ 2020-03-03 17:53         ` Kuniyuki Iwashima
  2020-03-03 18:16           ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-03-03 17:53 UTC (permalink / raw)
  To: eric.dumazet
  Cc: davem, edumazet, kuni1840, kuniyu, kuznet, netdev,
	osa-contribution-log, yoshfuji

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Sun, 1 Mar 2020 20:49:49 -0800
> On 3/1/20 8:31 PM, Kuniyuki Iwashima wrote:
>> From:   Eric Dumazet <eric.dumazet@gmail.com>
>> Date:   Sun, 1 Mar 2020 19:42:25 -0800
>>> On 2/29/20 3:35 AM, Kuniyuki Iwashima wrote:
>>>> Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
>>>> condition for bind_conflict") introduced a restriction to forbid to bind
>>>> SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
>>>> assign ports dispersedly so that we can connect to the same remote host.
>>>>
>>>> The change results in accelerating port depletion so that we fail to bind
>>>> sockets to the same local port even if we want to connect to the different
>>>> remote hosts.
>>>>
>>>> You can reproduce this issue by following instructions below.
>>>>   1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
>>>>   2. set SO_REUSEADDR to two sockets.
>>>>   3. bind two sockets to (address, 0) and the latter fails.
>>>>
>>>> Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
>>>> fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
>>>> it possible to connect to different remote (addr, port) tuples.
>>>>
>>>> This patch allows us to bind SO_REUSEADDR enabled sockets to the same
>>>> (addr, port) only when all ephemeral ports are exhausted.
>>>>
>>>> The only notable thing is that if all sockets bound to the same port have
>>>> both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
>>>> ephemeral port and also do listen().
>>>>
>>>> Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")
>>>>
>>>> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
>>>
>>> I am unsure about this, since this could double the time taken by this
>>> function, which is already very time consuming.
>> 
>> This patch doubles the time on choosing a port only when all ephemeral ports
>> are exhausted, and this fallback behaviour can eventually decreases the time
>> on waiting for ports to be released. We cannot know when the ports are
>> released, so we may not be able to reuse ports without this patch. This
>> patch gives more chace and raises the probability to succeed to bind().
>> 
>>> We added years ago IP_BIND_ADDRESS_NO_PORT socket option, so that the kernel
>>> has more choices at connect() time (instead of bind()) time to choose a source port.
>>>
>>> This considerably lowers time taken to find an optimal source port, since
>>> the kernel has full information (source address, destination address & port)
>> 
>> I also think this option is usefull, but it does not allow us to reuse
>> ports that is reserved by bind(). This is because connect() can reuse ports
>> only when their tb->fastresue and tb->fastreuseport is -1. So we still
>> cannot fully utilize 4-tuples.
> 
> The thing is : We do not want to allow active connections to use a source port
> that is used for passive connections.

When calling bind(addr, 0) without these patches, this problem does not
occur. Certainly these patches could make it possible to do bind(addr, 0)
and listen() on the port which is already reserved by connect(). However,
whether these patches are applied or not, this problem can be occurred by
calling bind with specifying the port.


> Many supervisions use dump commands like "ss -t src :40000" to list all connections
> for a 'server' listening on port 40000,
> or use ethtool to direct all traffic for this port on a particular RSS queue.
> 
> Some firewall setups also would need to be changed, since suddenly the port could
> be used by unrelated applications.

I think these are on promise that the server application specifies the port
and we know which port is used in advance. Moreover the port nerver be used
by unrelated application suddenly. When connect() and listen() share the
port, listen() is always called after connect().


I would like to think about two sockets (sk1 and sk2) in three cases.

1st case: sk1 is in TCP_LISTEN.
In this case, sk2 cannot get the port and my patches does not change the behaviour.

2nd case: sk1 is in TCP_ESTABLISHED and call bind(addr, 40000) for sk2.
In this case, sk2 can get the port by specifying the port, so listen() of
sk2 can share the port with connect() of sk1. This is because reuseport_ok
is true, but my patches add changes only for when reuseport_ok is false.
Therefore, whether my patches are applied or not, this problem can happen.

3rd case: sk1 is in TCP_ESTABLISHED and call bind(addr, 0) for sk2.
In this case, sk2 come to be able to get the port with my patches if both
sockets have SO_REUSEADDR enabled.
So, listen() also can share the port with connect().

However, I do not think this can be problem for two reasons:
  - the same problem already can happen in 2nd case, and the possibility of
    this case is lower than 2nd case because 3rd case is when the ports are exhausted.
  - I am unsure that there is supervisions that monitor the server
    applications which randomly select the ephemeral ports to listen on.

Although this may be a stupid question, is there famous server software
that do bind(addr, 0) and listen() ?


Hence, I think these patches are safe.
What would you think about this?

But also I think this depends on use cases. So I would like to ask for another opinion.
If we can enable/disable the fallback behaviour by sysctl or proc,
it is more useful because some people can fully utilize 4-tuples.
How would you think about this option?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-03-03 17:53         ` Kuniyuki Iwashima
@ 2020-03-03 18:16           ` Eric Dumazet
  2020-03-08 16:45             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2020-03-03 18:16 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Eric Dumazet, David Miller, kuni1840, Alexey Kuznetsov, netdev,
	osa-contribution-log, Hideaki YOSHIFUJI

On Tue, Mar 3, 2020 at 9:53 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>
> From:   Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Sun, 1 Mar 2020 20:49:49 -0800
> > On 3/1/20 8:31 PM, Kuniyuki Iwashima wrote:
> >> From:   Eric Dumazet <eric.dumazet@gmail.com>
> >> Date:   Sun, 1 Mar 2020 19:42:25 -0800
> >>> On 2/29/20 3:35 AM, Kuniyuki Iwashima wrote:
> >>>> Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
> >>>> condition for bind_conflict") introduced a restriction to forbid to bind
> >>>> SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
> >>>> assign ports dispersedly so that we can connect to the same remote host.
> >>>>
> >>>> The change results in accelerating port depletion so that we fail to bind
> >>>> sockets to the same local port even if we want to connect to the different
> >>>> remote hosts.
> >>>>
> >>>> You can reproduce this issue by following instructions below.
> >>>>   1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
> >>>>   2. set SO_REUSEADDR to two sockets.
> >>>>   3. bind two sockets to (address, 0) and the latter fails.
> >>>>
> >>>> Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
> >>>> fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
> >>>> it possible to connect to different remote (addr, port) tuples.
> >>>>
> >>>> This patch allows us to bind SO_REUSEADDR enabled sockets to the same
> >>>> (addr, port) only when all ephemeral ports are exhausted.
> >>>>
> >>>> The only notable thing is that if all sockets bound to the same port have
> >>>> both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
> >>>> ephemeral port and also do listen().
> >>>>
> >>>> Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")
> >>>>
> >>>> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> >>>
> >>> I am unsure about this, since this could double the time taken by this
> >>> function, which is already very time consuming.
> >>
> >> This patch doubles the time on choosing a port only when all ephemeral ports
> >> are exhausted, and this fallback behaviour can eventually decreases the time
> >> on waiting for ports to be released. We cannot know when the ports are
> >> released, so we may not be able to reuse ports without this patch. This
> >> patch gives more chace and raises the probability to succeed to bind().
> >>
> >>> We added years ago IP_BIND_ADDRESS_NO_PORT socket option, so that the kernel
> >>> has more choices at connect() time (instead of bind()) time to choose a source port.
> >>>
> >>> This considerably lowers time taken to find an optimal source port, since
> >>> the kernel has full information (source address, destination address & port)
> >>
> >> I also think this option is usefull, but it does not allow us to reuse
> >> ports that is reserved by bind(). This is because connect() can reuse ports
> >> only when their tb->fastresue and tb->fastreuseport is -1. So we still
> >> cannot fully utilize 4-tuples.
> >
> > The thing is : We do not want to allow active connections to use a source port
> > that is used for passive connections.
>
> When calling bind(addr, 0) without these patches, this problem does not
> occur. Certainly these patches could make it possible to do bind(addr, 0)
> and listen() on the port which is already reserved by connect(). However,
> whether these patches are applied or not, this problem can be occurred by
> calling bind with specifying the port.
>
>
> > Many supervisions use dump commands like "ss -t src :40000" to list all connections
> > for a 'server' listening on port 40000,
> > or use ethtool to direct all traffic for this port on a particular RSS queue.
> >
> > Some firewall setups also would need to be changed, since suddenly the port could
> > be used by unrelated applications.
>
> I think these are on promise that the server application specifies the port
> and we know which port is used in advance. Moreover the port nerver be used
> by unrelated application suddenly. When connect() and listen() share the
> port, listen() is always called after connect().
>
>
> I would like to think about two sockets (sk1 and sk2) in three cases.
>
> 1st case: sk1 is in TCP_LISTEN.
> In this case, sk2 cannot get the port and my patches does not change the behaviour.

Before being in TCP_LISTEN, it had to bind() on a sport.

Then _after_ reserving an exclusive sport, it can install whatever tc
/ iptables rule to implement additional security.

Before calling listen(), you do not want another socket being able to
use the same sport.

There is no atomic bind()+listen()  or bind()+install_firewalling_rules+listen()

This is why after bind(), the kernel has to guarantee the chosen sport
wont be used by other sockets.

Breaking this rule has a lot of implications.

>
> 2nd case: sk1 is in TCP_ESTABLISHED and call bind(addr, 40000) for sk2.
> In this case, sk2 can get the port by specifying the port, so listen() of
> sk2 can share the port with connect() of sk1. This is because reuseport_ok
> is true, but my patches add changes only for when reuseport_ok is false.
> Therefore, whether my patches are applied or not, this problem can happen.
>
> 3rd case: sk1 is in TCP_ESTABLISHED and call bind(addr, 0) for sk2.
> In this case, sk2 come to be able to get the port with my patches if both
> sockets have SO_REUSEADDR enabled.
> So, listen() also can share the port with connect().
>
> However, I do not think this can be problem for two reasons:
>   - the same problem already can happen in 2nd case, and the possibility of
>     this case is lower than 2nd case because 3rd case is when the ports are exhausted.
>   - I am unsure that there is supervisions that monitor the server
>     applications which randomly select the ephemeral ports to listen on.
>
> Although this may be a stupid question, is there famous server software
> that do bind(addr, 0) and listen() ?

I do not know, there are thousands of solutions using TCP, I can not
make sure they won't break.
It would take years to verify this.

>
>
> Hence, I think these patches are safe.

They are not.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
  2020-03-03 18:16           ` Eric Dumazet
@ 2020-03-08 16:45             ` Kuniyuki Iwashima
  0 siblings, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2020-03-08 16:45 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, kuni1840, kuniyu, kuznet, netdev,
	osa-contribution-log, yoshfuji

From:   Eric Dumazet <edumazet@google.com>
Date:   Tue, 3 Mar 2020 10:16:56 -0800
> On Tue, Mar 3, 2020 at 9:53 AM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > From:   Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Sun, 1 Mar 2020 20:49:49 -0800
> > > On 3/1/20 8:31 PM, Kuniyuki Iwashima wrote:
> > >> From:   Eric Dumazet <eric.dumazet@gmail.com>
> > >> Date:   Sun, 1 Mar 2020 19:42:25 -0800
> > >>> On 2/29/20 3:35 AM, Kuniyuki Iwashima wrote:
> > >>>> Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
> > >>>> condition for bind_conflict") introduced a restriction to forbid to bind
> > >>>> SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
> > >>>> assign ports dispersedly so that we can connect to the same remote host.
> > >>>>
> > >>>> The change results in accelerating port depletion so that we fail to bind
> > >>>> sockets to the same local port even if we want to connect to the different
> > >>>> remote hosts.
> > >>>>
> > >>>> You can reproduce this issue by following instructions below.
> > >>>>   1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
> > >>>>   2. set SO_REUSEADDR to two sockets.
> > >>>>   3. bind two sockets to (address, 0) and the latter fails.
> > >>>>
> > >>>> Therefore, when ephemeral ports are exhausted, bind(addr, 0) should
> > >>>> fallback to the legacy behaviour to enable the SO_REUSEADDR option and make
> > >>>> it possible to connect to different remote (addr, port) tuples.
> > >>>>
> > >>>> This patch allows us to bind SO_REUSEADDR enabled sockets to the same
> > >>>> (addr, port) only when all ephemeral ports are exhausted.
> > >>>>
> > >>>> The only notable thing is that if all sockets bound to the same port have
> > >>>> both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
> > >>>> ephemeral port and also do listen().
> > >>>>
> > >>>> Fixes: aacd9289af8b ("tcp: bind() use stronger condition for bind_conflict")
> > >>>>
> > >>>> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > >>>
> > >>> I am unsure about this, since this could double the time taken by this
> > >>> function, which is already very time consuming.
> > >>
> > >> This patch doubles the time on choosing a port only when all ephemeral ports
> > >> are exhausted, and this fallback behaviour can eventually decreases the time
> > >> on waiting for ports to be released. We cannot know when the ports are
> > >> released, so we may not be able to reuse ports without this patch. This
> > >> patch gives more chace and raises the probability to succeed to bind().
> > >>
> > >>> We added years ago IP_BIND_ADDRESS_NO_PORT socket option, so that the kernel
> > >>> has more choices at connect() time (instead of bind()) time to choose a source port.
> > >>>
> > >>> This considerably lowers time taken to find an optimal source port, since
> > >>> the kernel has full information (source address, destination address & port)
> > >>
> > >> I also think this option is usefull, but it does not allow us to reuse
> > >> ports that is reserved by bind(). This is because connect() can reuse ports
> > >> only when their tb->fastresue and tb->fastreuseport is -1. So we still
> > >> cannot fully utilize 4-tuples.
> > >
> > > The thing is : We do not want to allow active connections to use a source port
> > > that is used for passive connections.
> >
> > When calling bind(addr, 0) without these patches, this problem does not
> > occur. Certainly these patches could make it possible to do bind(addr, 0)
> > and listen() on the port which is already reserved by connect(). However,
> > whether these patches are applied or not, this problem can be occurred by
> > calling bind with specifying the port.
> >
> >
> > > Many supervisions use dump commands like "ss -t src :40000" to list all connections
> > > for a 'server' listening on port 40000,
> > > or use ethtool to direct all traffic for this port on a particular RSS queue.
> > >
> > > Some firewall setups also would need to be changed, since suddenly the port could
> > > be used by unrelated applications.
> >
> > I think these are on promise that the server application specifies the port
> > and we know which port is used in advance. Moreover the port nerver be used
> > by unrelated application suddenly. When connect() and listen() share the
> > port, listen() is always called after connect().
> >
> >
> > I would like to think about two sockets (sk1 and sk2) in three cases.
> >
> > 1st case: sk1 is in TCP_LISTEN.
> > In this case, sk2 cannot get the port and my patches does not change the behaviour.
> 
> Before being in TCP_LISTEN, it had to bind() on a sport.
> 
> Then _after_ reserving an exclusive sport, it can install whatever tc
> / iptables rule to implement additional security.
> 
> Before calling listen(), you do not want another socket being able to
> use the same sport.
> 
> There is no atomic bind()+listen()  or bind()+install_firewalling_rules+listen()
> 
> This is why after bind(), the kernel has to guarantee the chosen sport
> wont be used by other sockets.
> 
> Breaking this rule has a lot of implications.

I agree with this, but the current kernel can break the rule in the 2nd case,
and it rarely happens.


> > 2nd case: sk1 is in TCP_ESTABLISHED and call bind(addr, 40000) for sk2.
> > In this case, sk2 can get the port by specifying the port, so listen() of
> > sk2 can share the port with connect() of sk1. This is because reuseport_ok
> > is true, but my patches add changes only for when reuseport_ok is false.
> > Therefore, whether my patches are applied or not, this problem can happen.
> >
> > 3rd case: sk1 is in TCP_ESTABLISHED and call bind(addr, 0) for sk2.
> > In this case, sk2 come to be able to get the port with my patches if both
> > sockets have SO_REUSEADDR enabled.
> > So, listen() also can share the port with connect().
> >
> > However, I do not think this can be problem for two reasons:
> >   - the same problem already can happen in 2nd case, and the possibility of
> >     this case is lower than 2nd case because 3rd case is when the ports are exhausted.
> >   - I am unsure that there is supervisions that monitor the server
> >     applications which randomly select the ephemeral ports to listen on.
> >
> > Although this may be a stupid question, is there famous server software
> > that do bind(addr, 0) and listen() ?
> 
> I do not know, there are thousands of solutions using TCP, I can not
> make sure they won't break.
> It would take years to verify this.

I am sorry, I also cannot guarantee that these patches do not always break
the rule, so I should not have changed the current behaviour. I think this
should be verified by each user and we should give the chance to do it.


> > Hence, I think these patches are safe.
> 
> They are not.

I wanted to say that these patches are as (un)safe as the latest kernel.
Certainly they are not always safe. However if there are no such
applications, these patches can be safe and useful.

I will respin these again with a sysctl option in order to switch the
autobind behaviour.

Thank you.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-03-08 16:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-29 11:35 [PATCH v3 net-next 0/4] Improve bind(addr, 0) behaviour Kuniyuki Iwashima
2020-02-29 11:35 ` [PATCH v3 net-next 1/4] tcp: Remove unnecessary conditions in inet_csk_bind_conflict() Kuniyuki Iwashima
2020-02-29 11:35 ` [PATCH v3 net-next 2/4] tcp: bind(addr, 0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted Kuniyuki Iwashima
2020-03-02  3:42   ` Eric Dumazet
2020-03-02  4:31     ` Kuniyuki Iwashima
2020-03-02  4:49       ` Eric Dumazet
2020-03-03 17:53         ` Kuniyuki Iwashima
2020-03-03 18:16           ` Eric Dumazet
2020-03-08 16:45             ` Kuniyuki Iwashima
2020-02-29 11:35 ` [PATCH v3 net-next 3/4] tcp: Forbid to automatically bind more than one sockets haveing SO_REUSEADDR and SO_REUSEPORT per EUID Kuniyuki Iwashima
2020-02-29 11:35 ` [PATCH v3 net-next 4/4] selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully utilized Kuniyuki Iwashima

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.