* [PATCH net-next v5 0/2] Add IP_LOCAL_PORT_RANGE socket option
@ 2023-01-24 10:05 Jakub Sitnicki
  2023-01-24 10:05 ` [PATCH net-next v5 1/2] inet: " Jakub Sitnicki
  2023-01-24 10:05 ` [PATCH net-next v5 2/2] selftests/net: Cover the " Jakub Sitnicki
  0 siblings, 2 replies; 7+ messages in thread
From: Jakub Sitnicki @ 2023-01-24 10:05 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, Neal Cardwell, Leon Romanovsky, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].

Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.

Documentation
-------------

Proposed update to the ip(7) man-page:

       IP_LOCAL_PORT_RANGE (since Linux X.Y)
              Set or get the per-socket default local  port  range.  This
              option  can  be  used  to  clamp down the global local port
              range, defined by the ip_local_port_range  /proc  interface
              described below, for a given socket.

              The  option  takes  a  uint32_t value with the high 16 bits
              set to the upper range bound, and the low 16  bits  set  to
              the  lower  range  bound.  Range  bounds are inclusive. The
              16-bit values should be in host byte order.

              The lower bound must not be greater than  the  upper  bound
              when  both  bounds  are  non-zero.  Otherwise,  setting the
              option fails with EINVAL.

              If either bound is outside of the global local port  range,
              or is zero, then that bound has no effect.

              To  reset  the setting, pass zero as both the upper and the
              lower bound.
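
The encoding described above can be sketched in Python. This is an
illustration only and not part of the series: pack_port_range() and
unpack_port_range() mirror the selftest helpers in patch #2, while
is_valid_port_range() and sockopt_value() are hypothetical names. The
validity check follows the setsockopt code in patch #1, which rejects a
range only when both bounds are non-zero and lo > hi.

```python
import struct

IP_LOCAL_PORT_RANGE = 51  # value added to include/uapi/linux/in.h by patch #1


def pack_port_range(lo, hi):
    """Pack bounds into a u32: high 16 bits = upper bound, low 16 bits = lower bound."""
    if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
        raise ValueError("port bounds must fit in 16 bits")
    return (hi << 16) | lo


def unpack_port_range(value):
    """Return (lo, hi) from a packed u32."""
    return value & 0xFFFF, value >> 16


def is_valid_port_range(lo, hi):
    """Model of the EINVAL rule: reject only if both bounds are non-zero and lo > hi."""
    return not (lo != 0 and hi != 0 and lo > hi)


def sockopt_value(lo, hi):
    """Bytes suitable as the option value (native byte order, 4 bytes)."""
    return struct.pack("I", pack_port_range(lo, hi))
```

On a kernel with this series applied, the bytes returned by
sockopt_value() would be passed to setsockopt() at the SOL_IP level with
IP_LOCAL_PORT_RANGE as the option name.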

Interaction with SELinux bind() hook
------------------------------------

SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.

The proposed socket option cannot be used to change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.

Hence, there is no interaction with SELinux, AFAICT.

Changelog:
----------

v4 -> v5:
v4: https://lore.kernel.org/r/20221221-sockopt-port-range-v4-0-d7d2f2561238@cloudflare.com

 * Code changes called out in individual patches.

v3 -> v4:
v3: https://lore.kernel.org/r/20221221-sockopt-port-range-v3-0-36fa5f5996f4@cloudflare.com

 * Highlight that port bounds should be in host byte order. (Neal)

v2 -> v3:
v2: https://lore.kernel.org/r/20221221-sockopt-port-range-v2-0-1d5f114bf627@cloudflare.com

 * Describe interaction considerations with SELinux.
 * Code changes called out in individual patches.

v1 -> v2:
v1: https://lore.kernel.org/netdev/20221221-sockopt-port-range-v1-0-e2b094b60ffd@cloudflare.com/

 * Fix the corner case when the per-socket range doesn't overlap with the
   per-netns range. Fallback correctly to the per-netns range. (Kuniyuki)

 * selftests: Instead of iterating over socket families (ip4, ip6) and types
   (tcp, udp), generate tests for each combo from a template. This keeps the
   code indentation level down and makes tests more granular.

 * Rewrite man-page prose:
   - explain how to unset the option,
   - document when EINVAL is returned.

RFC -> v1:
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/

 * Allow either the high bound or the low bound, or both, to be zero
 * Add getsockopt support
 * Add selftests

Links:
------

[1]: https://lpc.events/event/16/contributions/1349/

To: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: selinux@vger.kernel.org
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: kernel-team@cloudflare.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

---
Jakub Sitnicki (2):
      inet: Add IP_LOCAL_PORT_RANGE socket option
      selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option

 include/net/inet_sock.h                            |   4 +
 include/net/ip.h                                   |   3 +-
 include/uapi/linux/in.h                            |   1 +
 net/ipv4/inet_connection_sock.c                    |  25 +-
 net/ipv4/inet_hashtables.c                         |   2 +-
 net/ipv4/ip_sockglue.c                             |  18 +
 net/ipv4/udp.c                                     |   2 +-
 net/sctp/socket.c                                  |   2 +-
 tools/testing/selftests/net/Makefile               |   2 +
 tools/testing/selftests/net/ip_local_port_range.c  | 447 +++++++++++++++++++++
 tools/testing/selftests/net/ip_local_port_range.sh |   5 +
 11 files changed, 505 insertions(+), 6 deletions(-)


* [PATCH net-next v5 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
  2023-01-24 10:05 [PATCH net-next v5 0/2] Add IP_LOCAL_PORT_RANGE socket option Jakub Sitnicki
@ 2023-01-24 10:05 ` Jakub Sitnicki
  2023-01-24 12:23   ` Leon Romanovsky
  2023-01-24 10:05 ` [PATCH net-next v5 2/2] selftests/net: Cover the " Jakub Sitnicki
  1 sibling, 1 reply; 7+ messages in thread
From: Jakub Sitnicki @ 2023-01-24 10:05 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, Neal Cardwell, Leon Romanovsky, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP
   and the destination port.
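
The setup above can be sketched in Python as a rough model. The function
names partition_ports() and host_for_port() are hypothetical and chosen
for illustration; how ranges are actually assigned and how return traffic
is steered is deployment-specific and not defined by this patch.

```python
def partition_ports(lo, hi, n_hosts):
    """Split the inclusive range [lo, hi] into n_hosts disjoint, contiguous slices."""
    total = hi - lo + 1
    if n_hosts <= 0 or total < n_hosts:
        raise ValueError("need at least one port per host")
    size, extra = divmod(total, n_hosts)
    ranges = []
    start = lo
    for i in range(n_hosts):
        # The first `extra` hosts get one additional port each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def host_for_port(ranges, port):
    """Stateless return-path lookup: index of the host owning `port`, or None."""
    for idx, (start, end) in enumerate(ranges):
        if start <= port <= end:
            return idx
    return None
```

Because the slices are disjoint, the destination port of a return packet
is enough to pick the host, without any per-flow state.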

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting whether the 4-tuple is busy can be either easy (TCP) or
      hard (UDP). In the TCP case, the application simply has to check
      whether connect() returned an error (EADDRNOTAVAIL). This assumes
      that local port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection are done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

   For UDP, this approach has limited applicability. Setting the
   IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source
   port being shared with other connected UDP sockets.

   Hence, relying on the network stack to find a free source port limits the
   number of outgoing UDP flows from a single IP address down to the number
   of available ephemeral ports.
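
For comparison, the user-space port search that approach (1) requires for
TCP can be sketched as below. This is a minimal illustration, not code
from the patch: connect_from_range() is a hypothetical helper, it scans
the range sequentially rather than from a random offset, and it treats
both EADDRINUSE (from bind) and EADDRNOTAVAIL (from connect) as "try the
next port".

```python
import errno
import socket


def connect_from_range(dst, src_ip, port_lo, port_hi):
    """Try each local port in [port_lo, port_hi] until bind() + connect() succeed."""
    for port in range(port_lo, port_hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Allow the local port to be shared between connected sockets.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((src_ip, port))
            s.connect(dst)
            return s
        except OSError as e:
            s.close()
            # Port or 4-tuple is busy: move on to the next candidate.
            if e.errno not in (errno.EADDRINUSE, errno.EADDRNOTAVAIL):
                raise
    raise OSError(errno.EADDRNOTAVAIL, "no free 4-tuple in the requested range")
```

Every application that needs a custom port range has to carry a loop like
this one, which is the burden the new socket option aims to remove.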

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
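
The clamping rule can be modeled in a few lines of Python. This is a
sketch that follows the logic of inet_sk_get_local_port_range() from this
patch; the function name effective_port_range() is hypothetical, and the
netns range 40000-49999 used in the usage note is the one assumed by the
selftests.

```python
def effective_port_range(netns_lo, netns_hi, sk_lo, sk_hi):
    """Return the (lo, hi) range used for the free port search.

    Each per-socket bound is applied only if it falls within the current
    range; a bound of zero or one outside the netns range is ignored.
    """
    lo, hi = netns_lo, netns_hi
    if lo <= sk_lo <= hi:
        lo = sk_lo
    # Note: the upper bound is checked against the already-updated lo.
    if lo <= sk_hi <= hi:
        hi = sk_hi
    return lo, hi
```

With a netns range of 40000-49999, a per-socket range of 40100-40199 is
used as-is, 30000-39999 is ignored entirely, and a per-socket setting of
lo=0, hi=40000 narrows the search to the single port 40000.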

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

v4 -> v5:
 * Use the fact that netns port range starts at 1 when clamping. (Kuniyuki)

v3 -> v4:
 * Clarify that u16 values are in host byte order (Neal)

v2 -> v3:
 * Make SCTP bind()/bind_add() respect IP_LOCAL_PORT_RANGE option (Eric)

v1 -> v2:
 * Fix the corner case when the per-socket range doesn't overlap with the
   per-netns range. Fallback correctly to the per-netns range. (Kuniyuki)

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_sock.h         |  4 ++++
 include/net/ip.h                |  3 ++-
 include/uapi/linux/in.h         |  1 +
 net/ipv4/inet_connection_sock.c | 25 +++++++++++++++++++++++--
 net/ipv4/inet_hashtables.c      |  2 +-
 net/ipv4/ip_sockglue.c          | 18 ++++++++++++++++++
 net/ipv4/udp.c                  |  2 +-
 net/sctp/socket.c               |  2 +-
 8 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index bf5654ce711e..51857117ac09 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -249,6 +249,10 @@ struct inet_sock {
 	__be32			mc_addr;
 	struct ip_mc_socklist __rcu	*mc_list;
 	struct inet_cork_full	cork;
+	struct {
+		__u16 lo;
+		__u16 hi;
+	}			local_port_range;
 };
 
 #define IPCORK_OPT	1	/* ip-options has been held in ipcork.opt */
diff --git a/include/net/ip.h b/include/net/ip.h
index 144bdfbb25af..c3fffaa92d6e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -340,7 +340,8 @@ static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_o
 	} \
 }
 
-void inet_get_local_port_range(struct net *net, int *low, int *high);
+void inet_get_local_port_range(const struct net *net, int *low, int *high);
+void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high);
 
 #ifdef CONFIG_SYSCTL
 static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port)
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index 07a4cb149305..4b7f2df66b99 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -162,6 +162,7 @@ struct in_addr {
 #define MCAST_MSFILTER			48
 #define IP_MULTICAST_ALL		49
 #define IP_UNICAST_IF			50
+#define IP_LOCAL_PORT_RANGE		51
 
 #define MCAST_EXCLUDE	0
 #define MCAST_INCLUDE	1
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d1f837579398..6ed7e65de494 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -117,7 +117,7 @@ bool inet_rcv_saddr_any(const struct sock *sk)
 	return !sk->sk_rcv_saddr;
 }
 
-void inet_get_local_port_range(struct net *net, int *low, int *high)
+void inet_get_local_port_range(const struct net *net, int *low, int *high)
 {
 	unsigned int seq;
 
@@ -130,6 +130,27 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
+void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high)
+{
+	const struct inet_sock *inet = inet_sk(sk);
+	const struct net *net = sock_net(sk);
+	int lo, hi, sk_lo, sk_hi;
+
+	inet_get_local_port_range(net, &lo, &hi);
+
+	sk_lo = inet->local_port_range.lo;
+	sk_hi = inet->local_port_range.hi;
+
+	if (unlikely(lo <= sk_lo && sk_lo <= hi))
+		lo = sk_lo;
+	if (unlikely(lo <= sk_hi && sk_hi <= hi))
+		hi = sk_hi;
+
+	*low = lo;
+	*high = hi;
+}
+EXPORT_SYMBOL(inet_sk_get_local_port_range);
+
 static bool inet_use_bhash2_on_bind(const struct sock *sk)
 {
 #if IS_ENABLED(CONFIG_IPV6)
@@ -316,7 +337,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 ports_exhausted:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
-	inet_get_local_port_range(net, &low, &high);
+	inet_sk_get_local_port_range(sk, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	if (high - low < 4)
 		attempt_half = 0;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 7a13dd7f546b..e41fdc38ce19 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1016,7 +1016,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
-	inet_get_local_port_range(net, &low, &high);
+	inet_sk_get_local_port_range(sk, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
 	if (likely(remaining > 1))
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 9f92ae35bb01..b511ff0adc0a 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -923,6 +923,7 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname,
 	case IP_CHECKSUM:
 	case IP_RECVFRAGSIZE:
 	case IP_RECVERR_RFC4884:
+	case IP_LOCAL_PORT_RANGE:
 		if (optlen >= sizeof(int)) {
 			if (copy_from_sockptr(&val, optval, sizeof(val)))
 				return -EFAULT;
@@ -1365,6 +1366,20 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname,
 		WRITE_ONCE(inet->min_ttl, val);
 		break;
 
+	case IP_LOCAL_PORT_RANGE:
+	{
+		const __u16 lo = val;
+		const __u16 hi = val >> 16;
+
+		if (optlen != sizeof(__u32))
+			goto e_inval;
+		if (lo != 0 && hi != 0 && lo > hi)
+			goto e_inval;
+
+		inet->local_port_range.lo = lo;
+		inet->local_port_range.hi = hi;
+		break;
+	}
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -1743,6 +1758,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_MINTTL:
 		val = inet->min_ttl;
 		break;
+	case IP_LOCAL_PORT_RANGE:
+		val = inet->local_port_range.hi << 16 | inet->local_port_range.lo;
+		break;
 	default:
 		sockopt_release_sock(sk);
 		return -ENOPROTOOPT;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9592fe3e444a..c605d171eb2d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -248,7 +248,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		int low, high, remaining;
 		unsigned int rand;
 
-		inet_get_local_port_range(net, &low, &high);
+		inet_sk_get_local_port_range(sk, &low, &high);
 		remaining = (high - low) + 1;
 
 		rand = get_random_u32();
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index a98511b676cd..b91616f819de 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -8322,7 +8322,7 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 		int low, high, remaining, index;
 		unsigned int rover;
 
-		inet_get_local_port_range(net, &low, &high);
+		inet_sk_get_local_port_range(sk, &low, &high);
 		remaining = (high - low) + 1;
 		rover = get_random_u32_below(remaining) + low;
 

-- 
2.39.0


* [PATCH net-next v5 2/2] selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option
  2023-01-24 10:05 [PATCH net-next v5 0/2] Add IP_LOCAL_PORT_RANGE socket option Jakub Sitnicki
  2023-01-24 10:05 ` [PATCH net-next v5 1/2] inet: " Jakub Sitnicki
@ 2023-01-24 10:05 ` Jakub Sitnicki
  1 sibling, 0 replies; 7+ messages in thread
From: Jakub Sitnicki @ 2023-01-24 10:05 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, Neal Cardwell, Leon Romanovsky, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:

1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range

v2 -> v3:
 * Switch from CPP-based templates to FIXTURE_VARIANT. (Kuniyuki)
 * Cover SOCK_STREAM/IPPROTO_SCTP where possible.

v1 -> v2:
 * selftests: Instead of iterating over socket families (ip4, ip6) and
   types (tcp, udp), generate tests for each combo from a template. This
   keeps the code indentation level down and makes tests more granular.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/net/Makefile               |   2 +
 tools/testing/selftests/net/ip_local_port_range.c  | 447 +++++++++++++++++++++
 tools/testing/selftests/net/ip_local_port_range.sh |   5 +
 3 files changed, 454 insertions(+)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 47314f0b3006..951bd5342bc6 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -45,6 +45,7 @@ TEST_PROGS += arp_ndisc_untracked_subnets.sh
 TEST_PROGS += stress_reuseport_listen.sh
 TEST_PROGS += l2_tos_ttl_inherit.sh
 TEST_PROGS += bind_bhash.sh
+TEST_PROGS += ip_local_port_range.sh
 TEST_PROGS_EXTENDED := in_netns.sh setup_loopback.sh setup_veth.sh
 TEST_PROGS_EXTENDED += toeplitz_client.sh toeplitz.sh
 TEST_GEN_FILES =  socket nettest
@@ -76,6 +77,7 @@ TEST_PROGS += sctp_vrf.sh
 TEST_GEN_FILES += sctp_hello
 TEST_GEN_FILES += csum
 TEST_GEN_FILES += nat6to4.o
+TEST_GEN_FILES += ip_local_port_range
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/ip_local_port_range.c b/tools/testing/selftests/net/ip_local_port_range.c
new file mode 100644
index 000000000000..75e3fdacdf73
--- /dev/null
+++ b/tools/testing/selftests/net/ip_local_port_range.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2023 Cloudflare
+
+/* Test IP_LOCAL_PORT_RANGE socket option: IPv4 + IPv6, TCP + UDP.
+ *
+ * Tests assume that net.ipv4.ip_local_port_range is [40000, 49999].
+ * Don't run these directly but with ip_local_port_range.sh script.
+ */
+
+#include <fcntl.h>
+#include <netinet/ip.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef IP_LOCAL_PORT_RANGE
+#define IP_LOCAL_PORT_RANGE 51
+#endif
+
+static __u32 pack_port_range(__u16 lo, __u16 hi)
+{
+	return (hi << 16) | (lo << 0);
+}
+
+static void unpack_port_range(__u32 range, __u16 *lo, __u16 *hi)
+{
+	*lo = range & 0xffff;
+	*hi = range >> 16;
+}
+
+static int get_so_domain(int fd)
+{
+	int domain, err;
+	socklen_t len;
+
+	len = sizeof(domain);
+	err = getsockopt(fd, SOL_SOCKET, SO_DOMAIN, &domain, &len);
+	if (err)
+		return -1;
+
+	return domain;
+}
+
+static int bind_to_loopback_any_port(int fd)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_in v4;
+		struct sockaddr_in6 v6;
+	} addr;
+	socklen_t addr_len;
+
+	memset(&addr, 0, sizeof(addr));
+	switch (get_so_domain(fd)) {
+	case AF_INET:
+		addr.v4.sin_family = AF_INET;
+		addr.v4.sin_port = htons(0);
+		addr.v4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+		addr_len = sizeof(addr.v4);
+		break;
+	case AF_INET6:
+		addr.v6.sin6_family = AF_INET6;
+		addr.v6.sin6_port = htons(0);
+		addr.v6.sin6_addr = in6addr_loopback;
+		addr_len = sizeof(addr.v6);
+		break;
+	default:
+		return -1;
+	}
+
+	return bind(fd, &addr.sa, addr_len);
+}
+
+static int get_sock_port(int fd)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_in v4;
+		struct sockaddr_in6 v6;
+	} addr;
+	socklen_t addr_len;
+	int err;
+
+	addr_len = sizeof(addr);
+	memset(&addr, 0, sizeof(addr));
+	err = getsockname(fd, &addr.sa, &addr_len);
+	if (err)
+		return -1;
+
+	switch (addr.sa.sa_family) {
+	case AF_INET:
+		return ntohs(addr.v4.sin_port);
+	case AF_INET6:
+		return ntohs(addr.v6.sin6_port);
+	default:
+		errno = EAFNOSUPPORT;
+		return -1;
+	}
+}
+
+static int get_ip_local_port_range(int fd, __u32 *range)
+{
+	socklen_t len;
+	__u32 val;
+	int err;
+
+	len = sizeof(val);
+	err = getsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val, &len);
+	if (err)
+		return -1;
+
+	*range = val;
+	return 0;
+}
+
+FIXTURE(ip_local_port_range) {};
+
+FIXTURE_SETUP(ip_local_port_range)
+{
+}
+
+FIXTURE_TEARDOWN(ip_local_port_range)
+{
+}
+
+FIXTURE_VARIANT(ip_local_port_range) {
+	int so_domain;
+	int so_type;
+	int so_protocol;
+};
+
+FIXTURE_VARIANT_ADD(ip_local_port_range, ip4_tcp) {
+	.so_domain	= AF_INET,
+	.so_type	= SOCK_STREAM,
+	.so_protocol	= 0,
+};
+
+FIXTURE_VARIANT_ADD(ip_local_port_range, ip4_udp) {
+	.so_domain	= AF_INET,
+	.so_type	= SOCK_DGRAM,
+	.so_protocol	= 0,
+};
+
+FIXTURE_VARIANT_ADD(ip_local_port_range, ip4_stcp) {
+	.so_domain	= AF_INET,
+	.so_type	= SOCK_STREAM,
+	.so_protocol	= IPPROTO_SCTP,
+};
+
+FIXTURE_VARIANT_ADD(ip_local_port_range, ip6_tcp) {
+	.so_domain	= AF_INET6,
+	.so_type	= SOCK_STREAM,
+	.so_protocol	= 0,
+};
+
+FIXTURE_VARIANT_ADD(ip_local_port_range, ip6_udp) {
+	.so_domain	= AF_INET6,
+	.so_type	= SOCK_DGRAM,
+	.so_protocol	= 0,
+};
+
+FIXTURE_VARIANT_ADD(ip_local_port_range, ip6_stcp) {
+	.so_domain	= AF_INET6,
+	.so_type	= SOCK_STREAM,
+	.so_protocol	= IPPROTO_SCTP,
+};
+
+TEST_F(ip_local_port_range, invalid_option_value)
+{
+	__u16 val16;
+	__u32 val32;
+	__u64 val64;
+	int fd, err;
+
+	fd = socket(variant->so_domain, variant->so_type, variant->so_protocol);
+	ASSERT_GE(fd, 0) TH_LOG("socket failed");
+
+	/* Too few bytes */
+	val16 = 40000;
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val16, sizeof(val16));
+	EXPECT_TRUE(err) TH_LOG("expected setsockopt(IP_LOCAL_PORT_RANGE) to fail");
+	EXPECT_EQ(errno, EINVAL);
+
+	/* Empty range: low port > high port */
+	val32 = pack_port_range(40222, 40111);
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val32, sizeof(val32));
+	EXPECT_TRUE(err) TH_LOG("expected setsockopt(IP_LOCAL_PORT_RANGE) to fail");
+	EXPECT_EQ(errno, EINVAL);
+
+	/* Too many bytes */
+	val64 = pack_port_range(40333, 40444);
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val64, sizeof(val64));
+	EXPECT_TRUE(err) TH_LOG("expected setsockopt(IP_LOCAL_PORT_RANGE) to fail");
+	EXPECT_EQ(errno, EINVAL);
+
+	err = close(fd);
+	ASSERT_TRUE(!err) TH_LOG("close failed");
+}
+
+TEST_F(ip_local_port_range, port_range_out_of_netns_range)
+{
+	const struct test {
+		__u16 range_lo;
+		__u16 range_hi;
+	} tests[] = {
+		{ 30000, 39999 }, /* socket range below netns range */
+		{ 50000, 59999 }, /* socket range above netns range */
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		/* Bind a couple of sockets, not just one, to check
+		 * that the range wasn't clamped to a single port from
+		 * the netns range. That is [40000, 40000] or [49999,
+		 * 49999], respectively for each test case.
+		 */
+		int fds[2], i;
+
+		TH_LOG("lo %5hu, hi %5hu", t->range_lo, t->range_hi);
+
+		for (i = 0; i < ARRAY_SIZE(fds); i++) {
+			int fd, err, port;
+			__u32 range;
+
+			fd = socket(variant->so_domain, variant->so_type, variant->so_protocol);
+			ASSERT_GE(fd, 0) TH_LOG("#%d: socket failed", i);
+
+			range = pack_port_range(t->range_lo, t->range_hi);
+			err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+			ASSERT_TRUE(!err) TH_LOG("#%d: setsockopt(IP_LOCAL_PORT_RANGE) failed", i);
+
+			err = bind_to_loopback_any_port(fd);
+			ASSERT_TRUE(!err) TH_LOG("#%d: bind failed", i);
+
+			/* Check that socket port range outside of ephemeral range is ignored */
+			port = get_sock_port(fd);
+			ASSERT_GE(port, 40000) TH_LOG("#%d: expected port within netns range", i);
+			ASSERT_LE(port, 49999) TH_LOG("#%d: expected port within netns range", i);
+
+			fds[i] = fd;
+		}
+
+		for (i = 0; i < ARRAY_SIZE(fds); i++)
+			ASSERT_TRUE(close(fds[i]) == 0) TH_LOG("#%d: close failed", i);
+	}
+}
+
+TEST_F(ip_local_port_range, single_port_range)
+{
+	const struct test {
+		__u16 range_lo;
+		__u16 range_hi;
+		__u16 expected;
+	} tests[] = {
+		/* single port range within ephemeral range */
+		{ 45000, 45000, 45000 },
+		/* first port in the ephemeral range (clamp from above) */
+		{ 0, 40000, 40000 },
+		/* last port in the ephemeral range (clamp from below)  */
+		{ 49999, 0, 49999 },
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		int fd, err, port;
+		__u32 range;
+
+		TH_LOG("lo %5hu, hi %5hu, expected %5hu",
+		       t->range_lo, t->range_hi, t->expected);
+
+		fd = socket(variant->so_domain, variant->so_type, variant->so_protocol);
+		ASSERT_GE(fd, 0) TH_LOG("socket failed");
+
+		range = pack_port_range(t->range_lo, t->range_hi);
+		err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+		ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+		err = bind_to_loopback_any_port(fd);
+		ASSERT_TRUE(!err) TH_LOG("bind failed");
+
+		port = get_sock_port(fd);
+		ASSERT_EQ(port, t->expected) TH_LOG("unexpected local port");
+
+		err = close(fd);
+		ASSERT_TRUE(!err) TH_LOG("close failed");
+	}
+}
+
+TEST_F(ip_local_port_range, exhaust_8_port_range)
+{
+	__u8 port_set = 0;
+	int i, fd, err;
+	__u32 range;
+	__u16 port;
+	int fds[8];
+
+	for (i = 0; i < ARRAY_SIZE(fds); i++) {
+		fd = socket(variant->so_domain, variant->so_type, variant->so_protocol);
+		ASSERT_GE(fd, 0) TH_LOG("socket failed");
+
+		range = pack_port_range(40000, 40007);
+		err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+		ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+		err = bind_to_loopback_any_port(fd);
+		ASSERT_TRUE(!err) TH_LOG("bind failed");
+
+		port = get_sock_port(fd);
+		ASSERT_GE(port, 40000) TH_LOG("expected port within sockopt range");
+		ASSERT_LE(port, 40007) TH_LOG("expected port within sockopt range");
+
+		port_set |= 1 << (port - 40000);
+		fds[i] = fd;
+	}
+
+	/* Check that every port from the test range is in use */
+	ASSERT_EQ(port_set, 0xff) TH_LOG("expected all ports to be busy");
+
+	/* Check that bind() fails because the whole range is busy */
+	fd = socket(variant->so_domain, variant->so_type, variant->so_protocol);
+	ASSERT_GE(fd, 0) TH_LOG("socket failed");
+
+	range = pack_port_range(40000, 40007);
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+	ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	err = bind_to_loopback_any_port(fd);
+	ASSERT_TRUE(err) TH_LOG("expected bind to fail");
+	ASSERT_EQ(errno, EADDRINUSE);
+
+	err = close(fd);
+	ASSERT_TRUE(!err) TH_LOG("close failed");
+
+	for (i = 0; i < ARRAY_SIZE(fds); i++) {
+		err = close(fds[i]);
+		ASSERT_TRUE(!err) TH_LOG("close failed");
+	}
+}
+
+TEST_F(ip_local_port_range, late_bind)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_in v4;
+		struct sockaddr_in6 v6;
+	} addr;
+	socklen_t addr_len;
+	const int one = 1;
+	int fd, err;
+	__u32 range;
+	__u16 port;
+
+	if (variant->so_protocol == IPPROTO_SCTP)
+		SKIP(return, "SCTP doesn't support IP_BIND_ADDRESS_NO_PORT");
+
+	fd = socket(variant->so_domain, variant->so_type, 0);
+	ASSERT_GE(fd, 0) TH_LOG("socket failed");
+
+	range = pack_port_range(40100, 40199);
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+	ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	err = setsockopt(fd, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));
+	ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_BIND_ADDRESS_NO_PORT) failed");
+
+	err = bind_to_loopback_any_port(fd);
+	ASSERT_TRUE(!err) TH_LOG("bind failed");
+
+	port = get_sock_port(fd);
+	ASSERT_EQ(port, 0) TH_LOG("getsockname failed");
+
+	/* Invalid destination */
+	memset(&addr, 0, sizeof(addr));
+	switch (variant->so_domain) {
+	case AF_INET:
+		addr.v4.sin_family = AF_INET;
+		addr.v4.sin_port = htons(0);
+		addr.v4.sin_addr.s_addr = htonl(INADDR_ANY);
+		addr_len = sizeof(addr.v4);
+		break;
+	case AF_INET6:
+		addr.v6.sin6_family = AF_INET6;
+		addr.v6.sin6_port = htons(0);
+		addr.v6.sin6_addr = in6addr_any;
+		addr_len = sizeof(addr.v6);
+		break;
+	default:
+		ASSERT_TRUE(false) TH_LOG("unsupported socket domain");
+	}
+
+	/* connect() doesn't need to succeed for late bind to happen */
+	connect(fd, &addr.sa, addr_len);
+
+	port = get_sock_port(fd);
+	ASSERT_GE(port, 40100);
+	ASSERT_LE(port, 40199);
+
+	err = close(fd);
+	ASSERT_TRUE(!err) TH_LOG("close failed");
+}
+
+TEST_F(ip_local_port_range, get_port_range)
+{
+	__u16 lo, hi;
+	__u32 range;
+	int fd, err;
+
+	fd = socket(variant->so_domain, variant->so_type, variant->so_protocol);
+	ASSERT_GE(fd, 0) TH_LOG("socket failed");
+
+	/* Get range before it will be set */
+	err = get_ip_local_port_range(fd, &range);
+	ASSERT_TRUE(!err) TH_LOG("getsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	unpack_port_range(range, &lo, &hi);
+	ASSERT_EQ(lo, 0) TH_LOG("unexpected low port");
+	ASSERT_EQ(hi, 0) TH_LOG("unexpected high port");
+
+	range = pack_port_range(12345, 54321);
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+	ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	/* Get range after it has been set */
+	err = get_ip_local_port_range(fd, &range);
+	ASSERT_TRUE(!err) TH_LOG("getsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	unpack_port_range(range, &lo, &hi);
+	ASSERT_EQ(lo, 12345) TH_LOG("unexpected low port");
+	ASSERT_EQ(hi, 54321) TH_LOG("unexpected high port");
+
+	/* Unset the port range  */
+	range = pack_port_range(0, 0);
+	err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+	ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	/* Get range after it has been unset */
+	err = get_ip_local_port_range(fd, &range);
+	ASSERT_TRUE(!err) TH_LOG("getsockopt(IP_LOCAL_PORT_RANGE) failed");
+
+	unpack_port_range(range, &lo, &hi);
+	ASSERT_EQ(lo, 0) TH_LOG("unexpected low port");
+	ASSERT_EQ(hi, 0) TH_LOG("unexpected high port");
+
+	err = close(fd);
+	ASSERT_TRUE(!err) TH_LOG("close failed");
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/net/ip_local_port_range.sh b/tools/testing/selftests/net/ip_local_port_range.sh
new file mode 100755
index 000000000000..6c6ad346eaa0
--- /dev/null
+++ b/tools/testing/selftests/net/ip_local_port_range.sh
@@ -0,0 +1,5 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+./in_netns.sh \
+  sh -c 'sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && ./ip_local_port_range'

-- 
2.39.0


* Re: [PATCH net-next v5 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
  2023-01-24 10:05 ` [PATCH net-next v5 1/2] inet: " Jakub Sitnicki
@ 2023-01-24 12:23   ` Leon Romanovsky
  2023-01-24 12:33     ` Jakub Sitnicki
  0 siblings, 1 reply; 7+ messages in thread
From: Leon Romanovsky @ 2023-01-24 12:23 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Kuniyuki Iwashima, Neal Cardwell, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

On Tue, Jan 24, 2023 at 11:05:26AM +0100, Jakub Sitnicki wrote:
> Users who want to share a single public IP address for outgoing connections
> between several hosts traditionally reach for SNAT. However, SNAT requires
> state keeping on the node(s) performing the NAT.
> 
> A stateless alternative exists, where a single IP address used for egress
> can be shared between several hosts by partitioning the available ephemeral
> port range. In such a setup:
> 
> 1. Each host gets assigned a disjoint range of ephemeral ports.
> 2. Applications open connections from the host-assigned port range.
> 3. Return traffic gets routed to the host based on both the destination IP
>    and the destination port.
> 
> An application which wants to open an outgoing connection (connect) from a
> given port range today can choose between two solutions:
> 
> 1. Manually pick the source port by bind()'ing to it before connect()'ing
>    the socket.
> 
>    This approach has a couple of downsides:
> 
>    a) The search for a free port has to be implemented in user space. If
>       the chosen 4-tuple happens to be busy, the application needs to retry
>       from a different local port number.
> 
>       Detecting whether a 4-tuple is busy can be either easy (TCP) or hard
>       (UDP). In the TCP case, the application simply has to check if connect()
>       returned an error (EADDRNOTAVAIL). That is assuming that local
>       port sharing was enabled (REUSEADDR) by all the sockets.
> 
>         # Assume desired local port range is 60_000-60_511
>         s = socket(AF_INET, SOCK_STREAM)
>         s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>         s.bind(("192.0.2.1", 60_000))
>         s.connect(("1.1.1.1", 53))
>         # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
>         # Application must retry with another local port
> 
>       In case of UDP, the network stack allows binding more than one socket
>       to the same 4-tuple, when local port sharing is enabled
>       (REUSEADDR). Hence detecting the conflict is much harder and involves
>       querying sock_diag and toggling the REUSEADDR flag [1].
> 
>    b) For TCP, bind()-ing to a port within the ephemeral port range means
>       that no connecting sockets, that is those which leave it to the
>       network stack to find a free local port at connect() time, can use
>       this port.
> 
>       IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
>       will be skipped during the free port search at connect() time.
> 
> 2. Isolate the app in a dedicated netns and use the per-netns
>    ip_local_port_range sysctl to adjust the ephemeral port range bounds.
> 
>    The per-netns setting affects all sockets, so this approach can be used
>    only if:
> 
>    - there is just one egress IP address, or
>    - the desired egress port range is the same for all egress IP addresses
>      used by the application.
> 
>    For TCP, this approach avoids the downsides of (1). Free port search and
>    4-tuple conflict detection are done by the network stack:
> 
>      system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
> 
>      s = socket(AF_INET, SOCK_STREAM)
>      s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
>      s.bind(("192.0.2.1", 0))
>      s.connect(("1.1.1.1", 53))
>      # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
> 
>    For UDP this approach has limited applicability. Setting the
>    IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source
>    port being shared with other connected UDP sockets.
> 
>    Hence relying on the network stack to find a free source port limits the
>    number of outgoing UDP flows from a single IP address down to the number
>    of available ephemeral ports.
> 
> To put it another way, partitioning the ephemeral port range between hosts
> using the existing Linux networking API is cumbersome.
> 
> To address this use case, add a new socket option at the SOL_IP level,
> named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
> ephemeral port range for each socket individually.
> 
> The option can be used only to narrow down the per-netns local port
> range. If the per-socket range lies outside of the per-netns range, the
> latter takes precedence.
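
To illustrate the clamping rule quoted above, here is a minimal Python
sketch. It is not the kernel code: the function name effective_port_range
and its parameters are made up for this example. It assumes that each
per-socket bound is applied on its own, and only when it falls inside the
per-netns range, so a zero bound is ignored because the netns range starts
at 1.

```python
def effective_port_range(netns_lo, netns_hi, sk_lo, sk_hi):
    """Model of how a per-socket range narrows the per-netns range.

    Hypothetical helper for illustration only. A per-socket bound that is
    zero, or that lies outside [netns_lo, netns_hi], has no effect.
    """
    lo, hi = netns_lo, netns_hi
    # Apply the lower bound only if it is inside the netns range.
    if netns_lo <= sk_lo <= netns_hi:
        lo = sk_lo
    # Apply the upper bound only if it is inside the netns range.
    if netns_lo <= sk_hi <= netns_hi:
        hi = sk_hi
    return lo, hi


if __name__ == "__main__":
    # netns range 40000-49999, as set up by the selftest wrapper script
    print(effective_port_range(40000, 49999, 40100, 40199))
    print(effective_port_range(40000, 49999, 0, 0))
    print(effective_port_range(40000, 49999, 30000, 40199))
```

With this model, a per-socket range that does not overlap the per-netns
range at all leaves the per-netns range untouched, which matches the
"latter takes precedence" wording.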
> 
> UAPI-wise, the low and high range bounds are passed to the kernel as a pair
> of u16 values in host byte order packed into a u32. This avoids pointer
> passing.
> 
>   PORT_LO = 40_000
>   PORT_HI = 40_511
> 
>   s = socket(AF_INET, SOCK_STREAM)
>   v = struct.pack("I", PORT_HI << 16 | PORT_LO)
>   s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
>   s.bind(("127.0.0.1", 0))
>   s.getsockname()
>   # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
>   # if there is a free port. EADDRINUSE otherwise.
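
The u32 encoding described above can be sketched as a pair of helpers.
This is only an illustration under stated assumptions: the helper names are
invented here (they mirror pack_port_range/unpack_port_range from the
selftest), and the numeric fallback 51 for IP_LOCAL_PORT_RANGE is assumed
to match include/uapi/linux/in.h for the case where Python's socket module
does not export the constant.

```python
import socket
import struct

# Assumption: 51 is the option number added to include/uapi/linux/in.h.
IP_LOCAL_PORT_RANGE = getattr(socket, "IP_LOCAL_PORT_RANGE", 51)


def pack_port_range(lo, hi):
    """Pack two 16-bit bounds into one u32: high 16 bits = hi, low = lo."""
    for bound in (lo, hi):
        if not 0 <= bound <= 0xFFFF:
            raise ValueError("port bound out of 16-bit range: %r" % bound)
    return (hi << 16) | lo


def unpack_port_range(value):
    """Inverse of pack_port_range(); returns (lo, hi)."""
    return value & 0xFFFF, value >> 16


def set_local_port_range(sock, lo, hi):
    """Apply the range to a socket; needs a kernel with this option."""
    # "I" packs a native-endian unsigned 32-bit int, i.e. host byte order.
    value = struct.pack("I", pack_port_range(lo, hi))
    sock.setsockopt(socket.SOL_IP, IP_LOCAL_PORT_RANGE, value)


if __name__ == "__main__":
    packed = pack_port_range(40_000, 40_511)
    print(hex(packed), unpack_port_range(packed))
```

Passing (0, 0) to set_local_port_range() would reset the per-socket
setting, as the selftest's "unset" step does.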
> 
> [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
> 
> v4 -> v5:
>  * Use the fact that netns port range starts at 1 when clamping. (Kuniyuki)
> 
> v3 -> v4:
>  * Clarify that u16 values are in host byte order (Neal)
> 
> v2 -> v3:
>  * Make SCTP bind()/bind_add() respect IP_LOCAL_PORT_RANGE option (Eric)
> 
> v1 -> v2:
>  * Fix the corner case when the per-socket range doesn't overlap with the
>    per-netns range. Fallback correctly to the per-netns range. (Kuniyuki)

You silently ignored my review comment.
Let's repeat it again. Please put changelog after --- marker. Changelog
doesn't belong to commit message.

Thanks

> 
> Reviewed-by: Marek Majkowski <marek@cloudflare.com>
> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  include/net/inet_sock.h         |  4 ++++
>  include/net/ip.h                |  3 ++-
>  include/uapi/linux/in.h         |  1 +
>  net/ipv4/inet_connection_sock.c | 25 +++++++++++++++++++++++--
>  net/ipv4/inet_hashtables.c      |  2 +-
>  net/ipv4/ip_sockglue.c          | 18 ++++++++++++++++++
>  net/ipv4/udp.c                  |  2 +-
>  net/sctp/socket.c               |  2 +-
>  8 files changed, 51 insertions(+), 6 deletions(-)


* Re: [PATCH net-next v5 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
  2023-01-24 12:23   ` Leon Romanovsky
@ 2023-01-24 12:33     ` Jakub Sitnicki
  2023-01-24 13:04       ` Leon Romanovsky
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Sitnicki @ 2023-01-24 12:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Kuniyuki Iwashima, Neal Cardwell, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

On Tue, Jan 24, 2023 at 02:23 PM +02, Leon Romanovsky wrote:
> On Tue, Jan 24, 2023 at 11:05:26AM +0100, Jakub Sitnicki wrote:
>> [...]
>
> You silently ignored my review comment.
> Let's repeat it again. Please put changelog after --- marker. Changelog
> doesn't belong to commit message.

I did not. I'm under the impression that you might have missed my follow-up
question about whether the changelog-above-trailer convention is still in
place [1] and the clarification from Jakub K. [2].

I'm happy to adjust the changelog in whichever way that will make
everyone content. However, ATM we don't have one, it seems.

[1] https://lore.kernel.org/all/87sfg1vuqj.fsf@cloudflare.com/
[2] https://lore.kernel.org/all/20230123193526.065a9879@kernel.org/ 


* Re: [PATCH net-next v5 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
  2023-01-24 12:33     ` Jakub Sitnicki
@ 2023-01-24 13:04       ` Leon Romanovsky
  2023-01-24 13:32         ` Jakub Sitnicki
  0 siblings, 1 reply; 7+ messages in thread
From: Leon Romanovsky @ 2023-01-24 13:04 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Kuniyuki Iwashima, Neal Cardwell, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

On Tue, Jan 24, 2023 at 01:33:07PM +0100, Jakub Sitnicki wrote:
> On Tue, Jan 24, 2023 at 02:23 PM +02, Leon Romanovsky wrote:
> > On Tue, Jan 24, 2023 at 11:05:26AM +0100, Jakub Sitnicki wrote:
> >> [...]
> >
> > You silently ignored my review comment.
> > Let's repeat it again. Please put changelog after --- marker. Changelog
> > doesn't belong to commit message.
> 
> I did not. I'm under the impression that you might have missed my follow
> up question if the changelog-above-trailer convention is still in place
> [1] and the clarification from Jakub K. [2].

Yes, I missed it.

There is no value in seeing the changelog in the git log history, as only
one version of the patches is applied in the end.

Users don't care how many revisions you (or any developer) sent till the
patches were accepted.

> 
> I'm happy to adjust the changelog in whichever way that will make
> everyone content. However, ATM we don't have one, it seems.

We have, just some of us don't care. It doesn't mean they prefer to see
changelog before ---.

> 
> [1] https://lore.kernel.org/all/87sfg1vuqj.fsf@cloudflare.com/
> [2] https://lore.kernel.org/all/20230123193526.065a9879@kernel.org/ 


* Re: [PATCH net-next v5 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
  2023-01-24 13:04       ` Leon Romanovsky
@ 2023-01-24 13:32         ` Jakub Sitnicki
  0 siblings, 0 replies; 7+ messages in thread
From: Jakub Sitnicki @ 2023-01-24 13:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Kuniyuki Iwashima, Neal Cardwell, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, kernel-team,
	Marek Majkowski

On Tue, Jan 24, 2023 at 03:04 PM +02, Leon Romanovsky wrote:
> On Tue, Jan 24, 2023 at 01:33:07PM +0100, Jakub Sitnicki wrote:

[...]

>> I'm happy to adjust the changelog in whichever way that will make
>> everyone content. However, ATM we don't have one, it seems.
>
> We have, just some of us don't care. It doesn't mean they prefer to see
> changelog before ---.

OK. Thanks for clarifying. Will send v6 with changelog moved.
