All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port
@ 2016-12-20 20:06 Josef Bacik
  2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Josef Bacik @ 2016-12-20 20:06 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

At some point recently the guys working on our load balancer added the ability
to use SO_REUSEPORT.  When they restarted their app with this option enabled
they immediately hit a softlockup on what appeared to be the
inet_bind_bucket->lock.  Eventually what all of our debugging and discussion led
us to was the fact that the application comes up without SO_REUSEPORT, shuts
down which creates around 100k twsk's, and then comes up and tries to open a
bunch of sockets using SO_REUSEPORT, which meant traversing the inet_bind_bucket
owners list under the lock.  Since this lock is needed for dealing with the
twsk's and basically anything else related to connections we would softlockup,
and sometimes not ever recover.

To solve this problem I did what you see in Path 5/5.  Once we have a
SO_REUSEPORT socket on the tb->owners list we know that the socket has no
conflicts with any of the other sockets on that list.  So we can add a copy of
the sock_common (really all we need is the recv_saddr but it seemed ugly to copy
just the ipv6, ipv4, and flag to indicate if we were ipv6 only in there so I've
copied the whole common) in order to check subsequent SO_REUSEPORT sockets.  If
they match the previous one then we can skip the expensive
inet_csk_bind_conflict check.  This is what eliminated the soft lockup that we
were seeing.

Patches 1-4 are cleanups and re-workings.  For instance when we specify port ==
0 we need to find an open port, but we would do two passes through
inet_csk_bind_conflict every time we found a possible port.  We would also keep
track of the smallest_port value in order to try and use it if we found no
port our first run through.  This however made no sense as it would have had to
fail the first pass through inet_csk_bind_conflict, so would not actually pass
the second pass through either.  Finally I split the function into two functions
in order to make it easier to read and to distinguish between the two behaviors.

I have tested this on one of our load balancing boxes during peak traffic and it
hasn't fallen over.  But this is not my area, so obviously feel free to point
out where I'm being stupid and I'll get it fixed up and retested.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-20 20:06 [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port Josef Bacik
@ 2016-12-20 20:07 ` Josef Bacik
  2016-12-21 15:06   ` Hannes Frederic Sowa
                     ` (2 more replies)
  2016-12-20 20:07 ` [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 18+ messages in thread
From: Josef Bacik @ 2016-12-20 20:07 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
is how they check the rcv_saddr.  Since we want to be able to check the saddr in
other places just drop the protocol specific ->bind_conflict and replace it with
->rcv_saddr_equal, then make inet_csk_bind_conflict the one true bind conflict
function.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/net/inet6_connection_sock.h |  5 -----
 include/net/inet_connection_sock.h  |  9 +++------
 net/dccp/ipv4.c                     |  3 ++-
 net/dccp/ipv6.c                     |  2 +-
 net/ipv4/inet_connection_sock.c     | 22 +++++++-------------
 net/ipv4/tcp_ipv4.c                 |  3 ++-
 net/ipv4/udp.c                      |  1 +
 net/ipv6/inet6_connection_sock.c    | 40 -------------------------------------
 net/ipv6/tcp_ipv6.c                 |  4 ++--
 9 files changed, 18 insertions(+), 71 deletions(-)

diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
index 3212b39..8ec87b6 100644
--- a/include/net/inet6_connection_sock.h
+++ b/include/net/inet6_connection_sock.h
@@ -15,16 +15,11 @@
 
 #include <linux/types.h>
 
-struct inet_bind_bucket;
 struct request_sock;
 struct sk_buff;
 struct sock;
 struct sockaddr;
 
-int inet6_csk_bind_conflict(const struct sock *sk,
-			    const struct inet_bind_bucket *tb, bool relax,
-			    bool soreuseport_ok);
-
 struct dst_entry *inet6_csk_route_req(const struct sock *sk, struct flowi6 *fl6,
 				      const struct request_sock *req, u8 proto);
 
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index ec0479a..9cd43c5 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -62,9 +62,9 @@ struct inet_connection_sock_af_ops {
 				char __user *optval, int __user *optlen);
 #endif
 	void	    (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
-	int	    (*bind_conflict)(const struct sock *sk,
-				     const struct inet_bind_bucket *tb,
-				     bool relax, bool soreuseport_ok);
+	int         (*rcv_saddr_equal)(const struct sock *sk1,
+				       const struct sock *sk2,
+				       bool match_wildcard);
 	void	    (*mtu_reduced)(struct sock *sk);
 };
 
@@ -261,9 +261,6 @@ inet_csk_rto_backoff(const struct inet_connection_sock *icsk,
 
 struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
 
-int inet_csk_bind_conflict(const struct sock *sk,
-			   const struct inet_bind_bucket *tb, bool relax,
-			   bool soreuseport_ok);
 int inet_csk_get_port(struct sock *sk, unsigned short snum);
 
 struct dst_entry *inet_csk_route_req(const struct sock *sk, struct flowi4 *fl4,
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c67a96..1931324 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <linux/random.h>
 
+#include <net/addrconf.h>
 #include <net/icmp.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
@@ -901,7 +902,7 @@ static const struct inet_connection_sock_af_ops dccp_ipv4_af_ops = {
 	.getsockopt	   = ip_getsockopt,
 	.addr2sockaddr	   = inet_csk_addr2sockaddr,
 	.sockaddr_len	   = sizeof(struct sockaddr_in),
-	.bind_conflict	   = inet_csk_bind_conflict,
+	.rcv_saddr_equal   = ipv4_rcv_saddr_equal,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_ip_setsockopt,
 	.compat_getsockopt = compat_ip_getsockopt,
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 4663a01..45242b8 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -926,7 +926,7 @@ static const struct inet_connection_sock_af_ops dccp_ipv6_af_ops = {
 	.getsockopt	   = ipv6_getsockopt,
 	.addr2sockaddr	   = inet6_csk_addr2sockaddr,
 	.sockaddr_len	   = sizeof(struct sockaddr_in6),
-	.bind_conflict	   = inet6_csk_bind_conflict,
+	.rcv_saddr_equal   = ipv6_rcv_saddr_equal,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_ipv6_setsockopt,
 	.compat_getsockopt = compat_ipv6_getsockopt,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 5f44fa1..74f6a57 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -44,9 +44,9 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
-int inet_csk_bind_conflict(const struct sock *sk,
-			   const struct inet_bind_bucket *tb, bool relax,
-			   bool reuseport_ok)
+static int inet_csk_bind_conflict(const struct sock *sk,
+				  const struct inet_bind_bucket *tb,
+				  bool relax, bool reuseport_ok)
 {
 	struct sock *sk2;
 	bool reuse = sk->sk_reuse;
@@ -62,7 +62,6 @@ int inet_csk_bind_conflict(const struct sock *sk,
 
 	sk_for_each_bound(sk2, &tb->owners) {
 		if (sk != sk2 &&
-		    !inet_v6_ipv6only(sk2) &&
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
@@ -72,23 +71,18 @@ int inet_csk_bind_conflict(const struct sock *sk,
 			     rcu_access_pointer(sk->sk_reuseport_cb) ||
 			     (sk2->sk_state != TCP_TIME_WAIT &&
 			     !uid_eq(uid, sock_i_uid(sk2))))) {
-
-				if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
-				    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
+				if (inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
 					break;
 			}
 			if (!relax && reuse && sk2->sk_reuse &&
 			    sk2->sk_state != TCP_LISTEN) {
-
-				if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
-				    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
+				if (inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
 					break;
 			}
 		}
 	}
 	return sk2 != NULL;
 }
-EXPORT_SYMBOL_GPL(inet_csk_bind_conflict);
 
 /* Obtain a reference to a local port for the given sock,
  * if snum is zero it means select any available local port.
@@ -167,8 +161,7 @@ other_parity_scan:
 					smallest_size = tb->num_owners;
 					smallest_port = port;
 				}
-				if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false,
-									      reuseport_ok))
+				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok))
 					goto tb_found;
 				goto next_port;
 			}
@@ -209,8 +202,7 @@ tb_found:
 		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
 		    smallest_size == -1)
 			goto success;
-		if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true,
-							     reuseport_ok)) {
+		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok)) {
 			if ((reuse ||
 			     (tb->fastreuseport > 0 &&
 			      sk->sk_reuseport &&
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 029708f..7608012 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -63,6 +63,7 @@
 #include <linux/times.h>
 #include <linux/slab.h>
 
+#include <net/addrconf.h>
 #include <net/net_namespace.h>
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -1781,7 +1782,7 @@ const struct inet_connection_sock_af_ops ipv4_specific = {
 	.getsockopt	   = ip_getsockopt,
 	.addr2sockaddr	   = inet_csk_addr2sockaddr,
 	.sockaddr_len	   = sizeof(struct sockaddr_in),
-	.bind_conflict	   = inet_csk_bind_conflict,
+	.rcv_saddr_equal   = ipv4_rcv_saddr_equal,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_ip_setsockopt,
 	.compat_getsockopt = compat_ip_getsockopt,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2a70c05..6089ea8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -374,6 +374,7 @@ int ipv4_rcv_saddr_equal(const struct sock *sk1, const struct sock *sk2,
 	}
 	return 0;
 }
+EXPORT_SYMBOL(ipv4_rcv_saddr_equal);
 
 static u32 udp4_portaddr_hash(const struct net *net, __be32 saddr,
 			      unsigned int port)
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 71939a2..7538715 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -28,46 +28,6 @@
 #include <net/inet6_connection_sock.h>
 #include <net/sock_reuseport.h>
 
-int inet6_csk_bind_conflict(const struct sock *sk,
-			    const struct inet_bind_bucket *tb, bool relax,
-			    bool reuseport_ok)
-{
-	const struct sock *sk2;
-	bool reuse = !!sk->sk_reuse;
-	bool reuseport = !!sk->sk_reuseport && reuseport_ok;
-	kuid_t uid = sock_i_uid((struct sock *)sk);
-
-	/* We must walk the whole port owner list in this case. -DaveM */
-	/*
-	 * See comment in inet_csk_bind_conflict about sock lookup
-	 * vs net namespaces issues.
-	 */
-	sk_for_each_bound(sk2, &tb->owners) {
-		if (sk != sk2 &&
-		    (!sk->sk_bound_dev_if ||
-		     !sk2->sk_bound_dev_if ||
-		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
-			if ((!reuse || !sk2->sk_reuse ||
-			     sk2->sk_state == TCP_LISTEN) &&
-			    (!reuseport || !sk2->sk_reuseport ||
-			     rcu_access_pointer(sk->sk_reuseport_cb) ||
-			     (sk2->sk_state != TCP_TIME_WAIT &&
-			      !uid_eq(uid,
-				      sock_i_uid((struct sock *)sk2))))) {
-				if (ipv6_rcv_saddr_equal(sk, sk2, true))
-					break;
-			}
-			if (!relax && reuse && sk2->sk_reuse &&
-			    sk2->sk_state != TCP_LISTEN &&
-			    ipv6_rcv_saddr_equal(sk, sk2, true))
-				break;
-		}
-	}
-
-	return sk2 != NULL;
-}
-EXPORT_SYMBOL_GPL(inet6_csk_bind_conflict);
-
 struct dst_entry *inet6_csk_route_req(const struct sock *sk,
 				      struct flowi6 *fl6,
 				      const struct request_sock *req,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bee59a6..2f40b98 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1603,7 +1603,7 @@ static const struct inet_connection_sock_af_ops ipv6_specific = {
 	.getsockopt	   = ipv6_getsockopt,
 	.addr2sockaddr	   = inet6_csk_addr2sockaddr,
 	.sockaddr_len	   = sizeof(struct sockaddr_in6),
-	.bind_conflict	   = inet6_csk_bind_conflict,
+	.rcv_saddr_equal   = ipv6_rcv_saddr_equal,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_ipv6_setsockopt,
 	.compat_getsockopt = compat_ipv6_getsockopt,
@@ -1634,7 +1634,7 @@ static const struct inet_connection_sock_af_ops ipv6_mapped = {
 	.getsockopt	   = ipv6_getsockopt,
 	.addr2sockaddr	   = inet6_csk_addr2sockaddr,
 	.sockaddr_len	   = sizeof(struct sockaddr_in6),
-	.bind_conflict	   = inet6_csk_bind_conflict,
+	.rcv_saddr_equal   = ipv6_rcv_saddr_equal,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_ipv6_setsockopt,
 	.compat_getsockopt = compat_ipv6_getsockopt,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port
  2016-12-20 20:06 [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port Josef Bacik
  2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
@ 2016-12-20 20:07 ` Josef Bacik
  2016-12-21 18:30   ` David Miller
  2016-12-20 20:07 ` [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-12-20 20:07 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

In inet_csk_get_port we seem to be using smallest_port to figure out where the
best place to look for a SO_REUSEPORT sk that matches with an existing set of
SO_REUSEPORT's.  However if we get to the logic

if (smallest_size != -1) {
	port = smallest_port;
	goto have_port;
}

we will do a useless search, because we would have already done the
inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
would have gone to found_tb and succeeded.  Since this logic makes us do yet
another trip through inet_csk_bind_conflict for a port we know won't work just
delete this code and save us the time.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 net/ipv4/inet_connection_sock.c | 26 ++++----------------------
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 74f6a57..1a1a94bd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -93,7 +93,6 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
 	int ret = 1, attempts = 5, port = snum;
-	int smallest_size = -1, smallest_port;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
 	int i, low, high, attempt_half;
@@ -103,7 +102,6 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	bool reuseport_ok = !!snum;
 
 	if (port) {
-have_port:
 		head = &hinfo->bhash[inet_bhashfn(net, port,
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
@@ -137,8 +135,6 @@ other_half_scan:
 	 * We do the opposite to not pollute connect() users.
 	 */
 	offset |= 1U;
-	smallest_size = -1;
-	smallest_port = low; /* avoid compiler warning */
 
 other_parity_scan:
 	port = low + offset;
@@ -152,15 +148,6 @@ other_parity_scan:
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
 			if (net_eq(ib_net(tb), net) && tb->port == port) {
-				if (((tb->fastreuse > 0 && reuse) ||
-				     (tb->fastreuseport > 0 &&
-				      sk->sk_reuseport &&
-				      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-				      uid_eq(tb->fastuid, uid))) &&
-				    (tb->num_owners < smallest_size || smallest_size == -1)) {
-					smallest_size = tb->num_owners;
-					smallest_port = port;
-				}
 				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok))
 					goto tb_found;
 				goto next_port;
@@ -171,10 +158,6 @@ next_port:
 		cond_resched();
 	}
 
-	if (smallest_size != -1) {
-		port = smallest_port;
-		goto have_port;
-	}
 	offset--;
 	if (!(offset & 1))
 		goto other_parity_scan;
@@ -196,19 +179,18 @@ tb_found:
 		if (sk->sk_reuse == SK_FORCE_REUSE)
 			goto success;
 
-		if (((tb->fastreuse > 0 && reuse) ||
+		if ((tb->fastreuse > 0 && reuse) ||
 		     (tb->fastreuseport > 0 &&
 		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
-		    smallest_size == -1)
+		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
 			goto success;
 		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok)) {
 			if ((reuse ||
 			     (tb->fastreuseport > 0 &&
 			      sk->sk_reuseport &&
 			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-			      uid_eq(tb->fastuid, uid))) &&
-			    !snum && smallest_size != -1 && --attempts >= 0) {
+			      uid_eq(tb->fastuid, uid))) && !snum &&
+			    --attempts >= 0) {
 				spin_unlock_bh(&head->lock);
 				goto again;
 			}
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port
  2016-12-20 20:06 [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port Josef Bacik
  2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
  2016-12-20 20:07 ` [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port Josef Bacik
@ 2016-12-20 20:07 ` Josef Bacik
  2016-12-21 15:08   ` Hannes Frederic Sowa
  2016-12-20 20:07 ` [PATCH 4/5 net-next] inet: split inet_csk_get_port into two functions Josef Bacik
  2016-12-20 20:07 ` [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk Josef Bacik
  4 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-12-20 20:07 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

This is just wasted time, we've already found a tb that doesn't have a bind
conflict, and we don't drop the head lock so scanning again isn't going to give
us a different answer.  Instead move the tb->reuse setting logic outside of the
found_tb path and put it in the success: path.  Then make it so that we don't
goto again if we find a bind conflict in the found_tb path as we won't reach
this anymore when we are scanning for an ephemeral port.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 net/ipv4/inet_connection_sock.c | 39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1a1a94bd..fc9bfe1 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -92,7 +92,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int ret = 1, attempts = 5, port = snum;
+	int ret = 1, port = snum;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
 	int i, low, high, attempt_half;
@@ -100,6 +100,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	kuid_t uid = sock_i_uid(sk);
 	u32 remaining, offset;
 	bool reuseport_ok = !!snum;
+	bool empty_tb = true;
 
 	if (port) {
 		head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -111,7 +112,6 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 
 		goto tb_not_found;
 	}
-again:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -148,8 +148,12 @@ other_parity_scan:
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
 			if (net_eq(ib_net(tb), net) && tb->port == port) {
-				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok))
-					goto tb_found;
+				if (hlist_empty(&tb->owners))
+					goto success;
+				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok)) {
+					empty_tb = false;
+					goto success;
+				}
 				goto next_port;
 			}
 		goto tb_not_found;
@@ -184,23 +188,12 @@ tb_found:
 		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
 		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
 			goto success;
-		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok)) {
-			if ((reuse ||
-			     (tb->fastreuseport > 0 &&
-			      sk->sk_reuseport &&
-			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-			      uid_eq(tb->fastuid, uid))) && !snum &&
-			    --attempts >= 0) {
-				spin_unlock_bh(&head->lock);
-				goto again;
-			}
+		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok))
 			goto fail_unlock;
-		}
-		if (!reuse)
-			tb->fastreuse = 0;
-		if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
-			tb->fastreuseport = 0;
-	} else {
+		empty_tb = false;
+	}
+success:
+	if (empty_tb) {
 		tb->fastreuse = reuse;
 		if (sk->sk_reuseport) {
 			tb->fastreuseport = 1;
@@ -208,8 +201,12 @@ tb_found:
 		} else {
 			tb->fastreuseport = 0;
 		}
+	} else {
+		if (!reuse)
+			tb->fastreuse = 0;
+		if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
+			tb->fastreuseport = 0;
 	}
-success:
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, port);
 	WARN_ON(inet_csk(sk)->icsk_bind_hash != tb);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/5 net-next] inet: split inet_csk_get_port into two functions
  2016-12-20 20:06 [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port Josef Bacik
                   ` (2 preceding siblings ...)
  2016-12-20 20:07 ` [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port Josef Bacik
@ 2016-12-20 20:07 ` Josef Bacik
  2016-12-20 20:07 ` [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk Josef Bacik
  4 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2016-12-20 20:07 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

inet_csk_get_port does two different things, it either scans for an open port,
or it tries to see if the specified port is available for use.  Since these two
operations have different rules and are basically independent lets split them
into two different functions to make them both more readable.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 net/ipv4/inet_connection_sock.c | 72 +++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fc9bfe1..d3ccf62 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -84,34 +84,21 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 	return sk2 != NULL;
 }
 
-/* Obtain a reference to a local port for the given sock,
- * if snum is zero it means select any available local port.
- * We try to allocate an odd port (and leave even ports for connect())
+/*
+ * Find an open port number for the socket.  Returns with the
+ * inet_bind_hashbucket lock held.
  */
-int inet_csk_get_port(struct sock *sk, unsigned short snum)
+static struct inet_bind_hashbucket *
+inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *port_ret)
 {
-	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int ret = 1, port = snum;
+	int port = 0;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
-	kuid_t uid = sock_i_uid(sk);
 	u32 remaining, offset;
-	bool reuseport_ok = !!snum;
-	bool empty_tb = true;
 
-	if (port) {
-		head = &hinfo->bhash[inet_bhashfn(net, port,
-						  hinfo->bhash_size)];
-		spin_lock_bh(&head->lock);
-		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->port == port)
-				goto tb_found;
-
-		goto tb_not_found;
-	}
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -150,13 +137,12 @@ other_parity_scan:
 			if (net_eq(ib_net(tb), net) && tb->port == port) {
 				if (hlist_empty(&tb->owners))
 					goto success;
-				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok)) {
-					empty_tb = false;
+				if (!inet_csk_bind_conflict(sk, tb, false, false))
 					goto success;
-				}
 				goto next_port;
 			}
-		goto tb_not_found;
+		tb = NULL;
+		goto success;
 next_port:
 		spin_unlock_bh(&head->lock);
 		cond_resched();
@@ -171,8 +157,44 @@ next_port:
 		attempt_half = 2;
 		goto other_half_scan;
 	}
-	return ret;
+	return NULL;
+success:
+	*port_ret = port;
+	*tb_ret = tb;
+	return head;
+}
 
+/* Obtain a reference to a local port for the given sock,
+ * if snum is zero it means select any available local port.
+ * We try to allocate an odd port (and leave even ports for connect())
+ */
+int inet_csk_get_port(struct sock *sk, unsigned short snum)
+{
+	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
+	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	int ret = 1, port = snum;
+	struct inet_bind_hashbucket *head;
+	struct net *net = sock_net(sk);
+	struct inet_bind_bucket *tb = NULL;
+	kuid_t uid = sock_i_uid(sk);
+	bool empty_tb = true;
+
+	if (!port) {
+		head = inet_csk_find_open_port(sk, &tb, &port);
+		if (!head)
+			return 1;
+		if (!tb)
+			goto tb_not_found;
+		if (!hlist_empty(&tb->owners))
+			empty_tb = false;
+		goto success;
+	}
+	head = &hinfo->bhash[inet_bhashfn(net, port,
+					  hinfo->bhash_size)];
+	spin_lock_bh(&head->lock);
+	inet_bind_bucket_for_each(tb, &head->chain)
+		if (net_eq(ib_net(tb), net) && tb->port == port)
+			goto tb_found;
 tb_not_found:
 	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
 				     net, head, port);
@@ -188,7 +210,7 @@ tb_found:
 		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
 		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
 			goto success;
-		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok))
+		if (inet_csk_bind_conflict(sk, tb, true, true))
 			goto fail_unlock;
 		empty_tb = false;
 	}
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk
  2016-12-20 20:06 [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port Josef Bacik
                   ` (3 preceding siblings ...)
  2016-12-20 20:07 ` [PATCH 4/5 net-next] inet: split inet_csk_get_port into two functions Josef Bacik
@ 2016-12-20 20:07 ` Josef Bacik
  2016-12-21 16:49   ` Craig Gallek
  4 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-12-20 20:07 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
never set it again.  Which means that in the future if we end up adding a bunch
of reuseport sk's to that tb we'll have to do the expensive scan every time.
Instead add a sock_common to the tb so we know what reuseport sk succeeded last.
Once one sk has made it onto the list we know that there are no potential bind
conflicts on the owners list that match that sk's rcv_addr.  So copy the sk's
common into our tb->fastsock and set tb->fastruseport to FASTREUSESOCK_STRICT so
we know we have to do an extra check for subsequent reuseport sockets and skip
the expensive bind conflict check.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/net/inet_hashtables.h   |  4 ++++
 net/ipv4/inet_connection_sock.c | 53 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 50f635c..b776401 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -74,12 +74,16 @@ struct inet_ehash_bucket {
  * users logged onto your box, isn't it nice to know that new data
  * ports are created in O(1) time?  I thought so. ;-)	-DaveM
  */
+#define FASTREUSEPORT_ANY	1
+#define FASTREUSEPORT_STRICT	2
+
 struct inet_bind_bucket {
 	possible_net_t		ib_net;
 	unsigned short		port;
 	signed char		fastreuse;
 	signed char		fastreuseport;
 	kuid_t			fastuid;
+	struct sock_common	fastsock;
 	int			num_owners;
 	struct hlist_node	node;
 	struct hlist_head	owners;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d3ccf62..9e29fad 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -164,6 +164,32 @@ success:
 	return head;
 }
 
+static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
+				     struct sock *sk)
+{
+	struct sock *sk2 = (struct sock *)&tb->fastsock;
+	kuid_t uid = sock_i_uid(sk);
+
+	if (tb->fastreuseport <= 0)
+		return 0;
+	if (!sk->sk_reuseport)
+		return 0;
+	if (rcu_access_pointer(sk->sk_reuseport_cb))
+		return 0;
+	if (!uid_eq(tb->fastuid, uid))
+		return 0;
+	/* We only need to check the rcv_saddr if this tb was once marked
+	 * without fastreuseport and then was reset, as we can only know that
+	 * the fastsock has no potential bind conflicts with the rest of the
+	 * possible socks on the owners list.
+	 */
+	if (tb->fastreuseport == FASTREUSEPORT_ANY)
+		return 1;
+	if (!inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
+		return 0;
+	return 1;
+}
+
 /* Obtain a reference to a local port for the given sock,
  * if snum is zero it means select any available local port.
  * We try to allocate an odd port (and leave even ports for connect())
@@ -206,9 +232,7 @@ tb_found:
 			goto success;
 
 		if ((tb->fastreuse > 0 && reuse) ||
-		     (tb->fastreuseport > 0 &&
-		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
+		    sk_reuseport_match(tb, sk))
 			goto success;
 		if (inet_csk_bind_conflict(sk, tb, true, true))
 			goto fail_unlock;
@@ -220,14 +244,35 @@ success:
 		if (sk->sk_reuseport) {
 			tb->fastreuseport = 1;
 			tb->fastuid = uid;
+			memcpy(&tb->fastsock, &sk->__sk_common,
+			       sizeof(struct sock_common));
 		} else {
 			tb->fastreuseport = 0;
 		}
 	} else {
 		if (!reuse)
 			tb->fastreuse = 0;
-		if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
+		if (sk->sk_reuseport) {
+			/* We didn't match or we don't have fastreuseport set on
+			 * the tb, but we have sk_reuseport set on this socket
+			 * and we know that there are no bind conflicts with
+			 * this socket in this tb, so reset our tb's reuseport
+			 * settings so that any subsequent sockets that match
+			 * our current socket will be put on the fast path.
+			 *
+			 * If we reset we need to set FASTREUSEPORT_STRICT so we
+			 * do extra checking for all subsequent sk_reuseport
+			 * socks.
+			 */
+			if (!sk_reuseport_match(tb, sk)) {
+				tb->fastreuseport = FASTREUSEPORT_STRICT;
+				tb->fastuid = uid;
+				memcpy(&tb->fastsock, &sk->__sk_common,
+				       sizeof(struct sock_common));
+			}
+		} else {
 			tb->fastreuseport = 0;
+		}
 	}
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, port);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
@ 2016-12-21 15:06   ` Hannes Frederic Sowa
  2016-12-21 15:16     ` Josef Bacik
  2016-12-21 15:23   ` Hannes Frederic Sowa
  2016-12-21 18:28   ` David Miller
  2 siblings, 1 reply; 18+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-21 15:06 UTC (permalink / raw)
  To: Josef Bacik, davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
> The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
> is how they check the rcv_saddr.  Since we want to be able to check the saddr in
> other places just drop the protocol specific ->bind_conflict and replace it with
> ->rcv_saddr_equal, then make inet_csk_bind_conflict the one true bind conflict
> function.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> 



> ---
>  include/net/inet6_connection_sock.h |  5 -----
>  include/net/inet_connection_sock.h  |  9 +++------
>  net/dccp/ipv4.c                     |  3 ++-
>  net/dccp/ipv6.c                     |  2 +-
>  net/ipv4/inet_connection_sock.c     | 22 +++++++-------------
>  net/ipv4/tcp_ipv4.c                 |  3 ++-
>  net/ipv4/udp.c                      |  1 +
>  net/ipv6/inet6_connection_sock.c    | 40 -------------------------------------
>  net/ipv6/tcp_ipv6.c                 |  4 ++--
>  9 files changed, 18 insertions(+), 71 deletions(-)
> 
> diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
> index 3212b39..8ec87b6 100644
> --- a/include/net/inet6_connection_sock.h
> +++ b/include/net/inet6_connection_sock.h
> @@ -15,16 +15,11 @@
>  
>  #include <linux/types.h>
>  
> -struct inet_bind_bucket;
>  struct request_sock;
>  struct sk_buff;
>  struct sock;
>  struct sockaddr;
>  
> -int inet6_csk_bind_conflict(const struct sock *sk,
> -			    const struct inet_bind_bucket *tb, bool relax,
> -			    bool soreuseport_ok);
> -
>  struct dst_entry *inet6_csk_route_req(const struct sock *sk, struct flowi6 *fl6,
>  				      const struct request_sock *req, u8 proto);
>  
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index ec0479a..9cd43c5 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -62,9 +62,9 @@ struct inet_connection_sock_af_ops {
>  				char __user *optval, int __user *optlen);
>  #endif
>  	void	    (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
> -	int	    (*bind_conflict)(const struct sock *sk,
> -				     const struct inet_bind_bucket *tb,
> -				     bool relax, bool soreuseport_ok);
> +	int         (*rcv_saddr_equal)(const struct sock *sk1,
> +				       const struct sock *sk2,
> +				       bool match_wildcard);
>  	void	    (*mtu_reduced)(struct sock *sk);
>  };
>  
> 

The patch looks as a nice code cleanup already!

Have you looked if we can simply have one rcv_saddr_equal for both ipv4
and ipv6 that e.g. uses sk->sk_family instead of function pointers?
This could give us even more possibilities to remove some indirect
functions calls and thus might relieve some cycles?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port
  2016-12-20 20:07 ` [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port Josef Bacik
@ 2016-12-21 15:08   ` Hannes Frederic Sowa
  2016-12-21 15:12     ` Josef Bacik
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-21 15:08 UTC (permalink / raw)
  To: Josef Bacik, davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -92,7 +92,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
>  {
>  	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
>  	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> -	int ret = 1, attempts = 5, port = snum;
> +	int ret = 1, port = snum;
>  	struct inet_bind_hashbucket *head;
>  	struct net *net = sock_net(sk);
>  	int i, low, high, attempt_half;
> @@ -100,6 +100,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
>  	kuid_t uid = sock_i_uid(sk);
>  	u32 remaining, offset;
>  	bool reuseport_ok = !!snum;
> +	bool empty_tb = true;
>  
>  	if (port) {
>  		head = &hinfo->bhash[inet_bhashfn(net, port,
> @@ -111,7 +112,6 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
>  
>  		goto tb_not_found;
>  	}
> -again:
>  	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
>  other_half_scan:
>  	inet_get_local_port_range(net, &low, &high);
> @@ -148,8 +148,12 @@ other_parity_scan:
>  		spin_lock_bh(&head->lock);
>  		inet_bind_bucket_for_each(tb, &head->chain)
>  			if (net_eq(ib_net(tb), net) && tb->port == port) {
> -				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok))
> -					goto tb_found;
> +				if (hlist_empty(&tb->owners))
> +					goto success;
> +				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok)) {
> +					empty_tb = false;
> +					goto success;
> +				}
>  				goto next_port;
>  			}
>  		goto tb_not_found;
> @@ -184,23 +188,12 @@ tb_found:
>  		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
>  		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
>  			goto success;
> -		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok)) {
> -			if ((reuse ||
> -			     (tb->fastreuseport > 0 &&
> -			      sk->sk_reuseport &&
> -			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
> -			      uid_eq(tb->fastuid, uid))) && !snum &&
> -			    --attempts >= 0) {
> -				spin_unlock_bh(&head->lock);
> -				goto again;
> -			}
> +		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok))
>  			goto fail_unlock;
> -		}
> -		if (!reuse)
> -			tb->fastreuse = 0;
> -		if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
> -			tb->fastreuseport = 0;
> -	} else {
> +		empty_tb = false;
> +	}
> +success:
> +	if (empty_tb) {


I would fine it even more simple to read, if you redo the hlist_empty
check here instead of someone has to review all the paths where this
might get set. hlist_empty is a very quick test.


Thanks,
Hannes

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port
  2016-12-21 15:08   ` Hannes Frederic Sowa
@ 2016-12-21 15:12     ` Josef Bacik
  0 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2016-12-21 15:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Wed, Dec 21, 2016 at 10:08 AM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
>>  --- a/net/ipv4/inet_connection_sock.c
>>  +++ b/net/ipv4/inet_connection_sock.c
>>  @@ -92,7 +92,7 @@ int inet_csk_get_port(struct sock *sk, unsigned 
>> short snum)
>>   {
>>   	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
>>   	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
>>  -	int ret = 1, attempts = 5, port = snum;
>>  +	int ret = 1, port = snum;
>>   	struct inet_bind_hashbucket *head;
>>   	struct net *net = sock_net(sk);
>>   	int i, low, high, attempt_half;
>>  @@ -100,6 +100,7 @@ int inet_csk_get_port(struct sock *sk, unsigned 
>> short snum)
>>   	kuid_t uid = sock_i_uid(sk);
>>   	u32 remaining, offset;
>>   	bool reuseport_ok = !!snum;
>>  +	bool empty_tb = true;
>> 
>>   	if (port) {
>>   		head = &hinfo->bhash[inet_bhashfn(net, port,
>>  @@ -111,7 +112,6 @@ int inet_csk_get_port(struct sock *sk, unsigned 
>> short snum)
>> 
>>   		goto tb_not_found;
>>   	}
>>  -again:
>>   	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
>>   other_half_scan:
>>   	inet_get_local_port_range(net, &low, &high);
>>  @@ -148,8 +148,12 @@ other_parity_scan:
>>   		spin_lock_bh(&head->lock);
>>   		inet_bind_bucket_for_each(tb, &head->chain)
>>   			if (net_eq(ib_net(tb), net) && tb->port == port) {
>>  -				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok))
>>  -					goto tb_found;
>>  +				if (hlist_empty(&tb->owners))
>>  +					goto success;
>>  +				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok)) {
>>  +					empty_tb = false;
>>  +					goto success;
>>  +				}
>>   				goto next_port;
>>   			}
>>   		goto tb_not_found;
>>  @@ -184,23 +188,12 @@ tb_found:
>>   		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>   		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
>>   			goto success;
>>  -		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok)) {
>>  -			if ((reuse ||
>>  -			     (tb->fastreuseport > 0 &&
>>  -			      sk->sk_reuseport &&
>>  -			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>  -			      uid_eq(tb->fastuid, uid))) && !snum &&
>>  -			    --attempts >= 0) {
>>  -				spin_unlock_bh(&head->lock);
>>  -				goto again;
>>  -			}
>>  +		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok))
>>   			goto fail_unlock;
>>  -		}
>>  -		if (!reuse)
>>  -			tb->fastreuse = 0;
>>  -		if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
>>  -			tb->fastreuseport = 0;
>>  -	} else {
>>  +		empty_tb = false;
>>  +	}
>>  +success:
>>  +	if (empty_tb) {
> 
> 
> I would fine it even more simple to read, if you redo the hlist_empty
> check here instead of someone has to review all the paths where this
> might get set. hlist_empty is a very quick test.

Yup that's fair, I'll fix that up.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-21 15:06   ` Hannes Frederic Sowa
@ 2016-12-21 15:16     ` Josef Bacik
  2016-12-22 20:00       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-12-21 15:16 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Wed, Dec 21, 2016 at 10:06 AM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
>>  The only difference between inet6_csk_bind_conflict and 
>> inet_csk_bind_conflict
>>  is how they check the rcv_saddr.  Since we want to be able to check 
>> the saddr in
>>  other places just drop the protocol specific ->bind_conflict and 
>> replace it with
>>  ->rcv_saddr_equal, then make inet_csk_bind_conflict the one true 
>> bind conflict
>>  function.
>> 
>>  Signed-off-by: Josef Bacik <jbacik@fb.com>
>> 
> 
> 
> 
>>  ---
>>   include/net/inet6_connection_sock.h |  5 -----
>>   include/net/inet_connection_sock.h  |  9 +++------
>>   net/dccp/ipv4.c                     |  3 ++-
>>   net/dccp/ipv6.c                     |  2 +-
>>   net/ipv4/inet_connection_sock.c     | 22 +++++++-------------
>>   net/ipv4/tcp_ipv4.c                 |  3 ++-
>>   net/ipv4/udp.c                      |  1 +
>>   net/ipv6/inet6_connection_sock.c    | 40 
>> -------------------------------------
>>   net/ipv6/tcp_ipv6.c                 |  4 ++--
>>   9 files changed, 18 insertions(+), 71 deletions(-)
>> 
>>  diff --git a/include/net/inet6_connection_sock.h 
>> b/include/net/inet6_connection_sock.h
>>  index 3212b39..8ec87b6 100644
>>  --- a/include/net/inet6_connection_sock.h
>>  +++ b/include/net/inet6_connection_sock.h
>>  @@ -15,16 +15,11 @@
>> 
>>   #include <linux/types.h>
>> 
>>  -struct inet_bind_bucket;
>>   struct request_sock;
>>   struct sk_buff;
>>   struct sock;
>>   struct sockaddr;
>> 
>>  -int inet6_csk_bind_conflict(const struct sock *sk,
>>  -			    const struct inet_bind_bucket *tb, bool relax,
>>  -			    bool soreuseport_ok);
>>  -
>>   struct dst_entry *inet6_csk_route_req(const struct sock *sk, 
>> struct flowi6 *fl6,
>>   				      const struct request_sock *req, u8 proto);
>> 
>>  diff --git a/include/net/inet_connection_sock.h 
>> b/include/net/inet_connection_sock.h
>>  index ec0479a..9cd43c5 100644
>>  --- a/include/net/inet_connection_sock.h
>>  +++ b/include/net/inet_connection_sock.h
>>  @@ -62,9 +62,9 @@ struct inet_connection_sock_af_ops {
>>   				char __user *optval, int __user *optlen);
>>   #endif
>>   	void	    (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
>>  -	int	    (*bind_conflict)(const struct sock *sk,
>>  -				     const struct inet_bind_bucket *tb,
>>  -				     bool relax, bool soreuseport_ok);
>>  +	int         (*rcv_saddr_equal)(const struct sock *sk1,
>>  +				       const struct sock *sk2,
>>  +				       bool match_wildcard);
>>   	void	    (*mtu_reduced)(struct sock *sk);
>>   };
>> 
>> 
> 
> The patch looks as a nice code cleanup already!
> 
> Have you looked if we can simply have one rcv_saddr_equal for both 
> ipv4
> and ipv6 that e.g. uses sk->sk_family instead of function pointers?
> This could give us even more possibilities to remove some indirect
> functions calls and thus might relieve some cycles?

I was going to do that but I'm not familiar enough with how sockets 
work to be comfortable.  My main concern is we have the ipv6_only() 
check on a socket, which seems to indicate to me that you can have a 
socket that can do both ipv4/ipv6, so what if we're specifically going 
through the ipv6 code, but we aren't ipv6_only() and we end up doing 
the ipv4 address compare when we really need to do the ipv6 address 
compare?  If this can't happen (and honestly as I type it out it sounds 
crazier than it did in my head) then yeah I'll totally do that as well 
and we can just have a global function without the protocol specific 
callbacks, but I need you or somebody to tell me I'm crazy and that it 
would be ok to have it all in one function.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
  2016-12-21 15:06   ` Hannes Frederic Sowa
@ 2016-12-21 15:23   ` Hannes Frederic Sowa
  2016-12-21 15:59     ` Josef Bacik
  2016-12-21 18:28   ` David Miller
  2 siblings, 1 reply; 18+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-21 15:23 UTC (permalink / raw)
  To: Josef Bacik, davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -926,7 +926,7 @@ static const struct inet_connection_sock_af_ops dccp_ipv6_af_ops = {
>  	.getsockopt	   = ipv6_getsockopt,
>  	.addr2sockaddr	   = inet6_csk_addr2sockaddr,
>  	.sockaddr_len	   = sizeof(struct sockaddr_in6),
> -	.bind_conflict	   = inet6_csk_bind_conflict,
> +	.rcv_saddr_equal   = ipv6_rcv_saddr_equal,
>  #ifdef CONFIG_COMPAT
>  	.compat_setsockopt = compat_ipv6_setsockopt,
>  	.compat_getsockopt = compat_ipv6_getsockopt,
> 

Btw, small nit, you forgot the corresponding changes in
dccp_ipv6_mapped, thus causing this compiler error:

net/dccp/ipv6.c:961:2: error: unknown field ‘bind_conflict’ specified in initializer
  .bind_conflict    = inet6_csk_bind_conflict,
  ^
net/dccp/ipv6.c:961:22: error: ‘inet6_csk_bind_conflict’ undeclared here (not in a function)
  .bind_conflict    = inet6_csk_bind_conflict,
                      ^~~~~~~~~~~~~~~~~~~~~~~
scripts/Makefile.build:293: recipe for target 'net/dccp/ipv6.o' failed

Bye,
Hannes

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-21 15:23   ` Hannes Frederic Sowa
@ 2016-12-21 15:59     ` Josef Bacik
  0 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2016-12-21 15:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Wed, Dec 21, 2016 at 10:23 AM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
>>  --- a/net/dccp/ipv6.c
>>  +++ b/net/dccp/ipv6.c
>>  @@ -926,7 +926,7 @@ static const struct inet_connection_sock_af_ops 
>> dccp_ipv6_af_ops = {
>>   	.getsockopt	   = ipv6_getsockopt,
>>   	.addr2sockaddr	   = inet6_csk_addr2sockaddr,
>>   	.sockaddr_len	   = sizeof(struct sockaddr_in6),
>>  -	.bind_conflict	   = inet6_csk_bind_conflict,
>>  +	.rcv_saddr_equal   = ipv6_rcv_saddr_equal,
>>   #ifdef CONFIG_COMPAT
>>   	.compat_setsockopt = compat_ipv6_setsockopt,
>>   	.compat_getsockopt = compat_ipv6_getsockopt,
>> 
> 
> Btw, small nit, you forgot the corresponding changes in
> dccp_ipv6_mapped, thus causing this compiler error:
> 
> net/dccp/ipv6.c:961:2: error: unknown field ‘bind_conflict’ 
> specified in initializer
>   .bind_conflict    = inet6_csk_bind_conflict,
>   ^
> net/dccp/ipv6.c:961:22: error: ‘inet6_csk_bind_conflict’ 
> undeclared here (not in a function)
>   .bind_conflict    = inet6_csk_bind_conflict,
>                       ^~~~~~~~~~~~~~~~~~~~~~~
> scripts/Makefile.build:293: recipe for target 'net/dccp/ipv6.o' failed

Yeah kbuild caught that yesterday and I have it fixed up in my git tree 
already, thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk
  2016-12-20 20:07 ` [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk Josef Bacik
@ 2016-12-21 16:49   ` Craig Gallek
  0 siblings, 0 replies; 18+ messages in thread
From: Craig Gallek @ 2016-12-21 16:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Miller, Hannes Frederic Sowa, Eric Dumazet, Tom Herbert,
	netdev, kernel-team

On Tue, Dec 20, 2016 at 3:07 PM, Josef Bacik <jbacik@fb.com> wrote:
> If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
> never set it again.  Which means that in the future if we end up adding a bunch
> of reuseport sk's to that tb we'll have to do the expensive scan every time.
> Instead add a sock_common to the tb so we know what reuseport sk succeeded last.
> Once one sk has made it onto the list we know that there are no potential bind
> conflicts on the owners list that match that sk's rcv_addr.  So copy the sk's
> common into our tb->fastsock and set tb->fastruseport to FASTREUSESOCK_STRICT so
> we know we have to do an extra check for subsequent reuseport sockets and skip
> the expensive bind conflict check.
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  include/net/inet_hashtables.h   |  4 ++++
>  net/ipv4/inet_connection_sock.c | 53 +++++++++++++++++++++++++++++++++++++----
>  2 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 50f635c..b776401 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -74,12 +74,16 @@ struct inet_ehash_bucket {
>   * users logged onto your box, isn't it nice to know that new data
>   * ports are created in O(1) time?  I thought so. ;-)  -DaveM
>   */
> +#define FASTREUSEPORT_ANY      1
> +#define FASTREUSEPORT_STRICT   2
> +
>  struct inet_bind_bucket {
>         possible_net_t          ib_net;
>         unsigned short          port;
>         signed char             fastreuse;
>         signed char             fastreuseport;
>         kuid_t                  fastuid;
> +       struct sock_common      fastsock;
>         int                     num_owners;
>         struct hlist_node       node;
>         struct hlist_head       owners;
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index d3ccf62..9e29fad 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -164,6 +164,32 @@ success:
>         return head;
>  }
>
> +static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
> +                                    struct sock *sk)
> +{
> +       struct sock *sk2 = (struct sock *)&tb->fastsock;
> +       kuid_t uid = sock_i_uid(sk);
> +
> +       if (tb->fastreuseport <= 0)
> +               return 0;
> +       if (!sk->sk_reuseport)
> +               return 0;
> +       if (rcu_access_pointer(sk->sk_reuseport_cb))
> +               return 0;
> +       if (!uid_eq(tb->fastuid, uid))
> +               return 0;
> +       /* We only need to check the rcv_saddr if this tb was once marked
> +        * without fastreuseport and then was reset, as we can only know that
> +        * the fastsock has no potential bind conflicts with the rest of the
> +        * possible socks on the owners list.
> +        */
> +       if (tb->fastreuseport == FASTREUSEPORT_ANY)
> +               return 1;
> +       if (!inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
The rcv_saddr_equal functions assume the type of the sk to be
inet_sock.  It doesn't look like they actually depend on any of the
inet-specific fields, but it's probably safer to either remove this
assumption or change the type of tb->fastsock to be a full inet_sock.

> +               return 0;
> +       return 1;
> +}
> +
>  /* Obtain a reference to a local port for the given sock,
>   * if snum is zero it means select any available local port.
>   * We try to allocate an odd port (and leave even ports for connect())
> @@ -206,9 +232,7 @@ tb_found:
>                         goto success;
>
>                 if ((tb->fastreuse > 0 && reuse) ||
> -                    (tb->fastreuseport > 0 &&
> -                     !rcu_access_pointer(sk->sk_reuseport_cb) &&
> -                     sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
> +                   sk_reuseport_match(tb, sk))
>                         goto success;
>                 if (inet_csk_bind_conflict(sk, tb, true, true))
>                         goto fail_unlock;
> @@ -220,14 +244,35 @@ success:
>                 if (sk->sk_reuseport) {
>                         tb->fastreuseport = 1;
>                         tb->fastuid = uid;
> +                       memcpy(&tb->fastsock, &sk->__sk_common,
> +                              sizeof(struct sock_common));
>                 } else {
>                         tb->fastreuseport = 0;
>                 }
>         } else {
>                 if (!reuse)
>                         tb->fastreuse = 0;
> -               if (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))
> +               if (sk->sk_reuseport) {
> +                       /* We didn't match or we don't have fastreuseport set on
> +                        * the tb, but we have sk_reuseport set on this socket
> +                        * and we know that there are no bind conflicts with
> +                        * this socket in this tb, so reset our tb's reuseport
> +                        * settings so that any subsequent sockets that match
> +                        * our current socket will be put on the fast path.
> +                        *
> +                        * If we reset we need to set FASTREUSEPORT_STRICT so we
> +                        * do extra checking for all subsequent sk_reuseport
> +                        * socks.
> +                        */
> +                       if (!sk_reuseport_match(tb, sk)) {
> +                               tb->fastreuseport = FASTREUSEPORT_STRICT;
> +                               tb->fastuid = uid;
> +                               memcpy(&tb->fastsock, &sk->__sk_common,
> +                                      sizeof(struct sock_common));
> +                       }
> +               } else {
>                         tb->fastreuseport = 0;
> +               }
>         }
>         if (!inet_csk(sk)->icsk_bind_hash)
>                 inet_bind_hash(sk, tb, port);
> --
> 2.9.3
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
  2016-12-21 15:06   ` Hannes Frederic Sowa
  2016-12-21 15:23   ` Hannes Frederic Sowa
@ 2016-12-21 18:28   ` David Miller
  2 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2016-12-21 18:28 UTC (permalink / raw)
  To: jbacik; +Cc: hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

From: Josef Bacik <jbacik@fb.com>
Date: Tue, 20 Dec 2016 15:07:00 -0500

> The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
> is how they check the rcv_saddr.  Since we want to be able to check the saddr in
> other places just drop the protocol specific ->bind_conflict and replace it with
> ->rcv_saddr_equal, then make inet_csk_bind_conflict the one true bind conflict
> function.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

This may be a nice cleanup and all, but realize that if we do actually
have to traverse a lot of sockets this code has become significantly
slower.

We now have to execute a hard to predict indirect call for every
socket we process on the list.

This is almost certainly why we had two seperate functions expanded
rather than having an AF-specific helper execute in the inner loop
of a generic function.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port
  2016-12-20 20:07 ` [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port Josef Bacik
@ 2016-12-21 18:30   ` David Miller
  2016-12-21 19:29     ` Eric Dumazet
  2016-12-21 19:32     ` Josef Bacik
  0 siblings, 2 replies; 18+ messages in thread
From: David Miller @ 2016-12-21 18:30 UTC (permalink / raw)
  To: jbacik; +Cc: hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

From: Josef Bacik <jbacik@fb.com>
Date: Tue, 20 Dec 2016 15:07:01 -0500

> In inet_csk_get_port we seem to be using smallest_port to figure out where the
> best place to look for a SO_REUSEPORT sk that matches with an existing set of
> SO_REUSEPORT's.  However if we get to the logic
> 
> if (smallest_size != -1) {
> 	port = smallest_port;
> 	goto have_port;
> }
> 
> we will do a useless search, because we would have already done the
> inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
> would have gone to found_tb and succeeded.  Since this logic makes us do yet
> another trip through inet_csk_bind_conflict for a port we know won't work just
> delete this code and save us the time.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

So the "all else being equal, use 'tb' with smallest socket count" logic
wasn't being used at all?

Instead of removing it why don't we make it work properly again?  Something
obviously broke it somewhere along the line, because I am pretty sure this
heuristic worked at some point in the past.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port
  2016-12-21 18:30   ` David Miller
@ 2016-12-21 19:29     ` Eric Dumazet
  2016-12-21 19:32     ` Josef Bacik
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2016-12-21 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: jbacik, hannes, kraigatgoog, tom, netdev, kernel-team

On Wed, 2016-12-21 at 13:30 -0500, David Miller wrote:
> From: Josef Bacik <jbacik@fb.com>
> Date: Tue, 20 Dec 2016 15:07:01 -0500
> 
> > In inet_csk_get_port we seem to be using smallest_port to figure out where the
> > best place to look for a SO_REUSEPORT sk that matches with an existing set of
> > SO_REUSEPORT's.  However if we get to the logic
> > 
> > if (smallest_size != -1) {
> > 	port = smallest_port;
> > 	goto have_port;
> > }
> > 
> > we will do a useless search, because we would have already done the
> > inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
> > would have gone to found_tb and succeeded.  Since this logic makes us do yet
> > another trip through inet_csk_bind_conflict for a port we know won't work just
> > delete this code and save us the time.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> 
> So the "all else being equal, use 'tb' with smallest socket count" logic
> wasn't being used at all?
> 
> Instead of removing it why don't we make it work properly again?  Something
> obviously broke it somewhere along the line, because I am pretty sure this
> heuristic worked at some point in the past.

Yeah, but heuristic to choose the smallest list was hurting a lot, since
it tends to consume all ports.

At connect() time we prefer to have another heuristic, actually leaving
some slots absolutely empty, so that a bind() will succeed later.

Remember I had more than 30 ms latency as described in the 2nd commit
changelog :

References :

commit ea8add2b190395408b22a9127bed2c0912aecbc8
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 11 16:28:50 2016 -0800

    tcp/dccp: better use of ephemeral ports in bind()
    
    Implement strategy used in __inet_hash_connect() in opposite way :
    
    Try to find a candidate using odd ports, then fallback to even ports.
    
    We no longer disable BH for whole traversal, but one bucket at a time.
    We also use cond_resched() to yield cpu to other tasks if needed.
    
    I removed one indentation level and tried to mirror the loop we have
    in __inet_hash_connect() and variable names to ease code maintenance.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 1580ab63fc9a03593072cc5656167a75c4f1d173
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 11 16:28:49 2016 -0800

    tcp/dccp: better use of ephemeral ports in connect()
    
    In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
    in connect()"), I added a very simple heuristic, so that we got better
    chances to use even ports, and allow bind() users to have more available
    slots.
    
    It gave nice results, but with more than 200,000 TCP sessions on a typical
    server, the ~30,000 ephemeral ports are still a rare resource.
    
    I chose to go a step further, by looking at all even ports, and if none
    was available, fallback to odd ports.
    
    The companion patch does the same in bind(), but in opposite way.
    
    I've seen exec times of up to 30ms on busy servers, so I no longer
    disable BH for the whole traversal, but only for each hash bucket.
    I also call cond_resched() to be gentle to other tasks.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port
  2016-12-21 18:30   ` David Miller
  2016-12-21 19:29     ` Eric Dumazet
@ 2016-12-21 19:32     ` Josef Bacik
  1 sibling, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2016-12-21 19:32 UTC (permalink / raw)
  To: David Miller; +Cc: hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On Wed, Dec 21, 2016 at 1:30 PM, David Miller <davem@davemloft.net> 
wrote:
> From: Josef Bacik <jbacik@fb.com>
> Date: Tue, 20 Dec 2016 15:07:01 -0500
> 
>>  In inet_csk_get_port we seem to be using smallest_port to figure 
>> out where the
>>  best place to look for a SO_REUSEPORT sk that matches with an 
>> existing set of
>>  SO_REUSEPORT's.  However if we get to the logic
>> 
>>  if (smallest_size != -1) {
>>  	port = smallest_port;
>>  	goto have_port;
>>  }
>> 
>>  we will do a useless search, because we would have already done the
>>  inet_csk_bind_conflict for that port and it would have returned 1, 
>> otherwise we
>>  would have gone to found_tb and succeeded.  Since this logic makes 
>> us do yet
>>  another trip through inet_csk_bind_conflict for a port we know 
>> won't work just
>>  delete this code and save us the time.
>> 
>>  Signed-off-by: Josef Bacik <jbacik@fb.com>
> 
> So the "all else being equal, use 'tb' with smallest socket count" 
> logic
> wasn't being used at all?
> 
> Instead of removing it why don't we make it work properly again?  
> Something
> obviously broke it somewhere along the line, because I am pretty sure 
> this
> heuristic worked at some point in the past.

Yeah as soon as we would find a tb with no bind conflicts we'd 
immediately jump to found_tb: and subsequently exit, so if we did 
manage to get to the point of checking smallest_size it would be 
redundant as we would have had to hit a bind conflict for that port to 
even reach that code.

How do you want me to add it back?  The logic only kicked in if we were 
SO_REUSEPORT with snum == 0, but Tom tells me that is basically 
useless, so we've disallowed that behavior.  Should we call a 
bind_conflict() for every port and then go back and pick the one with 
the smallest tb?  That's a lot of scanning.  Can you tell me what 
behavior you desire and I'll add another patch to reintroduce it?  
Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal
  2016-12-21 15:16     ` Josef Bacik
@ 2016-12-22 20:00       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 18+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-22 20:00 UTC (permalink / raw)
  To: Josef Bacik; +Cc: davem, kraigatgoog, eric.dumazet, tom, netdev, kernel-team

On 21.12.2016 16:16, Josef Bacik wrote:
> On Wed, Dec 21, 2016 at 10:06 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> On Tue, 2016-12-20 at 15:07 -0500, Josef Bacik wrote:
>>>  The only difference between inet6_csk_bind_conflict and
>>> inet_csk_bind_conflict
>>>  is how they check the rcv_saddr.  Since we want to be able to check
>>> the saddr in
>>>  other places just drop the protocol specific ->bind_conflict and
>>> replace it with
>>>  ->rcv_saddr_equal, then make inet_csk_bind_conflict the one true
>>> bind conflict
>>>  function.
>>>
>>>  Signed-off-by: Josef Bacik <jbacik@fb.com>
>>>
>>
>>
>>
>>>  ---
>>>   include/net/inet6_connection_sock.h |  5 -----
>>>   include/net/inet_connection_sock.h  |  9 +++------
>>>   net/dccp/ipv4.c                     |  3 ++-
>>>   net/dccp/ipv6.c                     |  2 +-
>>>   net/ipv4/inet_connection_sock.c     | 22 +++++++-------------
>>>   net/ipv4/tcp_ipv4.c                 |  3 ++-
>>>   net/ipv4/udp.c                      |  1 +
>>>   net/ipv6/inet6_connection_sock.c    | 40
>>> -------------------------------------
>>>   net/ipv6/tcp_ipv6.c                 |  4 ++--
>>>   9 files changed, 18 insertions(+), 71 deletions(-)
>>>
>>>  diff --git a/include/net/inet6_connection_sock.h
>>> b/include/net/inet6_connection_sock.h
>>>  index 3212b39..8ec87b6 100644
>>>  --- a/include/net/inet6_connection_sock.h
>>>  +++ b/include/net/inet6_connection_sock.h
>>>  @@ -15,16 +15,11 @@
>>>
>>>   #include <linux/types.h>
>>>
>>>  -struct inet_bind_bucket;
>>>   struct request_sock;
>>>   struct sk_buff;
>>>   struct sock;
>>>   struct sockaddr;
>>>
>>>  -int inet6_csk_bind_conflict(const struct sock *sk,
>>>  -                const struct inet_bind_bucket *tb, bool relax,
>>>  -                bool soreuseport_ok);
>>>  -
>>>   struct dst_entry *inet6_csk_route_req(const struct sock *sk, struct
>>> flowi6 *fl6,
>>>                         const struct request_sock *req, u8 proto);
>>>
>>>  diff --git a/include/net/inet_connection_sock.h
>>> b/include/net/inet_connection_sock.h
>>>  index ec0479a..9cd43c5 100644
>>>  --- a/include/net/inet_connection_sock.h
>>>  +++ b/include/net/inet_connection_sock.h
>>>  @@ -62,9 +62,9 @@ struct inet_connection_sock_af_ops {
>>>                   char __user *optval, int __user *optlen);
>>>   #endif
>>>       void        (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
>>>  -    int        (*bind_conflict)(const struct sock *sk,
>>>  -                     const struct inet_bind_bucket *tb,
>>>  -                     bool relax, bool soreuseport_ok);
>>>  +    int         (*rcv_saddr_equal)(const struct sock *sk1,
>>>  +                       const struct sock *sk2,
>>>  +                       bool match_wildcard);
>>>       void        (*mtu_reduced)(struct sock *sk);
>>>   };
>>>
>>>
>>
>> The patch looks as a nice code cleanup already!
>>
>> Have you looked if we can simply have one rcv_saddr_equal for both ipv4
>> and ipv6 that e.g. uses sk->sk_family instead of function pointers?
>> This could give us even more possibilities to remove some indirect
>> functions calls and thus might relieve some cycles?
> 
> I was going to do that but I'm not familiar enough with how sockets work
> to be comfortable.  My main concern is we have the ipv6_only() check on
> a socket, which seems to indicate to me that you can have a socket that
> can do both ipv4/ipv6, so what if we're specifically going through the
> ipv6 code, but we aren't ipv6_only() and we end up doing the ipv4
> address compare when we really need to do the ipv6 address compare?  If
> this can't happen (and honestly as I type it out it sounds crazier than
> it did in my head) then yeah I'll totally do that as well and we can
> just have a global function without the protocol specific callbacks, but
> I need you or somebody to tell me I'm crazy and that it would be ok to
> have it all in one function.  Thanks,

IPv6 sockets can do IPv4 via mapped addresses. The other way around
doesn't work, there are no IPv4 sockets that can speak IPv6. The
ipv6_only flags in IPv4 sockets should always stay 0 but they need to be
evaluated from there side for possible port conflicts.

My idea is to use sk->sk_family, which is in sync with
icsk->icsk_af_ops, which you use for the function pointer lookup, to
switch between those functions. Looking through a lot of callback, I
don't see this assumption violated so far.

This would also solve the problem which David described, your search
would be free of those indirect jumps.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-12-22 20:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-20 20:06 [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port Josef Bacik
2016-12-20 20:07 ` [PATCH 1/5 net-next] inet: replace ->bind_conflict with ->rcv_saddr_equal Josef Bacik
2016-12-21 15:06   ` Hannes Frederic Sowa
2016-12-21 15:16     ` Josef Bacik
2016-12-22 20:00       ` Hannes Frederic Sowa
2016-12-21 15:23   ` Hannes Frederic Sowa
2016-12-21 15:59     ` Josef Bacik
2016-12-21 18:28   ` David Miller
2016-12-20 20:07 ` [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port Josef Bacik
2016-12-21 18:30   ` David Miller
2016-12-21 19:29     ` Eric Dumazet
2016-12-21 19:32     ` Josef Bacik
2016-12-20 20:07 ` [PATCH 3/5 net-next] inet: don't check for bind conflicts twice when searching for a port Josef Bacik
2016-12-21 15:08   ` Hannes Frederic Sowa
2016-12-21 15:12     ` Josef Bacik
2016-12-20 20:07 ` [PATCH 4/5 net-next] inet: split inet_csk_get_port into two functions Josef Bacik
2016-12-20 20:07 ` [PATCH 5/5 net-next] inet: reset tb->fastreuseport when adding a reuseport sk Josef Bacik
2016-12-21 16:49   ` Craig Gallek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.