All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Willem de Bruijn <willemb@google.com>,
	Benjamin Herrenschmidt <benh@amazon.com>,
	Kuniyuki Iwashima <kuniyu@amazon.co.jp>,
	"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 5.7 17/20] udp: Improve load balancing for SO_REUSEPORT.
Date: Thu, 30 Jul 2020 10:04:07 +0200	[thread overview]
Message-ID: <20200730074421.342711211@linuxfoundation.org> (raw)
In-Reply-To: <20200730074420.533211699@linuxfoundation.org>

From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>

[ Upstream commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace ]

Currently, SO_REUSEPORT does not work well if connected sockets are in a
UDP reuseport group.

Then reuseport_has_conns() returns true and the result of
reuseport_select_sock() is discarded. Also, unconnected sockets have the
same score, hence only does the first unconnected socket in udp_hslot
always receive all packets sent to unconnected sockets.

So, the result of reuseport_select_sock() should be used for load
balancing.

The noteworthy point is that the unconnected sockets placed after
connected sockets in sock_reuseport.socks will receive more packets than
others because of the algorithm in reuseport_select_sock().

    index | connected | reciprocal_scale | result
    ---------------------------------------------
    0     | no        | 20%              | 40%
    1     | no        | 20%              | 20%
    2     | yes       | 20%              | 0%
    3     | no        | 20%              | 40%
    4     | yes       | 20%              | 0%

If most of the sockets are connected, this can be a problem, but it still
works better than now.

Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets")
CC: Willem de Bruijn <willemb@google.com>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/ipv4/udp.c |   15 +++++++++------
 net/ipv6/udp.c |   15 +++++++++------
 2 files changed, 18 insertions(+), 12 deletions(-)

--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -413,7 +413,7 @@ static struct sock *udp4_lib_lookup2(str
 				     struct udp_hslot *hslot2,
 				     struct sk_buff *skb)
 {
-	struct sock *sk, *result;
+	struct sock *sk, *result, *reuseport_result;
 	int score, badness;
 	u32 hash = 0;
 
@@ -423,17 +423,20 @@ static struct sock *udp4_lib_lookup2(str
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
+			reuseport_result = NULL;
+
 			if (sk->sk_reuseport &&
 			    sk->sk_state != TCP_ESTABLISHED) {
 				hash = udp_ehashfn(net, daddr, hnum,
 						   saddr, sport);
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
+				reuseport_result = reuseport_select_sock(sk, hash, skb,
+									 sizeof(struct udphdr));
+				if (reuseport_result && !reuseport_has_conns(sk, false))
+					return reuseport_result;
 			}
+
+			result = reuseport_result ? : sk;
 			badness = score;
-			result = sk;
 		}
 	}
 	return result;
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -148,7 +148,7 @@ static struct sock *udp6_lib_lookup2(str
 		int dif, int sdif, struct udp_hslot *hslot2,
 		struct sk_buff *skb)
 {
-	struct sock *sk, *result;
+	struct sock *sk, *result, *reuseport_result;
 	int score, badness;
 	u32 hash = 0;
 
@@ -158,17 +158,20 @@ static struct sock *udp6_lib_lookup2(str
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
+			reuseport_result = NULL;
+
 			if (sk->sk_reuseport &&
 			    sk->sk_state != TCP_ESTABLISHED) {
 				hash = udp6_ehashfn(net, daddr, hnum,
 						    saddr, sport);
 
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
+				reuseport_result = reuseport_select_sock(sk, hash, skb,
+									 sizeof(struct udphdr));
+				if (reuseport_result && !reuseport_has_conns(sk, false))
+					return reuseport_result;
 			}
-			result = sk;
+
+			result = reuseport_result ? : sk;
 			badness = score;
 		}
 	}



  parent reply	other threads:[~2020-07-30  8:21 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-30  8:03 [PATCH 5.7 00/20] 5.7.12-rc1 review Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 01/20] AX.25: Fix out-of-bounds read in ax25_connect() Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 02/20] AX.25: Prevent out-of-bounds read in ax25_sendmsg() Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 03/20] dev: Defer free of skbs in flush_backlog Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 04/20] drivers/net/wan/x25_asy: Fix to make it work Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 05/20] ip6_gre: fix null-ptr-deref in ip6gre_init_net() Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 06/20] net/sched: act_ct: fix restore the qdisc_skb_cb after defrag Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 07/20] net-sysfs: add a newline when printing tx_timeout by sysfs Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 08/20] net: udp: Fix wrong clean up for IS_UDPLITE macro Greg Kroah-Hartman
2020-07-30  8:03 ` [PATCH 5.7 09/20] qrtr: orphan socket in qrtr_release() Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 10/20] rtnetlink: Fix memory(net_device) leak when ->newlink fails Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 11/20] rxrpc: Fix sendmsg() returning EPIPE due to recvmsg() returning ENODATA Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 12/20] tcp: allow at most one TLP probe per flight Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 13/20] AX.25: Prevent integer overflows in connect and sendmsg Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 14/20] sctp: shrink stream outq only when new outcnt < old outcnt Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 15/20] sctp: shrink stream outq when fails to do addstream reconf Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 16/20] udp: Copy has_conns in reuseport_grow() Greg Kroah-Hartman
2020-07-30  8:04 ` Greg Kroah-Hartman [this message]
2020-07-30  8:04 ` [PATCH 5.7 18/20] tipc: allow to build NACK message in link timeout function Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 19/20] io_uring: ensure double poll additions work with both request types Greg Kroah-Hartman
2020-07-30  8:04 ` [PATCH 5.7 20/20] regmap: debugfs: check count when read regmap file Greg Kroah-Hartman
2020-07-30 16:48 ` [PATCH 5.7 00/20] 5.7.12-rc1 review Guenter Roeck
2020-07-31 17:15   ` Greg Kroah-Hartman
2020-07-31  8:59 ` Naresh Kamboju
2020-07-31 12:53 ` Jon Hunter
2020-07-31 17:15   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200730074421.342711211@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=benh@amazon.com \
    --cc=davem@davemloft.net \
    --cc=kuniyu@amazon.co.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.