[PATCH v2] udp: Avoid call to compute_score on multiple sites

* [PATCH v2] udp: Avoid call to compute_score on multiple sites
@ 2024-04-10 21:50 Gabriel Krisman Bertazi
  2024-04-10 22:13 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-04-10 21:50 UTC (permalink / raw)
  To: willemdebruijn.kernel, davem
  Cc: netdev, martin.lau, Gabriel Krisman Bertazi, Lorenz Bauer

We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
sockets are present").  The failing tests were those that would spawn
UDP sockets per-cpu on systems that have a high number of cpus.

Unsurprisingly, it is not caused by the extra re-scoring of the reused
socket, but due to the compiler no longer inlining compute_score, once
it has the extra call site in udp4_lib_lookup2.  This is augmented by
the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.

We could just explicitly inline it, but compute_score() is quite a large
function, around 300b.  Inlining in two sites would almost double
udp4_lib_lookup2, which is a silly thing to do just to workaround a
mitigation.  Instead, this patch shuffles the code a bit to avoid the
multiple calls to compute_score.  Since it is a static function used in
one spot, the compiler can safely fold it in, as it did before, without
increasing the text size.

With this patch applied I ran my original iperf3 testcases.  The failing
cases all looked like this (ipv4):
	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2

where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
tree. harmean == harmonic mean; CV == coefficient of variation.

ipv4:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)

ipv6:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)

This restores the performance we had before the change above with this
benchmark.  We obviously don't expect any real impact when mitigations
are disabled, but just to be sure it also doesn't regresses:

mitigations=off ipv4:
                 1G                10G                  MAX
	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)

Cc: Lorenz Bauer <lmb@isovalent.com>
Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

---
Changes since v1:
(me)
  - recollected performance data after changes below only for the
  mitigations enabled case.
(suggested by Willem de Bruijn)
  - Drop __always_inline in compute_score
  - Simplify logic by replacing third struct sock pointer with bool
  - Fix typo in commit message
  - Don't explicitly break out of loop after rescore
---
 net/ipv4/udp.c | 18 +++++++++++++-----
 net/ipv6/udp.c | 17 +++++++++++++----
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 661d0e0d273f..a13ef8e06093 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
+	bool rescore = false;
 
 	result = NULL;
 	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		score = compute_score(sk, net, saddr, sport,
-				      daddr, hnum, dif, sdif);
+rescore:
+		score = compute_score((rescore ? result : sk), net, saddr,
+				      sport, daddr, hnum, dif, sdif);
+		rescore = false;
 		if (score > badness) {
 			badness = score;
 
@@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 			if (IS_ERR(result))
 				continue;
 
-			badness = compute_score(result, net, saddr, sport,
-						daddr, hnum, dif, sdif);
-
+			/* compute_score is too long of a function to be
+			 * inlined, and calling it again here yields
+			 * measureable overhead for some
+			 * workloads. Work around it by jumping
+			 * backwards to rescore 'result'.
+			 */
+			rescore = true;
+			goto rescore;
 		}
 	}
 	return result;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7c1e6469d091..7a55c050de2b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -168,12 +168,15 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
+	bool rescore = false;
 
 	result = NULL;
 	badness = -1;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		score = compute_score(sk, net, saddr, sport,
-				      daddr, hnum, dif, sdif);
+rescore:
+		score = compute_score((rescore ? result : sk), net, saddr,
+				      sport, daddr, hnum, dif, sdif);
+		rescore = false;
 		if (score > badness) {
 			badness = score;
 
@@ -197,8 +200,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 			if (IS_ERR(result))
 				continue;
 
-			badness = compute_score(sk, net, saddr, sport,
-						daddr, hnum, dif, sdif);
+			/* compute_score is too long of a function to be
+			 * inlined, and calling it again here yields
+			 * measureable overhead for some
+			 * workloads. Work around it by jumping
+			 * backwards to rescore 'result'.
+			 */
+			rescore = true;
+			goto rescore;
 		}
 	}
 	return result;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread