All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] net: prefer socket bound to interface when not in VRF
@ 2021-10-05 13:03 Mike Manning
  2021-10-07 14:07 ` Jakub Kicinski
  2022-06-13  3:14 ` Benjamin Poirier
  0 siblings, 2 replies; 9+ messages in thread
From: Mike Manning @ 2021-10-05 13:03 UTC (permalink / raw)
  To: Netdev; +Cc: David Ahern, Saikrishna Arcot

The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be
chosen when not in a VRF") modified compute_score() so that a device
match is always made, not just in the case of an l3mdev skb, then
increments the score also for unbound sockets. This ensures that
sockets bound to an l3mdev are never selected when not in a VRF.
But as unbound and bound sockets are now scored equally, this results
in the last opened socket being selected if there are matches in the
default VRF for an unbound socket and a socket bound to a dev that is
not an l3mdev. However, handling prior to this commit was to always
select the bound socket in this case. Reinstate this handling by
incrementing the score only for bound sockets. The required isolation
due to choosing between an unbound socket and a socket bound to an
l3mdev remains in place due to the device match always being made.
The same approach is taken for compute_score() for stream sockets.

Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---

diff nettest-baseline-9e9fb7655ed5.txt nettest-fix.txt
955,956c955,956
< TEST: IPv4 TCP connection over VRF with SNAT                                  [FAIL]
< TEST: IPv6 TCP connection over VRF with SNAT                                  [FAIL]
---
> TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
> TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]
958,959c958,959
< Tests passed: 713
< Tests failed:   5
---
> Tests passed: 715
> Tests failed:   3

---
 net/ipv4/inet_hashtables.c  | 4 +++-
 net/ipv4/udp.c              | 3 ++-
 net/ipv6/inet6_hashtables.c | 2 +-
 net/ipv6/udp.c              | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 80aeaf9e6e16..bfb522e51346 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -242,8 +242,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
 
 		if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
 			return -1;
+		score =  sk->sk_bound_dev_if ? 2 : 1;
 
-		score = sk->sk_family == PF_INET ? 2 : 1;
+		if (sk->sk_family == PF_INET)
+			score++;
 		if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id())
 			score++;
 	}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8851c9463b4b..c6aedc674713 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -390,7 +390,8 @@ static int compute_score(struct sock *sk, struct net *net,
 					dif, sdif);
 	if (!dev_match)
 		return -1;
-	score += 4;
+	if (sk->sk_bound_dev_if)
+		score += 4;
 
 	if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id())
 		score++;
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 55c290d55605..67c9114835c8 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -106,7 +106,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 		if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
 			return -1;
 
-		score = 1;
+		score =  sk->sk_bound_dev_if ? 2 : 1;
 		if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id())
 			score++;
 	}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ea53847b5b7e..c5267929825d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -133,7 +133,8 @@ static int compute_score(struct sock *sk, struct net *net,
 	dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif);
 	if (!dev_match)
 		return -1;
-	score++;
+	if (sk->sk_bound_dev_if)
+		score++;
 
 	if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id())
 		score++;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2021-10-05 13:03 [PATCH] net: prefer socket bound to interface when not in VRF Mike Manning
@ 2021-10-07 14:07 ` Jakub Kicinski
  2021-10-07 14:10   ` David Ahern
  2022-06-13  3:14 ` Benjamin Poirier
  1 sibling, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2021-10-07 14:07 UTC (permalink / raw)
  To: David Ahern; +Cc: Mike Manning, Netdev, Saikrishna Arcot

On Tue, 5 Oct 2021 14:03:42 +0100 Mike Manning wrote:
> The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be
> chosen when not in a VRF") modified compute_score() so that a device
> match is always made, not just in the case of an l3mdev skb, then
> increments the score also for unbound sockets. This ensures that
> sockets bound to an l3mdev are never selected when not in a VRF.
> But as unbound and bound sockets are now scored equally, this results
> in the last opened socket being selected if there are matches in the
> default VRF for an unbound socket and a socket bound to a dev that is
> not an l3mdev. However, handling prior to this commit was to always
> select the bound socket in this case. Reinstate this handling by
> incrementing the score only for bound sockets. The required isolation
> due to choosing between an unbound socket and a socket bound to an
> l3mdev remains in place due to the device match always being made.
> The same approach is taken for compute_score() for stream sockets.
> 
> Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
> Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>

David A, Ack?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2021-10-07 14:07 ` Jakub Kicinski
@ 2021-10-07 14:10   ` David Ahern
  2021-10-07 14:27     ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2021-10-07 14:10 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Mike Manning, Netdev, Saikrishna Arcot

On 10/7/21 8:07 AM, Jakub Kicinski wrote:
> On Tue, 5 Oct 2021 14:03:42 +0100 Mike Manning wrote:
>> The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be
>> chosen when not in a VRF") modified compute_score() so that a device
>> match is always made, not just in the case of an l3mdev skb, then
>> increments the score also for unbound sockets. This ensures that
>> sockets bound to an l3mdev are never selected when not in a VRF.
>> But as unbound and bound sockets are now scored equally, this results
>> in the last opened socket being selected if there are matches in the
>> default VRF for an unbound socket and a socket bound to a dev that is
>> not an l3mdev. However, handling prior to this commit was to always
>> select the bound socket in this case. Reinstate this handling by
>> incrementing the score only for bound sockets. The required isolation
>> due to choosing between an unbound socket and a socket bound to an
>> l3mdev remains in place due to the device match always being made.
>> The same approach is taken for compute_score() for stream sockets.
>>
>> Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
>> Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
>> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
> 
> David A, Ack?
> 

yep, sorry, forgot about this one.

Reviewed-by: David Ahern <dsahern@kernel.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2021-10-07 14:10   ` David Ahern
@ 2021-10-07 14:27     ` Jakub Kicinski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2021-10-07 14:27 UTC (permalink / raw)
  To: David Ahern, Mike Manning; +Cc: Netdev, Saikrishna Arcot

On Thu, 7 Oct 2021 08:10:17 -0600 David Ahern wrote:
> On 10/7/21 8:07 AM, Jakub Kicinski wrote:
> > On Tue, 5 Oct 2021 14:03:42 +0100 Mike Manning wrote:  
> >> The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be
> >> chosen when not in a VRF") modified compute_score() so that a device
> >> match is always made, not just in the case of an l3mdev skb, then
> >> increments the score also for unbound sockets. This ensures that
> >> sockets bound to an l3mdev are never selected when not in a VRF.
> >> But as unbound and bound sockets are now scored equally, this results
> >> in the last opened socket being selected if there are matches in the
> >> default VRF for an unbound socket and a socket bound to a dev that is
> >> not an l3mdev. However, handling prior to this commit was to always
> >> select the bound socket in this case. Reinstate this handling by
> >> incrementing the score only for bound sockets. The required isolation
> >> due to choosing between an unbound socket and a socket bound to an
> >> l3mdev remains in place due to the device match always being made.
> >> The same approach is taken for compute_score() for stream sockets.
> >>
> >> Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
> >> Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
> >> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>  
>
> Reviewed-by: David Ahern <dsahern@kernel.org>

Applied, thanks!

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2021-10-05 13:03 [PATCH] net: prefer socket bound to interface when not in VRF Mike Manning
  2021-10-07 14:07 ` Jakub Kicinski
@ 2022-06-13  3:14 ` Benjamin Poirier
  2022-06-13  3:52   ` David Ahern
  2022-06-26 22:25   ` Mike Manning
  1 sibling, 2 replies; 9+ messages in thread
From: Benjamin Poirier @ 2022-06-13  3:14 UTC (permalink / raw)
  To: Mike Manning; +Cc: Netdev, David Ahern, Saikrishna Arcot, Craig Gallek

On 2021-10-05 14:03 +0100, Mike Manning wrote:
[...]
> 
> Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
> Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
> ---
> 
> diff nettest-baseline-9e9fb7655ed5.txt nettest-fix.txt
> 955,956c955,956
> < TEST: IPv4 TCP connection over VRF with SNAT                                  [FAIL]
> < TEST: IPv6 TCP connection over VRF with SNAT                                  [FAIL]
> ---
> > TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
> > TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]
> 958,959c958,959
> < Tests passed: 713
> < Tests failed:   5
> ---
> > Tests passed: 715
> > Tests failed:   3
> 
> ---
>  net/ipv4/inet_hashtables.c  | 4 +++-
>  net/ipv4/udp.c              | 3 ++-
>  net/ipv6/inet6_hashtables.c | 2 +-
>  net/ipv6/udp.c              | 3 ++-
>  4 files changed, 8 insertions(+), 4 deletions(-)
> 

Hi Mike,

I was looking at this commit, 8d6c414cd2fb ("net: prefer socket bound to
interface when not in VRF"), and I get the feeling that it is only
partially effective. It works with UDP connected sockets but it doesn't
work for TCP and UDP unconnected sockets.

The compute_score() functions are a bit misleading. Because of the
reuseport shortcut in their callers (inet_lhash2_lookup() and the like),
the first socket with score > 0 may be chosen, not necessarily the
socket with highest score. In order to prefer certain sockets, I think
an approach like commit d894ba18d4e4 ("soreuseport: fix ordering for
mixed v4/v6 sockets") would be needed. What do you think?

Extra info:
1) fcnal-test.sh results

I tried to reproduce the fcnal-test.sh test results quoted above but in
my case the test cases already pass at 8d6c414cd2fb^ and 9e9fb7655ed5.
Moreover I believe those test cases don't have multiple listening
sockets. So that just added to my confusion.

Running 9e9fb7655ed5,
root@vsid:/src/linux/tools/testing/selftests/net# ./fcnal-test.sh -t use_cases
[...]

#################################################################
SNAT on VRF

TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]

Tests passed:  16
Tests failed:   0


2) reuseport_bindtodevice test

I wrote a selftest based on
tools/testing/selftests/net/reuseport_addr_any.c It tests that listening
sockets that have SO_BINDTODEVICE set are preferred over ones that do
not. All of the sockets have SO_REUSEPORT set. I ran it over a few
relevant revisions:

               IPv4                       IPv6
HEAD           TCP  UDP unconn  UDP conn  TCP  UDP unconn  UDP conn
6a5ef90c58da^  ✔    ✔           ✔         ✔    ✔           ✔   
6a5ef90c58da   ✔    ✘           ✔         ✔    ✘           ✔   
fd1914b2901b   ✘    ✘           ✔         ✘    ✘           ✔   
7e225619e8af   ✘    ✘           ✘         ✘    ✘           ✘   
8d6c414cd2fb   ✘    ✘           ✔         ✘    ✘           ✔   

✔ pass
✘ fail

reuseport_bindtodevice.c:

// SPDX-License-Identifier: GPL-2.0

/* Test that listening sockets that have SO_BINDTODEVICE set are preferred
 * over ones that do not. All of the sockets have SO_REUSEPORT set.
 */

#define _GNU_SOURCE

#include <arpa/inet.h>
#include <errno.h>
#include <error.h>
#include <linux/in.h>
#include <linux/unistd.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>


static const int SEND_PORT = 8888;
static const int RECV_PORT = 8889;

static const char *get_family_name(int domain)
{
	if (domain == AF_INET)
		return "IPv4";
	else if (domain == AF_INET6)
		return "IPv6";
	else
		error(1, 0, "Unknown address family \"%d\"", domain);

	return "";
}

static void build_rcv_fd(int domain, int type, int *rcv_fds, int count,
			 const char *ifname, bool do_connect)
{
	struct sockaddr_storage saddr, daddr;
	int opt, i;

	if (domain == AF_INET) {
		struct sockaddr_in *saddr4 = (struct sockaddr_in *)&saddr,
				   *daddr4 = (struct sockaddr_in *)&daddr;

		saddr4->sin_family = AF_INET;
		saddr4->sin_addr.s_addr = htonl(INADDR_ANY);
		saddr4->sin_port = htons(RECV_PORT);

		daddr4->sin_family = AF_INET;
		daddr4->sin_addr.s_addr = htonl(INADDR_ANY);
		daddr4->sin_port = htons(SEND_PORT);
	} else if (domain == AF_INET6) {
		struct sockaddr_in6 *saddr6 = (struct sockaddr_in6 *)&saddr,
				    *daddr6 = (struct sockaddr_in6 *)&daddr;

		saddr6->sin6_family = AF_INET6;
		saddr6->sin6_addr = in6addr_any;
		saddr6->sin6_port = htons(RECV_PORT);

		daddr6->sin6_family = AF_INET6;
		daddr6->sin6_addr = in6addr_any;
		daddr6->sin6_port = htons(SEND_PORT);
	} else {
		error(1, 0, "Unsupported family %d", domain);
	}

	for (i = 0; i < count; ++i) {
		rcv_fds[i] = socket(domain, type, 0);
		if (rcv_fds[i] < 0)
			error(1, errno, "failed to create receive socket");

		opt = 1;
		if (setsockopt(rcv_fds[i], SOL_SOCKET, SO_REUSEPORT, &opt,
			       sizeof(opt)))
			error(1, errno, "failed to set SO_REUSEPORT");

		if (ifname && setsockopt(rcv_fds[i], SOL_SOCKET,
					 SO_BINDTODEVICE, ifname,
					 strlen(ifname)))
			error(1, errno, "failed to set SO_BINDTODEVICE");

		if (bind(rcv_fds[i], (struct sockaddr *)&saddr, sizeof(saddr)))
			error(1, errno, "failed to bind receive socket");

		if (do_connect &&
		    connect(rcv_fds[i], (struct sockaddr *)&daddr,
			    sizeof(daddr)))
			error(1, errno, "failed to connect receive socket");

		if (type == SOCK_STREAM && listen(rcv_fds[i], 10))
			error(1, errno, "failed to listen on receive socket");
	}
}

static int connect_and_send(int domain, int type)
{
	struct sockaddr_storage saddr, daddr;
	int fd;

	if (domain == AF_INET) {
		struct sockaddr_in *saddr4 = (struct sockaddr_in *)&saddr,
				   *daddr4 = (struct sockaddr_in *)&daddr;

		saddr4->sin_family = AF_INET;
		saddr4->sin_addr.s_addr = htonl(INADDR_ANY);
		saddr4->sin_port = htons(SEND_PORT);

		daddr4->sin_family = AF_INET;
		daddr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		daddr4->sin_port = htons(RECV_PORT);
	} else if (domain == AF_INET6) {
		struct sockaddr_in6 *saddr6 = (struct sockaddr_in6 *)&saddr,
				    *daddr6 = (struct sockaddr_in6 *)&daddr;

		saddr6->sin6_family = AF_INET6;
		saddr6->sin6_addr = in6addr_any;
		saddr6->sin6_port = htons(SEND_PORT);

		daddr6->sin6_family = AF_INET6;
		daddr6->sin6_addr = in6addr_loopback;
		daddr6->sin6_port = htons(RECV_PORT);
	} else {
		error(1, 0, "Unsupported family %d", domain);
	}

	fd = socket(domain, type, 0);
	if (fd < 0)
		error(1, errno, "failed to create send socket");

	if (bind(fd, (struct sockaddr *)&saddr, sizeof(saddr)))
		error(1, errno, "failed to bind send socket");

	if (connect(fd, (struct sockaddr *)&daddr, sizeof(daddr)))
		error(1, errno, "failed to connect send socket");

	if (send(fd, "a", 1, 0) < 0)
		error(1, errno, "failed to send message");

	return fd;
}

static int receive_once(int epfd, int type)
{
	struct epoll_event ev;
	int i, fd;
	char buf[8];

	i = epoll_wait(epfd, &ev, 1, 3);
	if (i < 0)
		error(1, errno, "epoll_wait failed");
	else if (i == 0)
		error(1, errno, "no socket is ready");

	if (type == SOCK_STREAM) {
		fd = accept(ev.data.fd, NULL, NULL);
		if (fd < 0)
			error(1, errno, "failed to accept");
		i = recv(fd, buf, sizeof(buf), 0);
		close(fd);
	} else {
		i = recv(ev.data.fd, buf, sizeof(buf), 0);
	}

	if (i < 0)
		error(1, errno, "failed to recv");

	return ev.data.fd;
}

static int test(int *rcv_fds, int count, int domain, int type, int fd)
{
	int epfd, i, send_fd, recv_fd;
	struct epoll_event ev;

	epfd = epoll_create(1);
	if (epfd < 0)
		error(1, errno, "failed to create epoll");

	ev.events = EPOLLIN;
	for (i = 0; i < count; ++i) {
		ev.data.fd = rcv_fds[i];
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, rcv_fds[i], &ev))
			error(1, errno, "failed to register sock epoll");
	}

	send_fd = connect_and_send(domain, type);

	recv_fd = receive_once(epfd, type);

	close(send_fd);
	close(epfd);

	return recv_fd == fd;
}

static int run_one_test(int domain, int type, bool do_connect)
{
	/* Below we test that a socket listening with SO_BINDTODEVICE set is
	 * always selected in preference over a socket listening without. Bugs
	 * where this is not the case often result in sockets created first or
	 * last to get picked. So below we make sure that there are always
	 * sockets with SO_BINDTODEVICE created before and after a specific
	 * socket is created.
	 */
	int rcv_fds[10], i, result;

	build_rcv_fd(AF_INET, type, rcv_fds, 2, NULL, do_connect);
	build_rcv_fd(AF_INET6, type, rcv_fds + 2, 2, NULL, do_connect);
	build_rcv_fd(domain, type, rcv_fds + 4, 1, "lo", do_connect);
	build_rcv_fd(AF_INET, type, rcv_fds + 5, 2, NULL, do_connect);
	build_rcv_fd(AF_INET6, type, rcv_fds + 7, 2, NULL, do_connect);
	result = test(rcv_fds, 9, domain, type, rcv_fds[4]);
	for (i = 0; i < 9; ++i)
		close(rcv_fds[i]);
	if (result)
		fprintf(stderr, "pass\n");
	else
		fprintf(stderr, "fail\n");
	return result;
}

static int test_family(int domain)
{
	int result = 1;

	fprintf(stderr, "%s TCP ... ", get_family_name(domain));
	result &= run_one_test(domain, SOCK_STREAM, false);

	fprintf(stderr, "%s UDP unconnected ... ", get_family_name(domain));
	result &= run_one_test(domain, SOCK_DGRAM, false);

	fprintf(stderr, "%s UDP connected ... ", get_family_name(domain));
	result &= run_one_test(domain, SOCK_DGRAM, true);

	return result;
}

int main(void)
{
	int result = 1;

	result &= test_family(AF_INET);
	result &= test_family(AF_INET6);

	if (result) {
		fprintf(stderr, "SUCCESS\n");
		return 0;
	}
	fprintf(stderr, "FAIL\n");
	return 1;
}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2022-06-13  3:14 ` Benjamin Poirier
@ 2022-06-13  3:52   ` David Ahern
  2022-06-14  0:39     ` Benjamin Poirier
  2022-06-26 22:25   ` Mike Manning
  1 sibling, 1 reply; 9+ messages in thread
From: David Ahern @ 2022-06-13  3:52 UTC (permalink / raw)
  To: Benjamin Poirier, Mike Manning; +Cc: Netdev, Saikrishna Arcot, Craig Gallek

On 6/12/22 9:14 PM, Benjamin Poirier wrote:
> On 2021-10-05 14:03 +0100, Mike Manning wrote:
> [...]
>>
>> Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
>> Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
>> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
>> ---
>>
>> diff nettest-baseline-9e9fb7655ed5.txt nettest-fix.txt
>> 955,956c955,956
>> < TEST: IPv4 TCP connection over VRF with SNAT                                  [FAIL]
>> < TEST: IPv6 TCP connection over VRF with SNAT                                  [FAIL]
>> ---
>>> TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
>>> TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]
>> 958,959c958,959
>> < Tests passed: 713
>> < Tests failed:   5
>> ---
>>> Tests passed: 715
>>> Tests failed:   3
>>
>> ---
>>  net/ipv4/inet_hashtables.c  | 4 +++-
>>  net/ipv4/udp.c              | 3 ++-
>>  net/ipv6/inet6_hashtables.c | 2 +-
>>  net/ipv6/udp.c              | 3 ++-
>>  4 files changed, 8 insertions(+), 4 deletions(-)
>>
> 
> Hi Mike,
> 
> I was looking at this commit, 8d6c414cd2fb ("net: prefer socket bound to
> interface when not in VRF"), and I get the feeling that it is only
> partially effective. It works with UDP connected sockets but it doesn't
> work for TCP and UDP unconnected sockets.
> 
> The compute_score() functions are a bit misleading. Because of the
> reuseport shortcut in their callers (inet_lhash2_lookup() and the like),
> the first socket with score > 0 may be chosen, not necessarily the
> socket with highest score. In order to prefer certain sockets, I think
> an approach like commit d894ba18d4e4 ("soreuseport: fix ordering for
> mixed v4/v6 sockets") would be needed. What do you think?
> 
> Extra info:
> 1) fcnal-test.sh results
> 
> I tried to reproduce the fcnal-test.sh test results quoted above but in
> my case the test cases already pass at 8d6c414cd2fb^ and 9e9fb7655ed5.
> Moreover I believe those test cases don't have multiple listening
> sockets. So that just added to my confusion.
> 
> Running 9e9fb7655ed5,
> root@vsid:/src/linux/tools/testing/selftests/net# ./fcnal-test.sh -t use_cases

use_cases group is a catchall for bug reports. You want run all of the
TCP and UDP cases to cover socket permutations and I know some of those
cover dual listeners (though I can't remember ATM if that is only the
MD5 tests). If not, you can add them fairly easily and illustrate your
point.

As for compute_score, it does weight device binds a bit higher. TCP:

score =  sk->sk_bound_dev_if ? 2 : 1;

UDP:
if (sk->sk_bound_dev_if)
        score += 4;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2022-06-13  3:52   ` David Ahern
@ 2022-06-14  0:39     ` Benjamin Poirier
  0 siblings, 0 replies; 9+ messages in thread
From: Benjamin Poirier @ 2022-06-14  0:39 UTC (permalink / raw)
  To: David Ahern; +Cc: Mike Manning, Netdev, Saikrishna Arcot, Craig Gallek

On 2022-06-12 21:52 -0600, David Ahern wrote:
[...]
> > 
> > Hi Mike,
> > 
> > I was looking at this commit, 8d6c414cd2fb ("net: prefer socket bound to
> > interface when not in VRF"), and I get the feeling that it is only
> > partially effective. It works with UDP connected sockets but it doesn't
> > work for TCP and UDP unconnected sockets.
> > 
> > The compute_score() functions are a bit misleading. Because of the
> > reuseport shortcut in their callers (inet_lhash2_lookup() and the like),
> > the first socket with score > 0 may be chosen, not necessarily the
> > socket with highest score. In order to prefer certain sockets, I think
> > an approach like commit d894ba18d4e4 ("soreuseport: fix ordering for
> > mixed v4/v6 sockets") would be needed. What do you think?
> > 
> > Extra info:
> > 1) fcnal-test.sh results
> > 
> > I tried to reproduce the fcnal-test.sh test results quoted above but in
> > my case the test cases already pass at 8d6c414cd2fb^ and 9e9fb7655ed5.
> > Moreover I believe those test cases don't have multiple listening
> > sockets. So that just added to my confusion.
> > 
> > Running 9e9fb7655ed5,
> > root@vsid:/src/linux/tools/testing/selftests/net# ./fcnal-test.sh -t use_cases
> 
> use_cases group is a catchall for bug reports. You want run all of the
> TCP and UDP cases to cover socket permutations and I know some of those
> cover dual listeners (though I can't remember ATM if that is only the
> MD5 tests).

I was talking specifically about the two test cases quoted in Mike's
original posting, in case that wasn't clear.

> If not, you can add them fairly easily and illustrate your
> point.

What about reuseport_bindtodevice.c from my previous message? Running it
on the current net/master (619c010a6539) shows:

root@vsid:~# ./reuseport_bindtodevice
IPv4 TCP ... fail
IPv4 UDP unconnected ... fail
IPv4 UDP connected ... pass
IPv6 TCP ... fail
IPv6 UDP unconnected ... fail
IPv6 UDP connected ... pass
FAIL

> 
> As for compute_score, it does weight device binds a bit higher. TCP:
> 
> score =  sk->sk_bound_dev_if ? 2 : 1;
> 
> UDP:
> if (sk->sk_bound_dev_if)
>         score += 4;

I think there was a misunderstanding. A higher score does not lead to a
higher preference for a socket in many cases. That's the pitfall I tried
to describe earlier.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2022-06-13  3:14 ` Benjamin Poirier
  2022-06-13  3:52   ` David Ahern
@ 2022-06-26 22:25   ` Mike Manning
  2022-06-27  0:05     ` Benjamin Poirier
  1 sibling, 1 reply; 9+ messages in thread
From: Mike Manning @ 2022-06-26 22:25 UTC (permalink / raw)
  To: Benjamin Poirier; +Cc: Netdev, David Ahern, Saikrishna Arcot, Craig Gallek

On 13/06/2022 04:14, Benjamin Poirier wrote:
> On 2021-10-05 14:03 +0100, Mike Manning wrote:
> [...]
>> Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
>> Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
>> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
>> ---
>>
>> diff nettest-baseline-9e9fb7655ed5.txt nettest-fix.txt
>> 955,956c955,956
>> < TEST: IPv4 TCP connection over VRF with SNAT                                  [FAIL]
>> < TEST: IPv6 TCP connection over VRF with SNAT                                  [FAIL]
>> ---
>>> TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
>>> TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]
>> 958,959c958,959
>> < Tests passed: 713
>> < Tests failed:   5
>> ---
>>> Tests passed: 715
>>> Tests failed:   3
>> ---
>>  net/ipv4/inet_hashtables.c  | 4 +++-
>>  net/ipv4/udp.c              | 3 ++-
>>  net/ipv6/inet6_hashtables.c | 2 +-
>>  net/ipv6/udp.c              | 3 ++-
>>  4 files changed, 8 insertions(+), 4 deletions(-)
>>
> Hi Mike,
>
> I was looking at this commit, 8d6c414cd2fb ("net: prefer socket bound to
> interface when not in VRF"), and I get the feeling that it is only
> partially effective. It works with UDP connected sockets but it doesn't
> work for TCP and UDP unconnected sockets.
>
> The compute_score() functions are a bit misleading. Because of the
> reuseport shortcut in their callers (inet_lhash2_lookup() and the like),
> the first socket with score > 0 may be chosen, not necessarily the
> socket with highest score. In order to prefer certain sockets, I think
> an approach like commit d894ba18d4e4 ("soreuseport: fix ordering for
> mixed v4/v6 sockets") would be needed. What do you think?

Hi Benjamin,

We had never observed any issues with any of our configurations. The VRF changes introduced

in 7e225619e8af result in a failure being returned when there is no device match, which satisfies

the requirements for VRF handling so unbound vs. bound to an l3mdev - the score is irrelevant.

However, 8d6c414cd2fb was subsequently needed as unbound and bound sockets were scored

equally, so that fix reinstated a higher score needed for sockets bound to an interface. Wrt to

your query, the scoring resolved the issue. I am unaware of any problematic use-cases, but in

any case, my changes are in line with the current approach.


>
> Extra info:
> 1) fcnal-test.sh results
>
> I tried to reproduce the fcnal-test.sh test results quoted above but in
> my case the test cases already pass at 8d6c414cd2fb^ and 9e9fb7655ed5.
> Moreover I believe those test cases don't have multiple listening
> sockets. So that just added to my confusion.

The fix was not targeting those 2 failed test cases, the output was only to show the before/after

test results. It is unclear why they failed for me with with the 9e9fb7655ed5 baseline,


> Running 9e9fb7655ed5,
> root@vsid:/src/linux/tools/testing/selftests/net# ./fcnal-test.sh -t use_cases
> [...]
>
> #################################################################
> SNAT on VRF
>
> TEST: IPv4 TCP connection over VRF with SNAT                                  [ OK ]
> TEST: IPv6 TCP connection over VRF with SNAT                                  [ OK ]
>
> Tests passed:  16
> Tests failed:   0
>
>
> 2) reuseport_bindtodevice test
>
> I wrote a selftest based on
> tools/testing/selftests/net/reuseport_addr_any.c It tests that listening
> sockets that have SO_BINDTODEVICE set are preferred over ones that do
> not. All of the sockets have SO_REUSEPORT set. I ran it over a few
> relevant revisions:
>
>                IPv4                       IPv6
> HEAD           TCP  UDP unconn  UDP conn  TCP  UDP unconn  UDP conn
> 6a5ef90c58da^  ✔    ✔           ✔         ✔    ✔           ✔   
> 6a5ef90c58da   ✔    ✘           ✔         ✔    ✘           ✔   
> fd1914b2901b   ✘    ✘           ✔         ✘    ✘           ✔   
> 7e225619e8af   ✘    ✘           ✘         ✘    ✘           ✘   
> 8d6c414cd2fb   ✘    ✘           ✔         ✘    ✘           ✔   
>
> ✔ pass
> ✘ fail
>
> reuseport_bindtodevice.c:
>
> // SPDX-License-Identifier: GPL-2.0
>
> /* Test that listening sockets that have SO_BINDTODEVICE set are preferred
>  * over ones that do not. All of the sockets have SO_REUSEPORT set.
>  */
>
> #define _GNU_SOURCE
>
> #include <arpa/inet.h>
> #include <errno.h>
> #include <error.h>
> #include <linux/in.h>
> #include <linux/unistd.h>
> #include <stdbool.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/epoll.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <unistd.h>
>
>
> static const int SEND_PORT = 8888;
> static const int RECV_PORT = 8889;
>
> static const char *get_family_name(int domain)
> {
> 	if (domain == AF_INET)
> 		return "IPv4";
> 	else if (domain == AF_INET6)
> 		return "IPv6";
> 	else
> 		error(1, 0, "Unknown address family \"%d\"", domain);
>
> 	return "";
> }
>
> static void build_rcv_fd(int domain, int type, int *rcv_fds, int count,
> 			 const char *ifname, bool do_connect)
> {
> 	struct sockaddr_storage saddr, daddr;
> 	int opt, i;
>
> 	if (domain == AF_INET) {
> 		struct sockaddr_in *saddr4 = (struct sockaddr_in *)&saddr,
> 				   *daddr4 = (struct sockaddr_in *)&daddr;
>
> 		saddr4->sin_family = AF_INET;
> 		saddr4->sin_addr.s_addr = htonl(INADDR_ANY);
> 		saddr4->sin_port = htons(RECV_PORT);
>
> 		daddr4->sin_family = AF_INET;
> 		daddr4->sin_addr.s_addr = htonl(INADDR_ANY);
> 		daddr4->sin_port = htons(SEND_PORT);
> 	} else if (domain == AF_INET6) {
> 		struct sockaddr_in6 *saddr6 = (struct sockaddr_in6 *)&saddr,
> 				    *daddr6 = (struct sockaddr_in6 *)&daddr;
>
> 		saddr6->sin6_family = AF_INET6;
> 		saddr6->sin6_addr = in6addr_any;
> 		saddr6->sin6_port = htons(RECV_PORT);
>
> 		daddr6->sin6_family = AF_INET6;
> 		daddr6->sin6_addr = in6addr_any;
> 		daddr6->sin6_port = htons(SEND_PORT);
> 	} else {
> 		error(1, 0, "Unsupported family %d", domain);
> 	}
>
> 	for (i = 0; i < count; ++i) {
> 		rcv_fds[i] = socket(domain, type, 0);
> 		if (rcv_fds[i] < 0)
> 			error(1, errno, "failed to create receive socket");
>
> 		opt = 1;
> 		if (setsockopt(rcv_fds[i], SOL_SOCKET, SO_REUSEPORT, &opt,
> 			       sizeof(opt)))
> 			error(1, errno, "failed to set SO_REUSEPORT");
>
> 		if (ifname && setsockopt(rcv_fds[i], SOL_SOCKET,
> 					 SO_BINDTODEVICE, ifname,
> 					 strlen(ifname)))
> 			error(1, errno, "failed to set SO_BINDTODEVICE");
>
> 		if (bind(rcv_fds[i], (struct sockaddr *)&saddr, sizeof(saddr)))
> 			error(1, errno, "failed to bind receive socket");
>
> 		if (do_connect &&
> 		    connect(rcv_fds[i], (struct sockaddr *)&daddr,
> 			    sizeof(daddr)))
> 			error(1, errno, "failed to connect receive socket");
>
> 		if (type == SOCK_STREAM && listen(rcv_fds[i], 10))
> 			error(1, errno, "failed to listen on receive socket");
> 	}
> }
>
> static int connect_and_send(int domain, int type)
> {
> 	struct sockaddr_storage saddr, daddr;
> 	int fd;
>
> 	if (domain == AF_INET) {
> 		struct sockaddr_in *saddr4 = (struct sockaddr_in *)&saddr,
> 				   *daddr4 = (struct sockaddr_in *)&daddr;
>
> 		saddr4->sin_family = AF_INET;
> 		saddr4->sin_addr.s_addr = htonl(INADDR_ANY);
> 		saddr4->sin_port = htons(SEND_PORT);
>
> 		daddr4->sin_family = AF_INET;
> 		daddr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> 		daddr4->sin_port = htons(RECV_PORT);
> 	} else if (domain == AF_INET6) {
> 		struct sockaddr_in6 *saddr6 = (struct sockaddr_in6 *)&saddr,
> 				    *daddr6 = (struct sockaddr_in6 *)&daddr;
>
> 		saddr6->sin6_family = AF_INET6;
> 		saddr6->sin6_addr = in6addr_any;
> 		saddr6->sin6_port = htons(SEND_PORT);
>
> 		daddr6->sin6_family = AF_INET6;
> 		daddr6->sin6_addr = in6addr_loopback;
> 		daddr6->sin6_port = htons(RECV_PORT);
> 	} else {
> 		error(1, 0, "Unsupported family %d", domain);
> 	}
>
> 	fd = socket(domain, type, 0);
> 	if (fd < 0)
> 		error(1, errno, "failed to create send socket");
>
> 	if (bind(fd, (struct sockaddr *)&saddr, sizeof(saddr)))
> 		error(1, errno, "failed to bind send socket");
>
> 	if (connect(fd, (struct sockaddr *)&daddr, sizeof(daddr)))
> 		error(1, errno, "failed to connect send socket");
>
> 	if (send(fd, "a", 1, 0) < 0)
> 		error(1, errno, "failed to send message");
>
> 	return fd;
> }
>
> static int receive_once(int epfd, int type)
> {
> 	struct epoll_event ev;
> 	int i, fd;
> 	char buf[8];
>
> 	i = epoll_wait(epfd, &ev, 1, 3);
> 	if (i < 0)
> 		error(1, errno, "epoll_wait failed");
> 	else if (i == 0)
> 		error(1, errno, "no socket is ready");
>
> 	if (type == SOCK_STREAM) {
> 		fd = accept(ev.data.fd, NULL, NULL);
> 		if (fd < 0)
> 			error(1, errno, "failed to accept");
> 		i = recv(fd, buf, sizeof(buf), 0);
> 		close(fd);
> 	} else {
> 		i = recv(ev.data.fd, buf, sizeof(buf), 0);
> 	}
>
> 	if (i < 0)
> 		error(1, errno, "failed to recv");
>
> 	return ev.data.fd;
> }
>
> static int test(int *rcv_fds, int count, int domain, int type, int fd)
> {
> 	int epfd, i, send_fd, recv_fd;
> 	struct epoll_event ev;
>
> 	epfd = epoll_create(1);
> 	if (epfd < 0)
> 		error(1, errno, "failed to create epoll");
>
> 	ev.events = EPOLLIN;
> 	for (i = 0; i < count; ++i) {
> 		ev.data.fd = rcv_fds[i];
> 		if (epoll_ctl(epfd, EPOLL_CTL_ADD, rcv_fds[i], &ev))
> 			error(1, errno, "failed to register sock epoll");
> 	}
>
> 	send_fd = connect_and_send(domain, type);
>
> 	recv_fd = receive_once(epfd, type);
>
> 	close(send_fd);
> 	close(epfd);
>
> 	return recv_fd == fd;
> }
>
> static int run_one_test(int domain, int type, bool do_connect)
> {
> 	/* Below we test that a socket listening with SO_BINDTODEVICE set is
> 	 * always selected in preference over a socket listening without. Bugs
> 	 * where this is not the case often result in sockets created first or
> 	 * last to get picked. So below we make sure that there are always
> 	 * sockets with SO_BINDTODEVICE created before and after a specific
> 	 * socket is created.
> 	 */
> 	int rcv_fds[10], i, result;
>
> 	build_rcv_fd(AF_INET, type, rcv_fds, 2, NULL, do_connect);
> 	build_rcv_fd(AF_INET6, type, rcv_fds + 2, 2, NULL, do_connect);
> 	build_rcv_fd(domain, type, rcv_fds + 4, 1, "lo", do_connect);
> 	build_rcv_fd(AF_INET, type, rcv_fds + 5, 2, NULL, do_connect);
> 	build_rcv_fd(AF_INET6, type, rcv_fds + 7, 2, NULL, do_connect);
> 	result = test(rcv_fds, 9, domain, type, rcv_fds[4]);
> 	for (i = 0; i < 9; ++i)
> 		close(rcv_fds[i]);
> 	if (result)
> 		fprintf(stderr, "pass\n");
> 	else
> 		fprintf(stderr, "fail\n");
> 	return result;
> }
>
> static int test_family(int domain)
> {
> 	int result = 1;
>
> 	fprintf(stderr, "%s TCP ... ", get_family_name(domain));
> 	result &= run_one_test(domain, SOCK_STREAM, false);
>
> 	fprintf(stderr, "%s UDP unconnected ... ", get_family_name(domain));
> 	result &= run_one_test(domain, SOCK_DGRAM, false);
>
> 	fprintf(stderr, "%s UDP connected ... ", get_family_name(domain));
> 	result &= run_one_test(domain, SOCK_DGRAM, true);
>
> 	return result;
> }
>
> int main(void)
> {
> 	int result = 1;
>
> 	result &= test_family(AF_INET);
> 	result &= test_family(AF_INET6);
>
> 	if (result) {
> 		fprintf(stderr, "SUCCESS\n");
> 		return 0;
> 	}
> 	fprintf(stderr, "FAIL\n");
> 	return 1;
> }



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net: prefer socket bound to interface when not in VRF
  2022-06-26 22:25   ` Mike Manning
@ 2022-06-27  0:05     ` Benjamin Poirier
  0 siblings, 0 replies; 9+ messages in thread
From: Benjamin Poirier @ 2022-06-27  0:05 UTC (permalink / raw)
  To: Mike Manning; +Cc: Netdev, David Ahern, Saikrishna Arcot, Craig Gallek

On 2022-06-26 23:25 +0100, Mike Manning wrote:
> On 13/06/2022 04:14, Benjamin Poirier wrote:
> > On 2021-10-05 14:03 +0100, Mike Manning wrote:
[...]
> > Hi Mike,
> >
> > I was looking at this commit, 8d6c414cd2fb ("net: prefer socket bound to
> > interface when not in VRF"), and I get the feeling that it is only
> > partially effective. It works with UDP connected sockets but it doesn't
> > work for TCP and UDP unconnected sockets.
> >
> > The compute_score() functions are a bit misleading. Because of the
> > reuseport shortcut in their callers (inet_lhash2_lookup() and the like),
> > the first socket with score > 0 may be chosen, not necessarily the
> > socket with highest score. In order to prefer certain sockets, I think
> > an approach like commit d894ba18d4e4 ("soreuseport: fix ordering for
> > mixed v4/v6 sockets") would be needed. What do you think?
> 
> Hi Benjamin,
> 
> We had never observed any issues with any of our configurations. The VRF changes introduced
> 
> in 7e225619e8af result in a failure being returned when there is no device match, which satisfies
> 
> the requirements for VRF handling so unbound vs. bound to an l3mdev - the score is irrelevant.
> 
> However, 8d6c414cd2fb was subsequently needed as unbound and bound sockets were scored
> 
> equally, so that fix reinstated a higher score needed for sockets bound to an interface. Wrt to
> 
> your query, the scoring resolved the issue. I am unaware of any problematic use-cases, but in
> 
> any case, my changes are in line with the current approach.

The problematic use case involves sockets that have SO_REUSEPORT +
SO_BINDTODEVICE. Earlier in the thread I've included a test that
demonstrates the issue.

For the Cumulus kernel I've put in place a workaround that removes the
reuseport optimization (see below). I probably won't have time to work
on a proper upstream solution.

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b9d995b5ce24..1765ac837358 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -253,16 +253,20 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = lookup_reuseport(net, sk, skb, doff,
-						  saddr, sport, daddr, hnum);
-			if (result)
-				return result;
-
 			result = sk;
 			hiscore = score;
 		}
 	}
 
+	if (result) {
+		struct sock *reuse_sk;
+
+		reuse_sk = lookup_reuseport(net, result, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			result = reuse_sk;
+	}
+
 	return result;
 }
 
> > Extra info:
> > 1) fcnal-test.sh results
> >
> > I tried to reproduce the fcnal-test.sh test results quoted above but in
> > my case the test cases already pass at 8d6c414cd2fb^ and 9e9fb7655ed5.
> > Moreover I believe those test cases don't have multiple listening
> > sockets. So that just added to my confusion.
> 
> The fix was not targeting those 2 failed test cases, the output was only to show the before/after
> 
> test results. It is unclear why they failed for me with with the 9e9fb7655ed5 baseline,

Thanks for taking a look, that was also my guess.

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-06-27  0:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-05 13:03 [PATCH] net: prefer socket bound to interface when not in VRF Mike Manning
2021-10-07 14:07 ` Jakub Kicinski
2021-10-07 14:10   ` David Ahern
2021-10-07 14:27     ` Jakub Kicinski
2022-06-13  3:14 ` Benjamin Poirier
2022-06-13  3:52   ` David Ahern
2022-06-14  0:39     ` Benjamin Poirier
2022-06-26 22:25   ` Mike Manning
2022-06-27  0:05     ` Benjamin Poirier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.