On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
> On 2021/6/23 at 12:15 AM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This creates a tun device and brings it up, then finds out the link-local
> > address the kernel automatically assigns to it.
> > 
> > It sends a ping to that address, from a fake LL address of its own, and
> > then waits for a response.
> > 
> > If the virtio_net_hdr stuff is all working correctly, it gets a response
> > and manages to understand it.
> 
> 
> I wonder whether it is worth taking on dependencies such as IPv6 or the
> kernel networking stack.
> 
> How about simply using a packet socket bound to the tun device to
> receive and send packets?
> 

I pondered that but figured that using the kernel's network stack
wasn't too much of an additional dependency. We *could* use an
AF_PACKET socket on the tun device and then drive both ends, but given
that the kernel *automatically* assigns a link-local address when we
bring the device up anyway, it seemed simple enough just to use ICMP.
I also happened to have the ICMP generation/checking code lying around
anyway in the same emacs instance, so it was reduced to a previously
solved problem.
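
For comparison, the AF_PACKET alternative would look roughly like the
sketch below. It is not part of the patch: fill_packet_addr() and
open_packet_socket() are hypothetical helper names, the choice of
SOCK_RAW and ETH_P_IPV6 is an assumption, and actually opening the
socket needs CAP_NET_RAW.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: describe "IPv6 packets on interface ifindex". */
static void fill_packet_addr(struct sockaddr_ll *sll, unsigned int ifindex)
{
	assert(sll != NULL);

	memset(sll, 0, sizeof(*sll));
	sll->sll_family = AF_PACKET;
	sll->sll_protocol = htons(ETH_P_IPV6);
	sll->sll_ifindex = (int)ifindex;
}

/*
 * Hypothetical helper: open a packet socket bound to the named device.
 * Returns the fd, or -1 on failure (unknown device, no CAP_NET_RAW, ...).
 */
static int open_packet_socket(const char *ifname)
{
	struct sockaddr_ll sll;
	unsigned int ifindex = if_nametoindex(ifname);
	int fd;

	if (!ifindex)
		return -1;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IPV6));
	if (fd < 0)
		return -1;

	fill_packet_addr(&sll, ifindex);
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The test would then have to write the echo request and read the reply
on that socket itself, which is the "drive both ends" part that the
ICMP approach avoids.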

We *should* eventually expand this test case to attach an AF_PACKET
socket to vhost-net, instead of using a tun device as the back end.
(Although I don't really see *why* vhost is limited to AF_PACKET. Why
*can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)


> > +	/*
> > +	 * I just want to map the *whole* of userspace address space. But
> > +	 * from userspace I don't know what that is. On x86_64 it would be:
> > +	 *
> > +	 * vmem->regions[0].guest_phys_addr = 4096;
> > +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> > +	 * vmem->regions[0].userspace_addr = 4096;
> > +	 *
> > +	 * For now, just ensure we put everything inside a single BSS region.
> > +	 */
> > +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > +	vmem->regions[0].memory_size = sizeof(rings);
> 
> 
> Instead of doing tricks like this, we can do it in another way:
> 
> 1) enable device IOTLB
> 2) wait for the IOTLB miss request (iova, len) and update identity 
> mapping accordingly
> 
> This should work for all the archs (with some performance hit).

Ick. For my actual application (OpenConnect) I'm either going to suck
it up and put in the arch-specific limits like in the comment above, or
I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
about elsewhere. (Probably the former, since if I'm requiring kernel
changes then I have grander plans around extending AF_TLS to do DTLS,
then hooking that directly up to the tun socket via BPF and a sockmap
without the data frames ever going to userspace at all.)

For this test case, a hard-coded single address range in BSS is fine.
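
A minimal sketch of that single-region table, written as a standalone
helper rather than as the code in the test. The name
single_region_table() is hypothetical; the structures come from
<linux/vhost.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <linux/vhost.h>

/*
 * Hypothetical helper: build a one-region memory table in which the
 * "guest physical" address equals the userspace address, so vhost can
 * use the pointers placed in the descriptors without translation.
 * The caller frees the result.
 */
static struct vhost_memory *single_region_table(void *base, size_t len)
{
	struct vhost_memory *vmem;

	assert(base != NULL && len > 0);

	vmem = calloc(1, sizeof(*vmem) + sizeof(vmem->regions[0]));
	if (!vmem)
		return NULL;

	vmem->nregions = 1;
	vmem->regions[0].guest_phys_addr = (uint64_t)(uintptr_t)base;
	vmem->regions[0].userspace_addr = (uint64_t)(uintptr_t)base;
	vmem->regions[0].memory_size = len;
	return vmem;
}
```

The table would then be handed to the VHOST_SET_MEM_TABLE ioctl on the
vhost fd. Every descriptor address has to fall inside [base, base+len),
which is why the rings and the packet buffers all live in the one
'rings' object.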

I've now added !IFF_NO_PI support to the test case, but as noted it
fails just like the other ones I'd already marked with #if 0. That is
because vhost-net pulls some value for 'sock_hlen' out of its posterior
based on some assumption around the vhost features, and then expects
sock_recvmsg() to return precisely that number of bytes more than the
value it peeks in the skb at the head of the sock's queue.
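
The mismatch can be written down as plain arithmetic. This is only a
sketch of my reading of the failure, not kernel code: tun_hdr_len() and
assumed_hdr_len() are hypothetical names, and the only facts used are
the sizes of struct tun_pi and struct virtio_net_hdr.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_tun.h>
#include <linux/virtio_net.h>

/* Hypothetical helper: bytes a tun socket puts in front of each packet. */
static size_t tun_hdr_len(int vnet_hdr_sz, int pi)
{
	assert(vnet_hdr_sz >= 0);
	return (pi ? sizeof(struct tun_pi) : 0) + (size_t)vnet_hdr_sz;
}

/*
 * Hypothetical helper: the header length vhost-net is described as
 * assuming when it is not adding the virtio_net_hdr itself.
 */
static size_t assumed_hdr_len(void)
{
	return sizeof(struct virtio_net_hdr);	/* 10 bytes */
}
```

With vnet_hdr_sz 10 and no PI the two agree. With PI enabled the tun
side is four bytes longer, which would be consistent with the gap in
the "len 78, expected 74" message quoted in the commit message below.
With vnet_hdr_sz 0 or 12 they disagree even without PI.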

I think I can fix *all* those test cases by making tun_get_socket()
take an extra 'int *' argument, and use that to return the *actual*
value of sock_hlen. Here's the updated test case in the meantime:


From cf74e3fc80b8fd9df697a42cfc1ff3887de18f78 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed, 23 Jun 2021 16:38:56 +0100
Subject: [PATCH] test_vhost_net: add test cases with tun_pi header

These fail too, for the same reason as the previous tests were
guarded with #if 0: vhost-net pulls 'sock_hlen' out of its
posterior and just assumes it's 10 bytes. And then barfs when
a sock_recvmsg() doesn't return precisely ten bytes more than
it peeked in the head skb:

[1296757.531103] Discarded rx packet:  len 78, expected 74

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 .../testing/selftests/vhost/test_vhost_net.c  | 97 +++++++++++++------
 1 file changed, 65 insertions(+), 32 deletions(-)

diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
index fd4a2b0e42f0..734b3015a5bd 100644
--- a/tools/testing/selftests/vhost/test_vhost_net.c
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -48,7 +48,7 @@ static unsigned char hexchar(char *hex)
 	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
 }
 
-int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr)
 {
 	int tun_fd = open("/dev/net/tun", O_RDWR);
 	if (tun_fd == -1)
@@ -56,7 +56,9 @@ int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
 
 	struct ifreq ifr = { 0 };
 
-	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	ifr.ifr_flags = IFF_TUN;
+	if (!pi)
+		ifr.ifr_flags |= IFF_NO_PI;
 	if (vnet_hdr_sz)
 		ifr.ifr_flags |= IFF_VNET_HDR;
 
@@ -249,11 +251,18 @@ static inline uint16_t csum_finish(uint32_t sum)
 	return htons((uint16_t)(~sum));
 }
 
-static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+static int create_icmp_echo(unsigned char *data, int pi, struct in6_addr *dst,
 			    struct in6_addr *src, uint16_t id, uint16_t seq)
 {
 	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
-	const int plen = sizeof(struct ip6_hdr) + icmplen;
+	int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		data += sizeof(*pi);
+		plen += sizeof(*pi);
+		pi->proto = htons(ETH_P_IPV6);
+	}
 
 	struct ip6_hdr *iph = (void *)data;
 	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
@@ -312,8 +321,21 @@ static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
 }
 
 
-static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+static int check_icmp_response(unsigned char *data, uint32_t len, int pi,
+			       struct in6_addr *dst, struct in6_addr *src)
 {
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		if (len < sizeof(*pi))
+			return 0;
+
+		if (pi->proto != htons(ETH_P_IPV6))
+			return 0;
+
+		data += sizeof(*pi);
+		len -= sizeof(*pi);
+	}
+
 	struct ip6_hdr *iph = (void *)data;
 	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
 		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
@@ -337,7 +359,7 @@ static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_add
 #endif
 
 
-int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features)
 {
 	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
 	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
@@ -353,7 +375,7 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	/* Pick up the link-local address that the kernel
 	 * assigns to the tun device. */
 	struct in6_addr tun_addr;
-	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr);
 	if (tun_fd < 0)
 		goto err;
 
@@ -387,18 +409,18 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	local_addr.s6_addr16[0] = htons(0xfe80);
 	local_addr.s6_addr16[7] = htons(1);
 
+
 	/* Set up RX and TX descriptors; the latter with ping packets ready to
 	 * send to the kernel, but don't actually send them yet. */
 	for (int i = 0; i < RING_SIZE; i++) {
 		struct pkt_buf *pkt = &rings[1].pkts[i];
-		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
-					    &local_addr, 0x4747, i);
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], pi,
+					    &tun_addr, &local_addr, 0x4747, i);
 
 		rings[1].desc[i].addr = vio64((uint64_t)pkt);
 		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
 		rings[1].avail_ring[i] = vio16(i);
 
-
 		pkt = &rings[0].pkts[i];
 		rings[0].desc[i].addr = vio64((uint64_t)pkt);
 		rings[0].desc[i].len = vio32(sizeof(*pkt));
@@ -438,9 +460,10 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 				return -1;
 
 			if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
-						&local_addr, &tun_addr)) {
+						pi, &local_addr, &tun_addr)) {
 				ret = 0;
-				printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+				printf("Success (hdr %d, xdp %d, pi %d, features %llx)\n",
+				       vnet_hdr_sz, xdp, pi, (unsigned long long)features);
 				goto err;
 			}
 
@@ -466,51 +489,62 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	return ret;
 }
 
-
-int main(void)
+/* Perform the given test with all four combinations of XDP/PI */
+int test_four(int vnet_hdr_sz, uint64_t features)
 {
-	int ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	int ret = test_vhost(vnet_hdr_sz, 0, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	ret = test_vhost(vnet_hdr_sz, 0, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+#if 0 /* These don't work *either* for the same reason as the #if 0 later */
+	ret = test_vhost(vnet_hdr_sz, 1, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	ret = test_vhost(vnet_hdr_sz, 1, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
+#endif
+	return ret;
+}
 
-	ret = test_vhost(10, 0, 0);
-	if (ret && ret != KSFT_SKIP)
-		return ret;
+int main(void)
+{
+	int ret;
 
-	ret = test_vhost(10, 1, 0);
+	ret = test_four(10, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-#if 0 /* These ones will fail */
-	ret = test_vhost(0, 0, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+			    (1ULL << VIRTIO_F_VERSION_1)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 0, 0);
+
+#if 0
+	/*
+	 * These ones will fail, because right now vhost *assumes* that the
+	 * underlying (tun, etc.) socket will be doing a header of precisely
+	 * sizeof(struct virtio_net_hdr), if vhost isn't doing so itself due
+	 * to VHOST_NET_F_VIRTIO_NET_HDR.
+	 *
+	 * That assumption breaks both tun with no IFF_VNET_HDR, and also
+	 * presumably raw sockets. So leave these test cases disabled for
+	 * now until it's fixed.
+	 */
+	ret = test_four(0, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 1, 0);
+	ret = test_four(12, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 #endif
-- 
2.31.1