* [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
@ 2021-06-19 13:33 David Woodhouse
  2021-06-21  7:00 ` Jason Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-19 13:33 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang

[-- Attachment #1: Type: text/plain, Size: 2410 bytes --]

From: David Woodhouse <dwmw@amazon.co.uk>

In tun_get_user(), skb->protocol is either taken from the tun_pi header
or inferred from the first byte of the packet in IFF_TUN mode, while
eth_type_trans() is called only in the IFF_TAP mode where the payload
is expected to be an Ethernet frame.

The alternative path in tun_xdp_one() was unconditionally using
eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to
do the correct thing for IFF_TUN mode, as tun_get_user() does.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
---
How is my userspace application going to know that the kernel has this
fix? Should we add a flag to TUN_FEATURES to show that vhost-net in
*IFF_TUN* mode is supported?

 drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4cf38be26dc9..f812dcdc640e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun,
 		err = -EINVAL;
 		goto out;
 	}
+	switch (tun->flags & TUN_TYPE_MASK) {
+	case IFF_TUN:
+		if (tun->flags & IFF_NO_PI) {
+			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
+
+			switch (ip_version) {
+			case 4:
+				skb->protocol = htons(ETH_P_IP);
+				break;
+			case 6:
+				skb->protocol = htons(ETH_P_IPV6);
+				break;
+			default:
+				atomic_long_inc(&tun->dev->rx_dropped);
+				kfree_skb(skb);
+				err = -EINVAL;
+				goto out;
+			}
+		} else {
+			struct tun_pi *pi = (struct tun_pi *)skb->data;
+			if (!pskb_may_pull(skb, sizeof(*pi))) {
+				atomic_long_inc(&tun->dev->rx_dropped);
+				kfree_skb(skb);
+				err = -ENOMEM;
+				goto out;
+			}
+			skb_pull_inline(skb, sizeof(*pi));
+			skb->protocol = pi->proto;
+		}
+
+		skb_reset_mac_header(skb);
+		skb->dev = tun->dev;
+		break;
+	case IFF_TAP:
+		if (!pskb_may_pull(skb, ETH_HLEN)) {
+			atomic_long_inc(&tun->dev->rx_dropped);
+			kfree_skb(skb);
+			err = -ENOMEM;
+			goto out;
+		}
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
 
-	skb->protocol = eth_type_trans(skb, tun->dev);
 	skb_reset_network_header(skb);
 	skb_probe_transport_header(skb);
 	skb_record_rx_queue(skb, tfile->queue_index);
-- 
2.31.1



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-19 13:33 [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
@ 2021-06-21  7:00 ` Jason Wang
  2021-06-21 10:52   ` David Woodhouse
  2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
  2 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-21  7:00 UTC (permalink / raw)
  To: David Woodhouse, netdev


On 2021/6/19 9:33 PM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> In tun_get_user(), skb->protocol is either taken from the tun_pi header
> or inferred from the first byte of the packet in IFF_TUN mode, while
> eth_type_trans() is called only in the IFF_TAP mode where the payload
> is expected to be an Ethernet frame.
>
> The alternative path in tun_xdp_one() was unconditionally using
> eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to
> do the correct thing for IFF_TUN mode, as tun_get_user() does.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
> ---
> How is my userspace application going to know that the kernel has this
> fix? Should we add a flag to TUN_FEATURES to show that vhost-net in
> *IFF_TUN* mode is supported?


I think it's probably too late to fix, since it should work before
043d222f93ab.

The only way is to backport this fix to stable.


>
>   drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 4cf38be26dc9..f812dcdc640e 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun,
>   		err = -EINVAL;
>   		goto out;
>   	}
> +	switch (tun->flags & TUN_TYPE_MASK) {
> +	case IFF_TUN:
> +		if (tun->flags & IFF_NO_PI) {
> +			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
> +
> +			switch (ip_version) {
> +			case 4:
> +				skb->protocol = htons(ETH_P_IP);
> +				break;
> +			case 6:
> +				skb->protocol = htons(ETH_P_IPV6);
> +				break;
> +			default:
> +				atomic_long_inc(&tun->dev->rx_dropped);
> +				kfree_skb(skb);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		} else {
> +			struct tun_pi *pi = (struct tun_pi *)skb->data;
> +			if (!pskb_may_pull(skb, sizeof(*pi))) {
> +				atomic_long_inc(&tun->dev->rx_dropped);
> +				kfree_skb(skb);
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +			skb_pull_inline(skb, sizeof(*pi));
> +			skb->protocol = pi->proto;
> +		}
> +
> +		skb_reset_mac_header(skb);
> +		skb->dev = tun->dev;
> +		break;
> +	case IFF_TAP:
> +		if (!pskb_may_pull(skb, ETH_HLEN)) {
> +			atomic_long_inc(&tun->dev->rx_dropped);
> +			kfree_skb(skb);
> +			err = -ENOMEM;
> +			goto out;
> +		}
> +		skb->protocol = eth_type_trans(skb, tun->dev);
> +		break;


I wonder whether we can have some codes unification with tun_get_user().

Thanks


> +	}
>   
> -	skb->protocol = eth_type_trans(skb, tun->dev);
>   	skb_reset_network_header(skb);
>   	skb_probe_transport_header(skb);
>   	skb_record_rx_queue(skb, tfile->queue_index);



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-21  7:00 ` Jason Wang
@ 2021-06-21 10:52   ` David Woodhouse
  2021-06-21 14:50     ` David Woodhouse
  2021-06-22  4:34     ` Jason Wang
  0 siblings, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-21 10:52 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Mon, 2021-06-21 at 15:00 +0800, Jason Wang wrote:
> I think it's probably too late to fix, since it should work before
> 043d222f93ab.
> 
> The only way is to backport this fix to stable.

Yeah, I assumed the fix would be backported; if not then the "does the
kernel have it" check is fairly trivial.

I *can* avoid it for now by just using TUNSETSNDBUF to reduce the sndbuf,
and then we never take the XDP path at all.

My initial crappy hacks are slowly turning into something that I might
actually want to commit to mainline (once I've fixed endianness and
memory ordering issues):
https://gitlab.com/openconnect/openconnect/-/compare/master...vhost

I have a couple of remaining problems using vhost-net directly from
userspace though.

Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
opening it. So my model of "open the tun device, then *see* if we can
use vhost to accelerate it" doesn't work.

I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
instead, but that gives me a weird failure mode where it drops around
half the incoming packets, and I haven't yet worked out why.

Of course I don't *actually* want a vnet header at all but the vhost
code really assumes that *someone* will add one; if I *don't* set
VHOST_NET_F_VIRTIO_NET_HDR then it always *assumes* it can read ten
bytes more from the tun socket than the 'peek' says, and barfs when it
can't. (Or such was my initial half-thought-through diagnosis before I
made it go away by setting IFF_VNET_HDR, at least).


Secondly, I need to pull numbers out of my posterior for the
VHOST_SET_MEM_TABLE call. This works for x86_64:

	vmem->nregions = 1;
	vmem->regions[0].guest_phys_addr = 4096;
	vmem->regions[0].memory_size = 0x7fffffffe000;
	vmem->regions[0].userspace_addr = 4096;
	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

Is there a way to bypass that and just unconditionally set a 1:1
mapping of *all* userspace address space? 



It's possible that one or the other of those problems will result in a
new advertised "feature" which is so simple (like a 1:1 map) that we
can call it a bugfix and backport it along with the tun fix I already
posted, and the presence of *that* can indicate that the tun bug is
fixed :)



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-21 10:52   ` David Woodhouse
@ 2021-06-21 14:50     ` David Woodhouse
  2021-06-21 20:43       ` David Woodhouse
  2021-06-22  4:34       ` Jason Wang
  2021-06-22  4:34     ` Jason Wang
  1 sibling, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-21 14:50 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:
> 
> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
> opening it. So my model of "open the tun device, then *see* if we can
> use vhost to accelerate it" doesn't work.
> 
> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
> instead, but that gives me a weird failure mode where it drops around
> half the incoming packets, and I haven't yet worked out why.

FWIW that problem also goes away if I set TUNSETSNDBUF and avoid the XDP
data path.



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-21 14:50     ` David Woodhouse
@ 2021-06-21 20:43       ` David Woodhouse
  2021-06-22  4:52         ` Jason Wang
  2021-06-22  4:34       ` Jason Wang
  1 sibling, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-21 20:43 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Mon, 2021-06-21 at 15:50 +0100, David Woodhouse wrote:
> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:
> > 
> > Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
> > opening it. So my model of "open the tun device, then *see* if we can
> > use vhost to accelerate it" doesn't work.
> > 
> > I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
> > instead, but that gives me a weird failure mode where it drops around
> > half the incoming packets, and I haven't yet worked out why.
> 
> FWIW that problem also goes away if I set TUNSETSNDBUF and avoid the XDP
> data path.

Looks like there are two problems there.

Firstly, vhost_net_build_xdp() doesn't cope well with sock_hlen being
zero. It reads those zero bytes into its buffer, then points 'gso' at
the buffer with no valid data in it, and checks gso->flags for the
NEEDS_CSUM flag.

Secondly, tun_xdp_one() doesn't cope with receiving packets without the
virtio header either. While tun_get_user() correctly checks
IFF_VNET_HDR, tun_xdp_one() does not, and treats the start of my IP
packets as if they were a virtio_net_hdr.

I'll look at turning my code into a test case for kernel selftests.





* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-21 10:52   ` David Woodhouse
  2021-06-21 14:50     ` David Woodhouse
@ 2021-06-22  4:34     ` Jason Wang
  2021-06-22  7:28       ` David Woodhouse
  1 sibling, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-22  4:34 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/21 6:52 PM, David Woodhouse wrote:
> On Mon, 2021-06-21 at 15:00 +0800, Jason Wang wrote:
>> I think it's probably too late to fix, since it should work before
>> 043d222f93ab.
>>
>> The only way is to backport this fix to stable.
> Yeah, I assumed the fix would be backported; if not then the "does the
> kernel have it" check is fairly trivial.
>
> I *can* avoid it for now by just using TUNSETSNDBUF to reduce the sndbuf,
> and then we never take the XDP path at all.
>
> My initial crappy hacks are slowly turning into something that I might
> actually want to commit to mainline (once I've fixed endianness and
> memory ordering issues):
> https://gitlab.com/openconnect/openconnect/-/compare/master...vhost
>
> I have a couple of remaining problems using vhost-net directly from
> userspace though.
>
> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
> opening it. So my model of "open the tun device, then *see* if we can
> use vhost to accelerate it" doesn't work.


Yes, IFF_VNET_HDR is set during TUNSETIFF and can't be changed
afterwards.


>
> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
> instead, but that gives me a weird failure mode where it drops around
> half the incoming packets, and I haven't yet worked out why.
>
> Of course I don't *actually* want a vnet header at all but the vhost
> code really assumes that *someone* will add one; if I *don't* set
> VHOST_NET_F_VIRTIO_NET_HDR then it always *assumes* it can read ten
> bytes more from the tun socket than the 'peek' says, and barfs when it
> can't. (Or such was my initial half-thought-through diagnosis before I
> made it go away by setting IFF_VNET_HDR, at least).


Yes, vhost always assumes there's a vnet header.


>
>
> Secondly, I need to pull numbers out of my posterior for the
> VHOST_SET_MEM_TABLE call. This works for x86_64:
>
> 	vmem->nregions = 1;
> 	vmem->regions[0].guest_phys_addr = 4096;
> 	vmem->regions[0].memory_size = 0x7fffffffe000;
> 	vmem->regions[0].userspace_addr = 4096;
> 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
>
> Is there a way to bypass that and just unconditionally set a 1:1
> mapping of *all* userspace address space?


The memory table is one of the basic abstractions of vhost. Basically,
you only need to map the userspace buffers; this is how the DPDK
virtio-user PMD does it. Vhost will validate the addresses through
access_ok() during VHOST_SET_MEM_TABLE.

The range of all userspace address space seems architecture-specific;
I'm not sure it's worth bothering with.

Thanks


>
>
>
> It's possible that one or the other of those problems will result in a
> new advertised "feature" which is so simple (like a 1:1 map) that we
> can call it a bugfix and backport it along with the tun fix I already
> posted, and the presence of *that* can indicate that the tun bug is
> fixed :)



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-21 14:50     ` David Woodhouse
  2021-06-21 20:43       ` David Woodhouse
@ 2021-06-22  4:34       ` Jason Wang
  1 sibling, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-22  4:34 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/21 10:50 PM, David Woodhouse wrote:
> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:
>> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
>> opening it. So my model of "open the tun device, then *see* if we can
>> use vhost to accelerate it" doesn't work.
>>
>> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
>> instead, but that gives me a weird failure mode where it drops around
>> half the incoming packets, and I haven't yet worked out why.
> FWIW that problem also goes away if I set TUNSETSNDBUF and avoid the XDP
> data path.


That looks like a workaround.

Thanks




* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-21 20:43       ` David Woodhouse
@ 2021-06-22  4:52         ` Jason Wang
  2021-06-22  7:24           ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-22  4:52 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez



On 2021/6/22 4:43 AM, David Woodhouse wrote:
> On Mon, 2021-06-21 at 15:50 +0100, David Woodhouse wrote:
>> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:
>>> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
>>> opening it. So my model of "open the tun device, then *see* if we can
>>> use vhost to accelerate it" doesn't work.
>>>
>>> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
>>> instead, but that gives me a weird failure mode where it drops around
>>> half the incoming packets, and I haven't yet worked out why.
>> FWIW that problem also goes away if I set TUNSETSNDBUF and avoid the XDP
>> data path.
> Looks like there are two problems there.
>
> Firstly, vhost_net_build_xdp() doesn't cope well with sock_hlen being
> zero. It reads those zero bytes into its buffer, then points 'gso' at
> the buffer with no valid data in it, and checks gso->flags for the
> NEEDS_CSUM flag.
>
> Secondly, tun_xdp_one() doesn't cope with receiving packets without the
> virtio header either. While tun_get_user() correctly checks
> IFF_VNET_HDR, tun_xdp_one() does not, and treats the start of my IP
> packets as if they were a virtio_net_hdr.
>
> I'll look at turning my code into a test case for kernel selftests.


I cooked two patches. Please see and check whether they fix the problem
(compile-tested only on my side).

Thanks


>
>

[-- Attachment #2: 0001-vhost_net-validate-gso-metadata-only-if-socket-has-v.patch --]
[-- Type: text/plain, Size: 1004 bytes --]

From 5d785027da87e40138407d76841f5cdd64914541 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Tue, 22 Jun 2021 12:07:59 +0800
Subject: [PATCH 1/2] vhost_net: validate gso metadata only if socket has vnet
 header

When sock_hlen is zero, there's no need to validate the socket vnet
header, since the socket doesn't carry one.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..5034c4949bc4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -725,7 +725,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	hdr = buf;
 	gso = &hdr->gso;
 
-	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+	if (nvq->sock_hlen && (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
 	    vhost16_to_cpu(vq, gso->csum_start) +
 	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
 	    vhost16_to_cpu(vq, gso->hdr_len)) {
-- 
2.25.1


[-- Attachment #3: 0002-tun-use-vnet-header-only-when-IFF_VNET_HDR-in-tun_xd.patch --]
[-- Type: text/plain, Size: 1724 bytes --]

From d5f36f73be05ee425eca73b63b37d5a20b22837a Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Tue, 22 Jun 2021 12:34:58 +0800
Subject: [PATCH 2/2] tun: use vnet header only when IFF_VNET_HDR in
 tun_xdp_one()

We should not try to read a vnet header from the XDP buffer if
IFF_VNET_HDR is not set; otherwise we break the semantics of
IFF_VNET_HDR.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 84f832806313..70c4bb22ef78 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2334,18 +2334,22 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr gso = { 0 };
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
 	u32 rxhash = 0, act;
 	int buflen = hdr->buflen;
+	bool vnet_hdr = tun->flags & IFF_VNET_HDR;
 	int err = 0;
 	bool skb_xdp = false;
 	struct page *page;
 
+	if (vnet_hdr)
+		memcpy(&gso, &hdr->gso, sizeof(gso));
+
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
-		if (gso->gso_type) {
+		if (gso.gso_type) {
 			skb_xdp = true;
 			goto build;
 		}
@@ -2391,7 +2395,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
 
-	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+	if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
 		kfree_skb(skb);
 		err = -EINVAL;
-- 
2.25.1



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  4:52         ` Jason Wang
@ 2021-06-22  7:24           ` David Woodhouse
  2021-06-22  7:51             ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-22  7:24 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote:
> 
> 
> I cooked two patches. Please see and check whether they fix the problem
> (compile-tested only on my side).

I did the second one slightly differently (below), but those match what I
came up with too, and it seems to be working.

@@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
        unsigned int datasize = xdp->data_end - xdp->data;
        struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-       struct virtio_net_hdr *gso = &hdr->gso;
+       struct virtio_net_hdr *gso = NULL;
        struct bpf_prog *xdp_prog;
        struct sk_buff *skb = NULL;
        u32 rxhash = 0, act;
@@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun,
        bool skb_xdp = false;
        struct page *page;
 
+       if (tun->flags & IFF_VNET_HDR)
+               gso = &hdr->gso;
+
        xdp_prog = rcu_dereference(tun->xdp_prog);
        if (xdp_prog) {
-               if (gso->gso_type) {
+               if (gso && gso->gso_type) {
                        skb_xdp = true;
                        goto build;
                }
@@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,
        skb_reserve(skb, xdp->data - xdp->data_hard_start);
        skb_put(skb, xdp->data_end - xdp->data);
 
-       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+       if (!gso)
+               skb_reset_mac_header(skb);
+       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
                atomic_long_inc(&tun->rx_frame_errors);
                kfree_skb(skb);



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  4:34     ` Jason Wang
@ 2021-06-22  7:28       ` David Woodhouse
  2021-06-22  8:00         ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-22  7:28 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Tue, 2021-06-22 at 12:34 +0800, Jason Wang wrote:
> > 
> > Secondly, I need to pull numbers out of my posterior for the
> > VHOST_SET_MEM_TABLE call. This works for x86_64:
> > 
> >        vmem->nregions = 1;
> >        vmem->regions[0].guest_phys_addr = 4096;
> >        vmem->regions[0].memory_size = 0x7fffffffe000;
> >        vmem->regions[0].userspace_addr = 4096;
> >        if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
> > 
> > Is there a way to bypass that and just unconditionally set a 1:1
> > mapping of *all* userspace address space?
> 
> 
> The memory table is one of the basic abstractions of vhost. Basically,
> you only need to map the userspace buffers; this is how the DPDK
> virtio-user PMD does it. Vhost will validate the addresses through
> access_ok() during VHOST_SET_MEM_TABLE.
> 
> The range of all userspace address space seems architecture-specific;
> I'm not sure it's worth bothering with.

The buffers are just malloc'd. I just need a full 1:1 mapping of all
"guest" memory to userspace addresses, and was trying to avoid having
to map them on demand *just* because I don't know the full range of
possible addresses that malloc will return, in advance.

I'm tempted to add a new feature for that 1:1 access, with no ->umem or
->iotlb at all. And then I can use it as a key to know that the XDP
bugs are fixed too :)




* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  7:24           ` David Woodhouse
@ 2021-06-22  7:51             ` Jason Wang
  2021-06-22  8:10               ` David Woodhouse
  2021-06-22 11:36               ` David Woodhouse
  0 siblings, 2 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-22  7:51 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/22 3:24 PM, David Woodhouse wrote:
> On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote:
>>
>> I cooked two patches. Please see and check whether they fix the problem
>> (compile-tested only on my side).
> I did the second one slightly differently (below), but those match what I
> came up with too, and it seems to be working.
>
> @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   {
>          unsigned int datasize = xdp->data_end - xdp->data;
>          struct tun_xdp_hdr *hdr = xdp->data_hard_start;
> -       struct virtio_net_hdr *gso = &hdr->gso;
> +       struct virtio_net_hdr *gso = NULL;
>          struct bpf_prog *xdp_prog;
>          struct sk_buff *skb = NULL;
>          u32 rxhash = 0, act;
> @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun,
>          bool skb_xdp = false;
>          struct page *page;
>   
> +       if (tun->flags & IFF_VNET_HDR)
> +               gso = &hdr->gso;
> +
>          xdp_prog = rcu_dereference(tun->xdp_prog);
>          if (xdp_prog) {
> -               if (gso->gso_type) {
> +               if (gso && gso->gso_type) {
>                          skb_xdp = true;
>                          goto build;
>                  }
> @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,
>          skb_reserve(skb, xdp->data - xdp->data_hard_start);
>          skb_put(skb, xdp->data_end - xdp->data);
>   
> -       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
> +       if (!gso)
> +               skb_reset_mac_header(skb);
> +       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
>                  atomic_long_inc(&tun->rx_frame_errors);
>                  kfree_skb(skb);


This should work as well.

Thanks



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  7:28       ` David Woodhouse
@ 2021-06-22  8:00         ` Jason Wang
  2021-06-22  8:29           ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-22  8:00 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/22 3:28 PM, David Woodhouse wrote:
> On Tue, 2021-06-22 at 12:34 +0800, Jason Wang wrote:
>>> Secondly, I need to pull numbers out of my posterior for the
>>> VHOST_SET_MEM_TABLE call. This works for x86_64:
>>>
>>>         vmem->nregions = 1;
>>>         vmem->regions[0].guest_phys_addr = 4096;
>>>         vmem->regions[0].memory_size = 0x7fffffffe000;
>>>         vmem->regions[0].userspace_addr = 4096;
>>>         if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
>>>
>>> Is there a way to bypass that and just unconditionally set a 1:1
>>> mapping of *all* userspace address space?
>>
>> The memory table is one of the basic abstractions of vhost. Basically,
>> you only need to map the userspace buffers; this is how the DPDK
>> virtio-user PMD does it. Vhost will validate the addresses through
>> access_ok() during VHOST_SET_MEM_TABLE.
>>
>> The range of all userspace address space seems architecture-specific;
>> I'm not sure it's worth bothering with.
> The buffers are just malloc'd. I just need a full 1:1 mapping of all
> "guest" memory to userspace addresses, and was trying to avoid having
> to map them on demand *just* because I don't know the full range of
> possible addresses that malloc will return, in advance.
>
> I'm tempted to add a new feature for that 1:1 access, with no ->umem or
> ->iotlb at all. And then I can use it as a key to know that the XDP
> bugs are fixed too :)


This means we need to validate the userspace address each time before
vhost tries to use it. That will degrade performance. So we still need
to figure out the legal userspace address range, which might not be
easy.

Thanks


>



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  7:51             ` Jason Wang
@ 2021-06-22  8:10               ` David Woodhouse
  2021-06-22 11:36               ` David Woodhouse
  1 sibling, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-22  8:10 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez



On 22 June 2021 08:51:43 BST, Jason Wang <jasowang@redhat.com> wrote:
>
>On 2021/6/22 3:24 PM, David Woodhouse wrote:
>> On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote:
>>>
>>> I cooked two patches. Please see and check whether they fix the problem
>>> (compile-tested only on my side).
>> I did the second one slightly differently (below), but those match what I
>> came up with too, and it seems to be working.
>>
>> @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>   {
>>          unsigned int datasize = xdp->data_end - xdp->data;
>>          struct tun_xdp_hdr *hdr = xdp->data_hard_start;
>> -       struct virtio_net_hdr *gso = &hdr->gso;
>> +       struct virtio_net_hdr *gso = NULL;
>>          struct bpf_prog *xdp_prog;
>>          struct sk_buff *skb = NULL;
>>          u32 rxhash = 0, act;
>> @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun,
>>          bool skb_xdp = false;
>>          struct page *page;
>>   
>> +       if (tun->flags & IFF_VNET_HDR)
>> +               gso = &hdr->gso;
>> +
>>          xdp_prog = rcu_dereference(tun->xdp_prog);
>>          if (xdp_prog) {
>> -               if (gso->gso_type) {
>> +               if (gso && gso->gso_type) {
>>                          skb_xdp = true;
>>                          goto build;
>>                  }
>> @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,
>>          skb_reserve(skb, xdp->data - xdp->data_hard_start);
>>          skb_put(skb, xdp->data_end - xdp->data);
>>   
>> -       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
>> +       if (!gso)
>> +               skb_reset_mac_header(skb);
>> +       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
>>                  atomic_long_inc(&tun->rx_frame_errors);
>>                  kfree_skb(skb);
>
>
>This should work as well.

I'll rip out the rest of my debugging hacks and check that these two are sufficient, and that I hadn't accidentally papered over something else as I debugged it.

Then I'll look at the case of having no virtio_net_hdr on either side, and also different-sized headers on the tun device (why does it even support that?).

And the test case, of course. 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  8:00         ` Jason Wang
@ 2021-06-22  8:29           ` David Woodhouse
  2021-06-23  3:39             ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-22  8:29 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Tue, 2021-06-22 at 16:00 +0800, Jason Wang wrote:
> > I'm tempted to add a new feature for that 1:1 access, with no ->umem or
> > ->iotlb at all. And then I can use it as a key to know that the XDP
> > bugs are fixed too :)
> 
> 
> This means we need to validate the userspace address each time before
> vhost tries to use it. This will degrade performance. So we still need
> to figure out the legal userspace address range, which might not be
> easy.

Easier from the kernel than from userspace though :)

But I don't know that we need it. Isn't a call to access_ok() going to
be faster than what translate_desc() does to look things up anyway?

In the 1:1 mode, the access_ok() is all that's needed since there's no
translation.

@@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
        u64 s = 0;
        int ret = 0;
 
+       if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) {
+               if (!access_ok((void __user *)addr, len))
+                       return -EFAULT;
+
+               iov[0].iov_len = len;
+               iov[0].iov_base = (void __user *)addr;
+               return 1;
+       }
        while ((u64)len > s) {
                u64 size;
                if (unlikely(ret >= iov_size)) {



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  7:51             ` Jason Wang
  2021-06-22  8:10               ` David Woodhouse
@ 2021-06-22 11:36               ` David Woodhouse
  1 sibling, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-22 11:36 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Tue, 2021-06-22 at 15:51 +0800, Jason Wang wrote:
> > @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,
> >           skb_reserve(skb, xdp->data - xdp->data_hard_start);
> >           skb_put(skb, xdp->data_end - xdp->data);
> >    
> > -       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
> > +       if (!gso)
> > +               skb_reset_mac_header(skb);
> > +       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
> >                   atomic_long_inc(&tun->rx_frame_errors);
> >                   kfree_skb(skb);
> 
> 
> This should work as well.

Actually there's no need for the skb_reset_mac_header() as that's going
to happen anyway just a few lines down — either from eth_type_trans()
or explicitly in the IFF_TUN code path that I added in my first patch.

I also stripped out a lot more code from vhost_net_build_xdp() in the
!sock_hlen case where it was pointless.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/vhost-net

I'll repost the whole series after I take look at whether I can get it
to work without *either* side doing the vnet header, and add a test
case based on my userspace code in 
https://gitlab.com/openconnect/openconnect/-/blob/vhost/vhost.c



* [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-19 13:33 [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
  2021-06-21  7:00 ` Jason Wang
@ 2021-06-22 16:15 ` David Woodhouse
  2021-06-22 16:15   ` [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
                     ` (3 more replies)
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
  2 siblings, 4 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-22 16:15 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez

From: David Woodhouse <dwmw@amazon.co.uk>

In tun_get_user(), skb->protocol is either taken from the tun_pi header
or inferred from the first byte of the packet in IFF_TUN mode, while
eth_type_trans() is called only in the IFF_TAP mode where the payload
is expected to be an Ethernet frame.

The alternative path in tun_xdp_one() was unconditionally using
eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to
do the correct thing for IFF_TUN mode, as tun_get_user() does.

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4cf38be26dc9..f812dcdc640e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun,
 		err = -EINVAL;
 		goto out;
 	}
+	switch (tun->flags & TUN_TYPE_MASK) {
+	case IFF_TUN:
+		if (tun->flags & IFF_NO_PI) {
+			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
+
+			switch (ip_version) {
+			case 4:
+				skb->protocol = htons(ETH_P_IP);
+				break;
+			case 6:
+				skb->protocol = htons(ETH_P_IPV6);
+				break;
+			default:
+				atomic_long_inc(&tun->dev->rx_dropped);
+				kfree_skb(skb);
+				err = -EINVAL;
+				goto out;
+			}
+		} else {
+			struct tun_pi *pi = (struct tun_pi *)skb->data;
+			if (!pskb_may_pull(skb, sizeof(*pi))) {
+				atomic_long_inc(&tun->dev->rx_dropped);
+				kfree_skb(skb);
+				err = -ENOMEM;
+				goto out;
+			}
+			skb_pull_inline(skb, sizeof(*pi));
+			skb->protocol = pi->proto;
+		}
+
+		skb_reset_mac_header(skb);
+		skb->dev = tun->dev;
+		break;
+	case IFF_TAP:
+		if (!pskb_may_pull(skb, ETH_HLEN)) {
+			atomic_long_inc(&tun->dev->rx_dropped);
+			kfree_skb(skb);
+			err = -ENOMEM;
+			goto out;
+		}
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
 
-	skb->protocol = eth_type_trans(skb, tun->dev);
 	skb_reset_network_header(skb);
 	skb_probe_transport_header(skb);
 	skb_record_rx_queue(skb, tfile->queue_index);
-- 
2.31.1
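As an aside, the protocol-inference rule this patch applies in the IFF_TUN/IFF_NO_PI case can be modelled in isolation. The following is an illustrative userspace sketch, not kernel code; the constants simply mirror the values of ETH_P_IP and ETH_P_IPV6:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_IP   0x0800	/* value of ETH_P_IP */
#define SKETCH_ETH_P_IPV6 0x86DD	/* value of ETH_P_IPV6 */

/* Mimic the IFF_TUN + IFF_NO_PI path: look at the IP version nibble
 * in the first byte of the packet and map it to an ethertype.
 * Returns 0 for anything the kernel would count as rx_dropped. */
static uint16_t tun_infer_proto(const uint8_t *data, size_t len)
{
	uint8_t ip_version = len ? (data[0] >> 4) : 0;

	switch (ip_version) {
	case 4:
		return SKETCH_ETH_P_IP;
	case 6:
		return SKETCH_ETH_P_IPV6;
	default:
		return 0;	/* dropped in the real code */
	}
}
```

With the tun_pi header present, none of this applies: the protocol comes straight from pi->proto instead.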



* [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path
  2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
@ 2021-06-22 16:15   ` David Woodhouse
  2021-06-23  3:46     ` Jason Wang
  2021-06-22 16:15   ` [PATCH v2 3/4] vhost_net: validate virtio_net_hdr only if it exists David Woodhouse
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-22 16:15 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez

From: David Woodhouse <dwmw@amazon.co.uk>

Sometimes it's just a data packet. The virtio_net_hdr processing should be
conditional on IFF_VNET_HDR, just as it is in tun_get_user().

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tun.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index f812dcdc640e..96933887d03d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2331,7 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
 	u32 rxhash = 0, act;
@@ -2340,9 +2340,12 @@ static int tun_xdp_one(struct tun_struct *tun,
 	bool skb_xdp = false;
 	struct page *page;
 
+	if (tun->flags & IFF_VNET_HDR)
+		gso = &hdr->gso;
+
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
-		if (gso->gso_type) {
+		if (gso && gso->gso_type) {
 			skb_xdp = true;
 			goto build;
 		}
@@ -2388,7 +2391,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
 
-	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
 		kfree_skb(skb);
 		err = -EINVAL;
-- 
2.31.1



* [PATCH v2 3/4] vhost_net: validate virtio_net_hdr only if it exists
  2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
  2021-06-22 16:15   ` [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
@ 2021-06-22 16:15   ` David Woodhouse
  2021-06-23  3:48     ` Jason Wang
  2021-06-22 16:15   ` [PATCH v2 4/4] vhost_net: Add self test with tun device David Woodhouse
  2021-06-23  3:45   ` [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode Jason Wang
  3 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-22 16:15 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez

From: David Woodhouse <dwmw@amazon.co.uk>

When the underlying socket doesn't handle the virtio_net_hdr, the
existing code in vhost_net_build_xdp() would attempt to validate stack
noise: it copied zero bytes into the local copy of the header and then
validated the uninitialized contents. Skip the pointless pointer
arithmetic and zero-byte partial copy in this case.

Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/vhost/net.c | 43 ++++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..1e3652eb53af 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -690,7 +690,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
 	struct page_frag *alloc_frag = &net->page_frag;
-	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
 	size_t len = iov_iter_count(from);
@@ -715,29 +714,31 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOMEM;
 
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
-
 	hdr = buf;
-	gso = &hdr->gso;
-
-	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-	    vhost16_to_cpu(vq, gso->csum_start) +
-	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
-	    vhost16_to_cpu(vq, gso->hdr_len)) {
-		gso->hdr_len = cpu_to_vhost16(vq,
-			       vhost16_to_cpu(vq, gso->csum_start) +
-			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
-		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-			return -EINVAL;
+	if (sock_hlen) {
+		struct virtio_net_hdr *gso = &hdr->gso;
+
+		copied = copy_page_from_iter(alloc_frag->page,
+					     alloc_frag->offset +
+					     offsetof(struct tun_xdp_hdr, gso),
+					     sock_hlen, from);
+		if (copied != sock_hlen)
+			return -EFAULT;
+
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		    vhost16_to_cpu(vq, gso->csum_start) +
+		    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
+		    vhost16_to_cpu(vq, gso->hdr_len)) {
+			gso->hdr_len = cpu_to_vhost16(vq,
+						      vhost16_to_cpu(vq, gso->csum_start) +
+						      vhost16_to_cpu(vq, gso->csum_offset) + 2);
+
+			if (vhost16_to_cpu(vq, gso->hdr_len) > len)
+				return -EINVAL;
+		}
+		len -= sock_hlen;
 	}
 
-	len -= sock_hlen;
 	copied = copy_page_from_iter(alloc_frag->page,
 				     alloc_frag->offset + pad,
 				     len, from);
-- 
2.31.1



* [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
  2021-06-22 16:15   ` [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
  2021-06-22 16:15   ` [PATCH v2 3/4] vhost_net: validate virtio_net_hdr only if it exists David Woodhouse
@ 2021-06-22 16:15   ` David Woodhouse
  2021-06-23  4:02     ` Jason Wang
  2021-06-23  3:45   ` [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode Jason Wang
  3 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-22 16:15 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez

From: David Woodhouse <dwmw@amazon.co.uk>

This creates a tun device and brings it up, then finds out the link-local
address the kernel automatically assigns to it.

It sends a ping to that address, from a fake LL address of its own, and
then waits for a response.

If the virtio_net_hdr stuff is all working correctly, it gets a response
and manages to understand it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/vhost/Makefile        |  16 +
 tools/testing/selftests/vhost/config          |   2 +
 .../testing/selftests/vhost/test_vhost_net.c  | 522 ++++++++++++++++++
 4 files changed, 541 insertions(+)
 create mode 100644 tools/testing/selftests/vhost/Makefile
 create mode 100644 tools/testing/selftests/vhost/config
 create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..300c03cfd0c7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@ TARGETS += user
 TARGETS += vDSO
+TARGETS += vhost
 TARGETS += vm
 TARGETS += x86
 TARGETS += zram
 #Please keep the TARGETS list alphabetically sorted
 # Run "make quicktest=1 run_tests" or
diff --git a/tools/testing/selftests/vhost/Makefile b/tools/testing/selftests/vhost/Makefile
new file mode 100644
index 000000000000..f5e565d80733
--- /dev/null
+++ b/tools/testing/selftests/vhost/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+include ../lib.mk
+
+.PHONY: all clean
+
+BINARIES := test_vhost_net
+
+test_vhost_net: test_vhost_net.c ../kselftest.h ../kselftest_harness.h
+	$(CC) $(CFLAGS) -g $< -o $@
+
+TEST_PROGS += $(BINARIES)
+EXTRA_CLEAN := $(BINARIES)
+
+all: $(BINARIES)
diff --git a/tools/testing/selftests/vhost/config b/tools/testing/selftests/vhost/config
new file mode 100644
index 000000000000..6391c1f32c34
--- /dev/null
+++ b/tools/testing/selftests/vhost/config
@@ -0,0 +1,2 @@
+CONFIG_VHOST_NET=y
+CONFIG_TUN=y
diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
new file mode 100644
index 000000000000..14acf2c0e049
--- /dev/null
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -0,0 +1,522 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+#include "../kselftest_harness.h"
+#include "../../../virtio/asm/barrier.h"
+
+#include <sys/eventfd.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <net/if.h>
+#include <sys/socket.h>
+
+#include <netinet/tcp.h>
+#include <netinet/ip.h>
+#include <netinet/ip_icmp.h>
+#include <netinet/ip6.h>
+#include <netinet/icmp6.h>
+
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+
+static unsigned char hexnybble(char hex)
+{
+	switch (hex) {
+	case '0'...'9':
+		return hex - '0';
+	case 'a'...'f':
+		return 10 + hex - 'a';
+	case 'A'...'F':
+		return 10 + hex - 'A';
+	default:
+		exit (KSFT_SKIP);
+	}
+}
+
+static unsigned char hexchar(char *hex)
+{
+	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
+}
+
+int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+{
+	int tun_fd = open("/dev/net/tun", O_RDWR);
+	if (tun_fd == -1)
+		return -1;
+
+	struct ifreq ifr = { 0 };
+
+	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	if (vnet_hdr_sz)
+		ifr.ifr_flags |= IFF_VNET_HDR;
+
+	if (ioctl(tun_fd, TUNSETIFF, (void *)&ifr) < 0)
+		goto out_tun;
+
+	if (vnet_hdr_sz &&
+	    ioctl(tun_fd, TUNSETVNETHDRSZ, &vnet_hdr_sz) < 0)
+		goto out_tun;
+
+	int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP);
+	if (sockfd == -1)
+		goto out_tun;
+
+	if (ioctl(sockfd, SIOCGIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	ifr.ifr_flags |= IFF_UP;
+	if (ioctl(sockfd, SIOCSIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	close(sockfd);
+
+	FILE *inet6 = fopen("/proc/net/if_inet6", "r");
+	if (!inet6)
+		goto out_tun;
+
+	char buf[80];
+	while (fgets(buf, sizeof(buf), inet6)) {
+		size_t len = strlen(buf), namelen = strlen(ifr.ifr_name);
+		if (!strncmp(buf, "fe80", 4) &&
+		    buf[len - namelen - 2] == ' ' &&
+		    !strncmp(buf + len - namelen - 1, ifr.ifr_name, namelen)) {
+			for (int i = 0; i < 16; i++) {
+				addr->s6_addr[i] = hexchar(buf + i*2);
+			}
+			fclose(inet6);
+			return tun_fd;
+		}
+	}
+	/* Not found */
+	fclose(inet6);
+ out_sock:
+	close(sockfd);
+ out_tun:
+	close(tun_fd);
+	return -1;
+}
+
+#define RING_SIZE 32
+#define RING_MASK(x) ((x) & (RING_SIZE-1))
+
+struct pkt_buf {
+	unsigned char data[2048];
+};
+
+struct test_vring {
+	struct vring_desc desc[RING_SIZE];
+	struct vring_avail avail;
+	__virtio16 avail_ring[RING_SIZE];
+	struct vring_used used;
+	struct vring_used_elem used_ring[RING_SIZE];
+	struct pkt_buf pkts[RING_SIZE];
+} rings[2];
+
+static int setup_vring(int vhost_fd, int tun_fd, int call_fd, int kick_fd, int idx)
+{
+	struct test_vring *vring = &rings[idx];
+	int ret;
+
+	memset(vring, 0, sizeof(*vring));
+
+	struct vhost_vring_state vs = { };
+	vs.index = idx;
+	vs.num = RING_SIZE;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &vs) < 0) {
+		perror("VHOST_SET_VRING_NUM");
+		return -1;
+	}
+
+	vs.num = 0;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_BASE, &vs) < 0) {
+		perror("VHOST_SET_VRING_BASE");
+		return -1;
+	}
+
+	struct vhost_vring_addr va = { };
+	va.index = idx;
+	va.desc_user_addr = (uint64_t)vring->desc;
+	va.avail_user_addr = (uint64_t)&vring->avail;
+	va.used_user_addr  = (uint64_t)&vring->used;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &va) < 0) {
+		perror("VHOST_SET_VRING_ADDR");
+		return -1;
+	}
+
+	struct vhost_vring_file vf = { };
+	vf.index = idx;
+	vf.fd = tun_fd;
+	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf) < 0) {
+		perror("VHOST_NET_SET_BACKEND");
+		return -1;
+	}
+
+	vf.fd = call_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &vf) < 0) {
+		perror("VHOST_SET_VRING_CALL");
+		return -1;
+	}
+
+	vf.fd = kick_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &vf) < 0) {
+		perror("VHOST_SET_VRING_KICK");
+		return -1;
+	}
+
+	return 0;
+}
+
+int setup_vhost(int vhost_fd, int tun_fd, int call_fd, int kick_fd, uint64_t want_features)
+{
+	int ret;
+
+	if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) {
+		perror("VHOST_SET_OWNER");
+		return -1;
+	}
+
+	uint64_t features;
+	if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) {
+		perror("VHOST_GET_FEATURES");
+		return -1;
+	}
+
+	if ((features & want_features) != want_features)
+		return KSFT_SKIP;
+
+	if (ioctl(vhost_fd, VHOST_SET_FEATURES, &want_features) < 0) {
+		perror("VHOST_SET_FEATURES");
+		return -1;
+	}
+
+	struct vhost_memory *vmem = alloca(sizeof(*vmem) + sizeof(vmem->regions[0]));
+
+	memset(vmem, 0, sizeof(*vmem) + sizeof(vmem->regions[0]));
+	vmem->nregions = 1;
+	/*
+	 * I just want to map the *whole* of userspace address space. But
+	 * from userspace I don't know what that is. On x86_64 it would be:
+	 *
+	 * vmem->regions[0].guest_phys_addr = 4096;
+	 * vmem->regions[0].memory_size = 0x7fffffffe000;
+	 * vmem->regions[0].userspace_addr = 4096;
+	 *
+	 * For now, just ensure we put everything inside a single BSS region.
+	 */
+	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
+	vmem->regions[0].userspace_addr = (uint64_t)&rings;
+	vmem->regions[0].memory_size = sizeof(rings);
+
+	if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
+		perror("VHOST_SET_MEM_TABLE");
+		return -1;
+	}
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 0))
+		return -1;
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 1))
+		return -1;
+
+	return 0;
+}
+
+
+static char ping_payload[16] = "VHOST TEST PACKT";
+
+static inline uint32_t csum_partial(uint16_t *buf, int nwords)
+{
+	uint32_t sum = 0;
+	for (sum = 0; nwords > 0; nwords--)
+		sum += ntohs(*buf++);
+	return sum;
+}
+
+static inline uint16_t csum_finish(uint32_t sum)
+{
+	sum = (sum >> 16) + (sum & 0xffff);
+	sum += (sum >> 16);
+	return htons((uint16_t)(~sum));
+}
+
+static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+			    struct in6_addr *src, uint16_t id, uint16_t seq)
+{
+	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
+	const int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	struct ip6_hdr *iph = (void *)data;
+	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
+
+	/* IPv6 Header */
+	iph->ip6_flow = htonl((6 << 28) + /* version 6 */
+			      (0 << 20) + /* traffic class */
+			      (0 << 0));  /* flow ID  */
+	iph->ip6_nxt = IPPROTO_ICMPV6;
+	iph->ip6_plen = htons(icmplen);
+	iph->ip6_hlim = 128;
+	iph->ip6_src = *src;
+	iph->ip6_dst = *dst;
+
+	/* ICMPv6 echo request */
+	icmph->icmp6_type = ICMP6_ECHO_REQUEST;
+	icmph->icmp6_code = 0;
+	icmph->icmp6_data16[0] = htons(id);	/* ID */
+	icmph->icmp6_data16[1] = htons(seq);	/* sequence */
+
+	/* Some arbitrary payload */
+	memcpy(&icmph[1], ping_payload, sizeof(ping_payload));
+
+	/*
+	 * IPv6 upper-layer checksums include a pseudo-header
+	 * for IPv6 which contains the source address, the
+	 * destination address, the upper-layer packet length
+	 * and next-header field. See RFC8200 §8.1. The
+	 * checksum is as follows:
+	 *
+	 *   checksum 32 bytes of real IPv6 header:
+	 *     src addr (16 bytes)
+	 *     dst addr (16 bytes)
+	 *   8 bytes more:
+	 *     length of ICMPv6 in bytes (be32)
+	 *     3 bytes of 0
+	 *     next header byte (IPPROTO_ICMPV6)
+	 *   Then the actual ICMPv6 bytes
+	 */
+	uint32_t sum = csum_partial((uint16_t *)&iph->ip6_src, 8);      /* 8 uint16_t */
+	sum += csum_partial((uint16_t *)&iph->ip6_dst, 8);              /* 8 uint16_t */
+
+	/* The easiest way to checksum the following 8-byte
+	 * part of the pseudo-header without horridly violating
+	 * C type aliasing rules is *not* to build it in memory
+	 * at all. We know the length fits in 16 bits so the
+	 * partial checksum of 00 00 LL LL 00 00 00 NH ends up
+	 * being just LLLL + NH.
+	 */
+	sum += IPPROTO_ICMPV6;
+	sum += ICMP_MINLEN + sizeof(ping_payload);
+
+	sum += csum_partial((uint16_t *)icmph, icmplen / 2);
+	icmph->icmp6_cksum = csum_finish(sum);
+	return plen;
+}
+
+
+static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+{
+	struct ip6_hdr *iph = (void *)data;
+	return (len >= 41 && (ntohl(iph->ip6_flow) >> 28) == 6 /* IPv6 header */
+		&& iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
+		&& !memcmp(&iph->ip6_src, src, 16) /* source == magic address */
+		&& !memcmp(&iph->ip6_dst, dst, 16) /* destination == our address */
+		&& len >= 40 + ICMP_MINLEN + sizeof(ping_payload) /* No short-packet segfaults */
+		&& data[40] == ICMP6_ECHO_REPLY /* ICMPv6 reply */
+		&& !memcmp(&data[40 + ICMP_MINLEN], ping_payload, sizeof(ping_payload)) /* Same payload in response */
+		);
+
+}
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define vio16(x) (x)
+#define vio32(x) (x)
+#define vio64(x) (x)
+#else
+#define vio16(x) __builtin_bswap16(x)
+#define vio32(x) __builtin_bswap32(x)
+#define vio64(x) __builtin_bswap64(x)
+#endif
+
+
+int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+{
+	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int vhost_fd = open("/dev/vhost-net", O_RDWR);
+	int tun_fd = -1;
+	int ret = KSFT_SKIP;
+
+	if (call_fd < 0 || kick_fd < 0 || vhost_fd < 0)
+		goto err;
+
+	memset(rings, 0, sizeof(rings));
+
+	/* Pick up the link-local address that the kernel
+	 * assigns to the tun device. */
+	struct in6_addr tun_addr;
+	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	if (tun_fd < 0)
+		goto err;
+
+	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
+		if (vnet_hdr_sz) {
+			ret = -1;
+			goto err;
+		}
+
+		vnet_hdr_sz = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+					   (1ULL << VIRTIO_F_VERSION_1))) ?
+			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
+			sizeof(struct virtio_net_hdr);
+	}
+
+	if (!xdp) {
+		int sndbuf = RING_SIZE * 2048;
+		if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0) {
+			perror("TUNSETSNDBUF");
+			ret = -1;
+			goto err;
+		}
+	}
+
+	ret = setup_vhost(vhost_fd, tun_fd, call_fd, kick_fd, features);
+	if (ret)
+		goto err;
+
+	/* A fake link-local address for the userspace end */
+	struct in6_addr local_addr = { 0 };
+	local_addr.s6_addr16[0] = htons(0xfe80);
+	local_addr.s6_addr16[7] = htons(1);
+
+	/* Set up RX and TX descriptors; the latter with ping packets ready to
+	 * send to the kernel, but don't actually send them yet. */
+	for (int i = 0; i < RING_SIZE; i++) {
+		struct pkt_buf *pkt = &rings[1].pkts[i];
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
+					    &local_addr, 0x4747, i);
+
+		rings[1].desc[i].addr = vio64((uint64_t)pkt);
+		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
+		rings[1].avail_ring[i] = vio16(i);
+
+
+		pkt = &rings[0].pkts[i];
+		rings[0].desc[i].addr = vio64((uint64_t)pkt);
+		rings[0].desc[i].len = vio32(sizeof(*pkt));
+		rings[0].desc[i].flags = vio16(VRING_DESC_F_WRITE);
+		rings[0].avail_ring[i] = vio16(i);
+	}
+	barrier();
+
+	rings[0].avail.idx = vio16(RING_SIZE);
+	rings[1].avail.idx = vio16(1);
+
+	barrier();
+	eventfd_write(kick_fd, 1);
+
+	uint16_t rx_seen_used = 0;
+	struct timeval tv = { 1, 0 };
+	while (1) {
+		fd_set rfds = { 0 };
+		FD_SET(call_fd, &rfds);
+
+		if (select(call_fd + 1, &rfds, NULL, NULL, &tv) <= 0) {
+			ret = -1;
+			goto err;
+		}
+
+		uint16_t rx_used_idx = vio16(rings[0].used.idx);
+		barrier();
+
+		while(rx_used_idx != rx_seen_used) {
+			uint32_t desc = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].id);
+			uint32_t len  = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].len);
+
+			if (desc >= RING_SIZE || len < vnet_hdr_sz) {
+				ret = -1;
+				goto err;
+			}
+
+			uint64_t addr = vio64(rings[0].desc[desc].addr);
+			if (!addr) {
+				ret = -1;
+				goto err;
+			}
+
+			if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
+						&local_addr, &tun_addr)) {
+				ret = 0;
+				printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+				goto err;
+			}
+			rx_seen_used++;
+
+			/* Give the same buffer back */
+			rings[0].avail.idx = vio16(rx_seen_used + RING_SIZE);
+			barrier();
+			eventfd_write(kick_fd, 1);
+		}
+
+		uint64_t ev_val;
+		eventfd_read(call_fd, &ev_val);
+	}
+
+ err:
+	if (call_fd != -1)
+		close(call_fd);
+	if (kick_fd != -1)
+		close(kick_fd);
+	if (vhost_fd != -1)
+		close(vhost_fd);
+	if (tun_fd != -1)
+		close(tun_fd);
+
+	return ret;
+}
+
+
+int main(void)
+{
+	int ret;
+
+	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_F_VERSION_1)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_F_VERSION_1)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(10, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(10, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+#if 0 /* These ones will fail */
+	ret = test_vhost(0, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(12, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(12, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+#endif
+
+	return ret;
+}
-- 
2.31.1
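The Internet-checksum helpers in the selftest above can be sanity-checked on their own. Once the computed checksum is written into the packet, re-running the same fold over the whole buffer must yield zero. A standalone sketch of that property, using the same arithmetic as the test's csum_partial()/csum_finish():

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Same ones'-complement sum as the selftest's csum_partial() */
static uint32_t csum_partial(const uint16_t *buf, int nwords)
{
	uint32_t sum = 0;

	while (nwords-- > 0)
		sum += ntohs(*buf++);
	return sum;
}

/* Fold the carries and complement, as csum_finish() does */
static uint16_t csum_finish(uint32_t sum)
{
	sum = (sum >> 16) + (sum & 0xffff);
	sum += (sum >> 16);
	return htons((uint16_t)(~sum));
}
```

Computing the checksum with the checksum word zeroed, storing it, and then checksumming the completed buffer gives 0, which is the standard way to verify a received packet.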



* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22  8:29           ` David Woodhouse
@ 2021-06-23  3:39             ` Jason Wang
  2021-06-24 12:39               ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-23  3:39 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/22 16:29, David Woodhouse wrote:
> On Tue, 2021-06-22 at 16:00 +0800, Jason Wang wrote:
>>> I'm tempted to add a new feature for that 1:1 access, with no ->umem or
>>> ->iotlb at all. And then I can use it as a key to know that the XDP
>>> bugs are fixed too :)
>>
>> This means we need to validate the userspace address each time before
>> vhost tries to use it. This will degrade performance. So we still need
>> to figure out the legal userspace address range, which might not be
>> easy.
> Easier from the kernel than from userspace though :)


Yes.


>
> But I don't know that we need it. Isn't a call to access_ok() going to
> be faster than what translate_desc() does to look things up anyway?


Right.


>
> In the 1:1 mode, the access_ok() is all that's needed since there's no
> translation.
>
> @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>          u64 s = 0;
>          int ret = 0;
>   
> +       if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) {


Using vhost_has_feature() is kind of tricky since it's used for virtio 
feature negotiation.

We probably need to use backend_features instead.

I think we should probably do more:

1) forbid the feature from being set when the mem table / IOTLB has at 
least one mapping
2) forbid mem table / IOTLB updates after the feature is set

Thanks


> +               if (!access_ok((void __user *)addr, len))
> +                       return -EFAULT;
> +
> +               iov[0].iov_len = len;
> +               iov[0].iov_base = (void __user *)addr;
> +               return 1;
> +       }
>          while ((u64)len > s) {
>                  u64 size;
>                  if (unlikely(ret >= iov_size)) {



* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
                     ` (2 preceding siblings ...)
  2021-06-22 16:15   ` [PATCH v2 4/4] vhost_net: Add self test with tun device David Woodhouse
@ 2021-06-23  3:45   ` Jason Wang
  2021-06-23  8:30     ` David Woodhouse
  2021-06-23 13:52     ` David Woodhouse
  3 siblings, 2 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-23  3:45 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/23 00:15, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> In tun_get_user(), skb->protocol is either taken from the tun_pi header
> or inferred from the first byte of the packet in IFF_TUN mode, while
> eth_type_trans() is called only in the IFF_TAP mode where the payload
> is expected to be an Ethernet frame.
>
> The alternative path in tun_xdp_one() was unconditionally using
> eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to
> do the correct thing for IFF_TUN mode, as tun_get_user() does.
>
> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 4cf38be26dc9..f812dcdc640e 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun,
>   		err = -EINVAL;
>   		goto out;
>   	}
> +	switch (tun->flags & TUN_TYPE_MASK) {
> +	case IFF_TUN:
> +		if (tun->flags & IFF_NO_PI) {
> +			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
> +
> +			switch (ip_version) {
> +			case 4:
> +				skb->protocol = htons(ETH_P_IP);
> +				break;
> +			case 6:
> +				skb->protocol = htons(ETH_P_IPV6);
> +				break;
> +			default:
> +				atomic_long_inc(&tun->dev->rx_dropped);
> +				kfree_skb(skb);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		} else {
> +			struct tun_pi *pi = (struct tun_pi *)skb->data;
> +			if (!pskb_may_pull(skb, sizeof(*pi))) {
> +				atomic_long_inc(&tun->dev->rx_dropped);
> +				kfree_skb(skb);
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +			skb_pull_inline(skb, sizeof(*pi));
> +			skb->protocol = pi->proto;


As I replied to the previous version, it would be better if we could
unify this with the similar logic in tun_get_user().

Thanks


> +		}
> +
> +		skb_reset_mac_header(skb);
> +		skb->dev = tun->dev;
> +		break;
> +	case IFF_TAP:
> +		if (!pskb_may_pull(skb, ETH_HLEN)) {
> +			atomic_long_inc(&tun->dev->rx_dropped);
> +			kfree_skb(skb);
> +			err = -ENOMEM;
> +			goto out;
> +		}
> +		skb->protocol = eth_type_trans(skb, tun->dev);
> +		break;
> +	}
>   
> -	skb->protocol = eth_type_trans(skb, tun->dev);
>   	skb_reset_network_header(skb);
>   	skb_probe_transport_header(skb);
>   	skb_record_rx_queue(skb, tfile->queue_index);



* Re: [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path
  2021-06-22 16:15   ` [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
@ 2021-06-23  3:46     ` Jason Wang
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-23  3:46 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/23 12:15 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Sometimes it's just a data packet. The virtio_net_hdr processing should be
> conditional on IFF_VNET_HDR, just as it is in tun_get_user().
>
> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/tun.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index f812dcdc640e..96933887d03d 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2331,7 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   {
>   	unsigned int datasize = xdp->data_end - xdp->data;
>   	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
> -	struct virtio_net_hdr *gso = &hdr->gso;
> +	struct virtio_net_hdr *gso = NULL;
>   	struct bpf_prog *xdp_prog;
>   	struct sk_buff *skb = NULL;
>   	u32 rxhash = 0, act;
> @@ -2340,9 +2340,12 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	bool skb_xdp = false;
>   	struct page *page;
>   
> +	if (tun->flags & IFF_VNET_HDR)
> +		gso = &hdr->gso;
> +
>   	xdp_prog = rcu_dereference(tun->xdp_prog);
>   	if (xdp_prog) {
> -		if (gso->gso_type) {
> +		if (gso && gso->gso_type) {
>   			skb_xdp = true;
>   			goto build;
>   		}
> @@ -2388,7 +2391,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	skb_reserve(skb, xdp->data - xdp->data_hard_start);
>   	skb_put(skb, xdp->data_end - xdp->data);
>   
> -	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
> +	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
>   		atomic_long_inc(&tun->rx_frame_errors);
>   		kfree_skb(skb);
>   		err = -EINVAL;



* Re: [PATCH v2 3/4] vhost_net: validate virtio_net_hdr only if it exists
  2021-06-22 16:15   ` [PATCH v2 3/4] vhost_net: validate virtio_net_hdr only if it exists David Woodhouse
@ 2021-06-23  3:48     ` Jason Wang
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-23  3:48 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/23 12:15 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> When the underlying socket doesn't handle the virtio_net_hdr, the
> existing code in vhost_net_build_xdp() would attempt to validate stack
> noise, by copying zero bytes into the local copy of the header and then
> validating that. Skip the whole pointless pointer arithmetic and partial
> copy (of zero bytes) in this case.
>
> Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/vhost/net.c | 43 ++++++++++++++++++++++---------------------
>   1 file changed, 22 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index df82b124170e..1e3652eb53af 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -690,7 +690,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>   					     dev);
>   	struct socket *sock = vhost_vq_get_backend(vq);
>   	struct page_frag *alloc_frag = &net->page_frag;
> -	struct virtio_net_hdr *gso;
>   	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
>   	struct tun_xdp_hdr *hdr;
>   	size_t len = iov_iter_count(from);
> @@ -715,29 +714,31 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>   		return -ENOMEM;
>   
>   	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> -	copied = copy_page_from_iter(alloc_frag->page,
> -				     alloc_frag->offset +
> -				     offsetof(struct tun_xdp_hdr, gso),
> -				     sock_hlen, from);
> -	if (copied != sock_hlen)
> -		return -EFAULT;
> -
>   	hdr = buf;
> -	gso = &hdr->gso;
> -
> -	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> -	    vhost16_to_cpu(vq, gso->csum_start) +
> -	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> -	    vhost16_to_cpu(vq, gso->hdr_len)) {
> -		gso->hdr_len = cpu_to_vhost16(vq,
> -			       vhost16_to_cpu(vq, gso->csum_start) +
> -			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
> -
> -		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
> -			return -EINVAL;
> +	if (sock_hlen) {
> +		struct virtio_net_hdr *gso = &hdr->gso;
> +
> +		copied = copy_page_from_iter(alloc_frag->page,
> +					     alloc_frag->offset +
> +					     offsetof(struct tun_xdp_hdr, gso),
> +					     sock_hlen, from);
> +		if (copied != sock_hlen)
> +			return -EFAULT;
> +
> +		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> +		    vhost16_to_cpu(vq, gso->csum_start) +
> +		    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> +		    vhost16_to_cpu(vq, gso->hdr_len)) {
> +			gso->hdr_len = cpu_to_vhost16(vq,
> +						      vhost16_to_cpu(vq, gso->csum_start) +
> +						      vhost16_to_cpu(vq, gso->csum_offset) + 2);
> +
> +			if (vhost16_to_cpu(vq, gso->hdr_len) > len)
> +				return -EINVAL;
> +		}
> +		len -= sock_hlen;
>   	}
>   
> -	len -= sock_hlen;
>   	copied = copy_page_from_iter(alloc_frag->page,
>   				     alloc_frag->offset + pad,
>   				     len, from);



* Re: [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-22 16:15   ` [PATCH v2 4/4] vhost_net: Add self test with tun device David Woodhouse
@ 2021-06-23  4:02     ` Jason Wang
  2021-06-23 16:12       ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-23  4:02 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/23 12:15 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This creates a tun device and brings it up, then finds out the link-local
> address the kernel automatically assigns to it.
>
> It sends a ping to that address, from a fake LL address of its own, and
> then waits for a response.
>
> If the virtio_net_hdr stuff is all working correctly, it gets a response
> and manages to understand it.


I wonder whether it's worth taking on dependencies like IPv6 and the
kernel networking stack here.

How about simply using a packet socket bound to the tun device to send
and receive packets?


>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   tools/testing/selftests/Makefile              |   1 +
>   tools/testing/selftests/vhost/Makefile        |  16 +
>   tools/testing/selftests/vhost/config          |   2 +
>   .../testing/selftests/vhost/test_vhost_net.c  | 522 ++++++++++++++++++
>   4 files changed, 541 insertions(+)
>   create mode 100644 tools/testing/selftests/vhost/Makefile
>   create mode 100644 tools/testing/selftests/vhost/config
>   create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c


[...]


> +	/*
> +	 * I just want to map the *whole* of userspace address space. But
> +	 * from userspace I don't know what that is. On x86_64 it would be:
> +	 *
> +	 * vmem->regions[0].guest_phys_addr = 4096;
> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> +	 * vmem->regions[0].userspace_addr = 4096;
> +	 *
> +	 * For now, just ensure we put everything inside a single BSS region.
> +	 */
> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> +	vmem->regions[0].memory_size = sizeof(rings);


Instead of doing tricks like this, we can do it another way:

1) enable the device IOTLB
2) wait for the IOTLB miss request (iova, len) and update the identity
mapping accordingly

This should work on all archs (with some performance hit).

Thanks




* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23  3:45   ` [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode Jason Wang
@ 2021-06-23  8:30     ` David Woodhouse
  2021-06-23 13:52     ` David Woodhouse
  1 sibling, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-23  8:30 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez

[-- Attachment #1: Type: text/plain, Size: 2118 bytes --]

On Wed, 2021-06-23 at 11:45 +0800, Jason Wang wrote:
> As I replied to the previous version, it would be better if we could
> unify this with the similar logic in tun_get_user().

Ah sorry, I missed that the first time.

Yes, that was my initial inclination too. But in the tun_get_user()
case we already *have* "pi", having read it separately into a local
variable; it never made it to the skb. So the cases are subtly
different enough that abstracting it out didn't seem to make sense.

If I try harder to unify it, I suppose it looks something like this and
*might* just make the cut for "simple enough to be backported to stable
kernels in a bug fix".

I'll add the PI mode to my test cases and try it (as well as *actually*
unifying the offending code, of course).

--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2332,10 +2332,12 @@ static int tun_xdp_one(struct tun_struct *tun,
        unsigned int datasize = xdp->data_end - xdp->data;
        struct tun_xdp_hdr *hdr = xdp->data_hard_start;
        struct virtio_net_hdr *gso = NULL;
+       struct tun_pi *pi = NULL;
        struct bpf_prog *xdp_prog;
        struct sk_buff *skb = NULL;
        u32 rxhash = 0, act;
        int buflen = hdr->buflen;
+       int reservelen = xdp->data - xdp->data_hard_start;
        int err = 0;
        bool skb_xdp = false;
        struct page *page;
@@ -2343,6 +2345,11 @@ static int tun_xdp_one(struct tun_struct *tun,
        if (tun->flags & IFF_VNET_HDR)
                gso = &hdr->gso;
 
+       if (!(tun->flags & IFF_NO_PI)) {
+               pi = xdp->data;
+               reservelen += sizeof(*pi);
+       }
+
        xdp_prog = rcu_dereference(tun->xdp_prog);
        if (xdp_prog) {
                if (gso && gso->gso_type) {
@@ -2388,7 +2395,7 @@ static int tun_xdp_one(struct tun_struct *tun,
                goto out;
        }
 
-       skb_reserve(skb, xdp->data - xdp->data_hard_start);
+       skb_reserve(skb, reservelen);
        skb_put(skb, xdp->data_end - xdp->data);
 
        if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]


* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23  3:45   ` [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode Jason Wang
  2021-06-23  8:30     ` David Woodhouse
@ 2021-06-23 13:52     ` David Woodhouse
  2021-06-23 17:31       ` David Woodhouse
  2021-06-24  6:18       ` Jason Wang
  1 sibling, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-23 13:52 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Wed, 2021-06-23 at 11:45 +0800, Jason Wang wrote:
> 
> As I replied to the previous version, it would be better if we could
> unify this with the similar logic in tun_get_user().

So that ends up looking something like this (incremental).

Note the '/* XXX: frags && */' part in tun_skb_set_protocol(), because
the 'frags &&' was there in tun_get_user() and it wasn't obvious to me
whether I should be lifting that out as a separate argument to
tun_skb_set_protocol() or if there's a better way.

Either way, in my judgement this is less suitable for a stable fix and
more appropriate for a follow-on cleanup. But I don't feel that
strongly; I'm more than happy for you to overrule me on that.
Especially if you fix the above XXX part while you're at it :)

I tested this with vhost-net and !IFF_NO_PI, and TX works. RX is still
hosed on the vhost-net side, for the same reason that a bunch of test
cases were already listed in #if 0, but I'll address that in a separate
email. It's not part of *this* patch.

--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1641,6 +1641,40 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	return NULL;
 }
 
+static int tun_skb_set_protocol(struct tun_struct *tun, struct sk_buff *skb,
+				__be16 pi_proto)
+{
+	switch (tun->flags & TUN_TYPE_MASK) {
+	case IFF_TUN:
+		if (tun->flags & IFF_NO_PI) {
+			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
+
+			switch (ip_version) {
+			case 4:
+				pi_proto = htons(ETH_P_IP);
+				break;
+			case 6:
+				pi_proto = htons(ETH_P_IPV6);
+				break;
+			default:
+				return -EINVAL;
+			}
+		}
+
+		skb_reset_mac_header(skb);
+		skb->protocol = pi_proto;
+		skb->dev = tun->dev;
+		break;
+	case IFF_TAP:
+		if (/* XXX frags && */!pskb_may_pull(skb, ETH_HLEN))
+			return -ENOMEM;
+
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
+	return 0;
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
@@ -1784,37 +1818,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		return -EINVAL;
 	}
 
-	switch (tun->flags & TUN_TYPE_MASK) {
-	case IFF_TUN:
-		if (tun->flags & IFF_NO_PI) {
-			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
-
-			switch (ip_version) {
-			case 4:
-				pi.proto = htons(ETH_P_IP);
-				break;
-			case 6:
-				pi.proto = htons(ETH_P_IPV6);
-				break;
-			default:
-				atomic_long_inc(&tun->dev->rx_dropped);
-				kfree_skb(skb);
-				return -EINVAL;
-			}
-		}
-
-		skb_reset_mac_header(skb);
-		skb->protocol = pi.proto;
-		skb->dev = tun->dev;
-		break;
-	case IFF_TAP:
-		if (frags && !pskb_may_pull(skb, ETH_HLEN)) {
-			err = -ENOMEM;
-			goto drop;
-		}
-		skb->protocol = eth_type_trans(skb, tun->dev);
-		break;
-	}
+	err = tun_skb_set_protocol(tun, skb, pi.proto);
+	if (err)
+		goto drop;
 
 	/* copy skb_ubuf_info for callback when skb has no error */
 	if (zerocopy) {
@@ -2334,8 +2340,10 @@ static int tun_xdp_one(struct tun_struct *tun,
 	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
+	__be16 proto = 0;
 	u32 rxhash = 0, act;
 	int buflen = hdr->buflen;
+	int reservelen = xdp->data - xdp->data_hard_start;
 	int err = 0;
 	bool skb_xdp = false;
 	struct page *page;
@@ -2343,6 +2351,17 @@ static int tun_xdp_one(struct tun_struct *tun,
 	if (tun->flags & IFF_VNET_HDR)
 		gso = &hdr->gso;
 
+	if (!(tun->flags & IFF_NO_PI)) {
+		struct tun_pi *pi = xdp->data;
+		if (datasize < sizeof(*pi)) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return  -EINVAL;
+		}
+		proto = pi->proto;
+		reservelen += sizeof(*pi);
+		datasize -= sizeof(*pi);
+	}
+
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
 		if (gso && gso->gso_type) {
@@ -2388,8 +2407,8 @@ static int tun_xdp_one(struct tun_struct *tun,
 		goto out;
 	}
 
-	skb_reserve(skb, xdp->data - xdp->data_hard_start);
-	skb_put(skb, xdp->data_end - xdp->data);
+	skb_reserve(skb, reservelen);
+	skb_put(skb, datasize);
 
 	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
@@ -2397,48 +2416,12 @@ static int tun_xdp_one(struct tun_struct *tun,
 		err = -EINVAL;
 		goto out;
 	}
-	switch (tun->flags & TUN_TYPE_MASK) {
-	case IFF_TUN:
-		if (tun->flags & IFF_NO_PI) {
-			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
 
-			switch (ip_version) {
-			case 4:
-				skb->protocol = htons(ETH_P_IP);
-				break;
-			case 6:
-				skb->protocol = htons(ETH_P_IPV6);
-				break;
-			default:
-				atomic_long_inc(&tun->dev->rx_dropped);
-				kfree_skb(skb);
-				err = -EINVAL;
-				goto out;
-			}
-		} else {
-			struct tun_pi *pi = (struct tun_pi *)skb->data;
-			if (!pskb_may_pull(skb, sizeof(*pi))) {
-				atomic_long_inc(&tun->dev->rx_dropped);
-				kfree_skb(skb);
-				err = -ENOMEM;
-				goto out;
-			}
-			skb_pull_inline(skb, sizeof(*pi));
-			skb->protocol = pi->proto;
-		}
-
-		skb_reset_mac_header(skb);
-		skb->dev = tun->dev;
-		break;
-	case IFF_TAP:
-		if (!pskb_may_pull(skb, ETH_HLEN)) {
-			atomic_long_inc(&tun->dev->rx_dropped);
-			kfree_skb(skb);
-			err = -ENOMEM;
-			goto out;
-		}
-		skb->protocol = eth_type_trans(skb, tun->dev);
-		break;
+	err = tun_skb_set_protocol(tun, skb, proto);
+	if (err) {
+		atomic_long_inc(&tun->dev->rx_dropped);
+		kfree_skb(skb);
+		goto out;
 	}
 
 	skb_reset_network_header(skb);




* Re: [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-23  4:02     ` Jason Wang
@ 2021-06-23 16:12       ` David Woodhouse
  2021-06-24  6:12         ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-23 16:12 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
> On 2021/6/23 12:15 AM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This creates a tun device and brings it up, then finds out the link-local
> > address the kernel automatically assigns to it.
> > 
> > It sends a ping to that address, from a fake LL address of its own, and
> > then waits for a response.
> > 
> > If the virtio_net_hdr stuff is all working correctly, it gets a response
> > and manages to understand it.
> 
> 
> I wonder whether it's worth taking on dependencies like IPv6 and the
> kernel networking stack here.
> 
> How about simply using a packet socket bound to the tun device to send
> and receive packets?
> 

I pondered that but figured that using the kernel's network stack
wasn't too much of an additional dependency. We *could* use an
AF_PACKET socket on the tun device and then drive both ends, but given
that the kernel *automatically* assigns a link-local address when we
bring the device up anyway, it seemed simple enough just to use ICMP.
I also happened to have the ICMP generation/checking code lying around
anyway in the same emacs instance, so it was reduced to a previously
solved problem.

We *should* eventually expand this test case to attach an AF_PACKET
device to the vhost-net, instead of using a tun device as the back end.
(Although I don't really see *why* vhost is limited to AF_PACKET. Why
*can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)


> > +	/*
> > +	 * I just want to map the *whole* of userspace address space. But
> > +	 * from userspace I don't know what that is. On x86_64 it would be:
> > +	 *
> > +	 * vmem->regions[0].guest_phys_addr = 4096;
> > +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> > +	 * vmem->regions[0].userspace_addr = 4096;
> > +	 *
> > +	 * For now, just ensure we put everything inside a single BSS region.
> > +	 */
> > +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > +	vmem->regions[0].memory_size = sizeof(rings);
> 
> 
> Instead of doing tricks like this, we can do it another way:
> 
> 1) enable the device IOTLB
> 2) wait for the IOTLB miss request (iova, len) and update the identity
> mapping accordingly
> 
> This should work on all archs (with some performance hit).

Ick. For my actual application (OpenConnect) I'm either going to suck
it up and put in the arch-specific limits like in the comment above, or
I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
about elsewhere. (Probably the former, since if I'm requiring kernel
changes then I have grander plans around extending AF_TLS to do DTLS,
then hooking that directly up to the tun socket via BPF and a sockmap
without the data frames ever going to userspace at all.)

For this test case, a hard-coded single address range in BSS is fine.

I've now added !IFF_NO_PI support to the test case, but as noted it
fails just like the other ones I'd already marked with #if 0, which is
because vhost-net pulls some value for 'sock_hlen' out of its posterior
based on some assumption around the vhost features. And then expects
sock_recvmsg() to return precisely that number of bytes more than the
value it peeks in the skb at the head of the sock's queue.

I think I can fix *all* those test cases by making tun_get_socket()
take an extra 'int *' argument, and use that to return the *actual*
value of sock_hlen. Here's the updated test case in the meantime:


From cf74e3fc80b8fd9df697a42cfc1ff3887de18f78 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed, 23 Jun 2021 16:38:56 +0100
Subject: [PATCH] test_vhost_net: add test cases with tun_pi header

These fail too, for the same reason as the previous tests were
guarded with #if 0: vhost-net pulls 'sock_hlen' out of its
posterior and just assumes it's 10 bytes. And then barfs when
a sock_recvmsg() doesn't return precisely ten bytes more than
it peeked in the head skb:

[1296757.531103] Discarded rx packet:  len 78, expected 74

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 .../testing/selftests/vhost/test_vhost_net.c  | 97 +++++++++++++------
 1 file changed, 65 insertions(+), 32 deletions(-)

diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
index fd4a2b0e42f0..734b3015a5bd 100644
--- a/tools/testing/selftests/vhost/test_vhost_net.c
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -48,7 +48,7 @@ static unsigned char hexchar(char *hex)
 	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
 }
 
-int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr)
 {
 	int tun_fd = open("/dev/net/tun", O_RDWR);
 	if (tun_fd == -1)
@@ -56,7 +56,9 @@ int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
 
 	struct ifreq ifr = { 0 };
 
-	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	ifr.ifr_flags = IFF_TUN;
+	if (!pi)
+		ifr.ifr_flags |= IFF_NO_PI;
 	if (vnet_hdr_sz)
 		ifr.ifr_flags |= IFF_VNET_HDR;
 
@@ -249,11 +251,18 @@ static inline uint16_t csum_finish(uint32_t sum)
 	return htons((uint16_t)(~sum));
 }
 
-static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+static int create_icmp_echo(unsigned char *data, int pi, struct in6_addr *dst,
 			    struct in6_addr *src, uint16_t id, uint16_t seq)
 {
 	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
-	const int plen = sizeof(struct ip6_hdr) + icmplen;
+	int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		data += sizeof(*pi);
+		plen += sizeof(*pi);
+		pi->proto = htons(ETH_P_IPV6);
+	}
 
 	struct ip6_hdr *iph = (void *)data;
 	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
@@ -312,8 +321,21 @@ static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
 }
 
 
-static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+static int check_icmp_response(unsigned char *data, uint32_t len, int pi,
+			       struct in6_addr *dst, struct in6_addr *src)
 {
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		if (len < sizeof(*pi))
+			return 0;
+
+		if (pi->proto != htons(ETH_P_IPV6))
+			return 0;
+
+		data += sizeof(*pi);
+		len -= sizeof(*pi);
+	}
+
 	struct ip6_hdr *iph = (void *)data;
 	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
 		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
@@ -337,7 +359,7 @@ static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_add
 #endif
 
 
-int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features)
 {
 	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
 	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
@@ -353,7 +375,7 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	/* Pick up the link-local address that the kernel
 	 * assigns to the tun device. */
 	struct in6_addr tun_addr;
-	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr);
 	if (tun_fd < 0)
 		goto err;
 
@@ -387,18 +409,18 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	local_addr.s6_addr16[0] = htons(0xfe80);
 	local_addr.s6_addr16[7] = htons(1);
 
+
 	/* Set up RX and TX descriptors; the latter with ping packets ready to
 	 * send to the kernel, but don't actually send them yet. */
 	for (int i = 0; i < RING_SIZE; i++) {
 		struct pkt_buf *pkt = &rings[1].pkts[i];
-		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
-					    &local_addr, 0x4747, i);
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], pi,
+					    &tun_addr, &local_addr, 0x4747, i);
 
 		rings[1].desc[i].addr = vio64((uint64_t)pkt);
 		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
 		rings[1].avail_ring[i] = vio16(i);
 
-
 		pkt = &rings[0].pkts[i];
 		rings[0].desc[i].addr = vio64((uint64_t)pkt);
 		rings[0].desc[i].len = vio32(sizeof(*pkt));
@@ -438,9 +460,10 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 				return -1;
 
 			if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
-						&local_addr, &tun_addr)) {
+						pi, &local_addr, &tun_addr)) {
 				ret = 0;
-				printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+				printf("Success (hdr %d, xdp %d, pi %d, features %llx)\n",
+				       vnet_hdr_sz, xdp, pi, (unsigned long long)features);
 				goto err;
 			}
 
@@ -466,51 +489,61 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	return ret;
 }
 
-
-int main(void)
+/* Perform the given test with all four combinations of XDP/PI */
+int test_four(int vnet_hdr_sz, uint64_t features)
 {
-	int ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	int ret = test_vhost(vnet_hdr_sz, 0, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	ret = test_vhost(vnet_hdr_sz, 0, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+#if 0 /* These don't work *either* for the same reason as the #if 0 later */
+	ret = test_vhost(vnet_hdr_sz, 1, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	ret = test_vhost(vnet_hdr_sz, 1, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
+#endif
+}
 
-	ret = test_vhost(10, 0, 0);
-	if (ret && ret != KSFT_SKIP)
-		return ret;
+int main(void)
+{
+	int ret;
 
-	ret = test_vhost(10, 1, 0);
+	ret = test_four(10, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-#if 0 /* These ones will fail */
-	ret = test_vhost(0, 0, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+			    (1ULL << VIRTIO_F_VERSION_1)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 0, 0);
+
+#if 0
+	/*
+	 * These ones will fail, because right now vhost *assumes* that the
+	 * underlying (tun, etc.) socket will be doing a header of precisely
+	 * sizeof(struct virtio_net_hdr), if vhost isn't doing so itself due
+	 * to VHOST_NET_F_VIRTIO_NET_HDR.
+	 *
+	 * That assumption breaks both tun with no IFF_VNET_HDR, and also
+	 * presumably raw sockets. So leave these test cases disabled for
+	 * now until it's fixed.
+	 */
+	ret = test_four(0, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 1, 0);
+	ret = test_four(12, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 #endif
-- 
2.31.1




* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23 13:52     ` David Woodhouse
@ 2021-06-23 17:31       ` David Woodhouse
  2021-06-23 22:52         ` David Woodhouse
  2021-06-24  6:18       ` Jason Wang
  1 sibling, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-23 17:31 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Wed, 2021-06-23 at 14:52 +0100, David Woodhouse wrote:
> @@ -2343,6 +2351,17 @@ static int tun_xdp_one(struct tun_struct *tun,
>         if (tun->flags & IFF_VNET_HDR)
>                 gso = &hdr->gso;
>  
> +       if (!(tun->flags & IFF_NO_PI)) {
> +               struct tun_pi *pi = xdp->data;
> +               if (datasize < sizeof(*pi)) {
> +                       atomic_long_inc(&tun->rx_frame_errors);
> +                       return  -EINVAL;
> +               }
> +               proto = pi->proto;
> +               reservelen += sizeof(*pi);
> +               datasize -= sizeof(*pi);
> +       }
> +
>         xdp_prog = rcu_dereference(tun->xdp_prog);
>         if (xdp_prog) {
>                 if (gso && gso->gso_type) {

Joy... that's wrong because when tun does both the PI and the vnet
headers, the PI header comes *first*. When tun does only PI and vhost
does the vnet headers, they come in the other order.

Will fix (and adjust the test cases to cope).




* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23 17:31       ` David Woodhouse
@ 2021-06-23 22:52         ` David Woodhouse
  2021-06-24  6:37           ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-23 22:52 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez

[-- Attachment #1: Type: text/plain, Size: 31057 bytes --]

On Wed, 2021-06-23 at 18:31 +0100, David Woodhouse wrote:
> 
> Joy... that's wrong because when tun does both the PI and the vnet
> headers, the PI header comes *first*. When tun does only PI and vhost
> does the vnet headers, they come in the other order.
> 
> Will fix (and adjust the test cases to cope).


I got this far, pushed to
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vhost-net

All the test cases are now passing. I don't guarantee I haven't
actually broken qemu and IFF_TAP mode though, mind you :)

I'll need to refactor the intermediate commits a little so I won't
repost the series quite yet, but figured I should at least show what I
have for comments, as my day ends and yours begins.


As discussed, I expanded tun_get_socket()/tap_get_socket() to return
the actual header length instead of letting vhost make wild guesses.
Note that in doing so, I have made tun_get_socket() return -ENOTCONN if
the tun fd *isn't* actually attached (TUNSETIFF) to a real device yet.

I moved the sanity check back to tun/tap instead of doing it in
vhost_net_build_xdp(), because the latter has no clue about the tun PI
header and doesn't know *where* the virtio header is.


diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 8e3a28ba6b28..d1b1f1de374e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1132,16 +1132,35 @@ static const struct file_operations tap_fops = {
 static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 {
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr *gso = NULL;
 	int buflen = hdr->buflen;
 	int vnet_hdr_len = 0;
 	struct tap_dev *tap;
 	struct sk_buff *skb;
 	int err, depth;
 
-	if (q->flags & IFF_VNET_HDR)
+	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
+		if (xdp->data != xdp->data_hard_start + sizeof(*hdr) + vnet_hdr_len) {
+			err = -EINVAL;
+			goto err;
+		}
+
+		gso = (void *)&hdr[1];
+
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		     tap16_to_cpu(q, gso->csum_start) +
+		     tap16_to_cpu(q, gso->csum_offset) + 2 >
+			     tap16_to_cpu(q, gso->hdr_len))
+			gso->hdr_len = cpu_to_tap16(q,
+				 tap16_to_cpu(q, gso->csum_start) +
+				 tap16_to_cpu(q, gso->csum_offset) + 2);
 
+		if (tap16_to_cpu(q, gso->hdr_len) > xdp->data_end - xdp->data) {
+			err = -EINVAL;
+			goto err;
+		}
+	}
 	skb = build_skb(xdp->data_hard_start, buflen);
 	if (!skb) {
 		err = -ENOMEM;
@@ -1155,7 +1174,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 	skb_reset_mac_header(skb);
 	skb->protocol = eth_hdr(skb)->h_proto;
 
-	if (vnet_hdr_len) {
+	if (gso) {
 		err = virtio_net_hdr_to_skb(skb, gso, tap_is_little_endian(q));
 		if (err)
 			goto err_kfree;
@@ -1246,7 +1265,7 @@ static const struct proto_ops tap_socket_ops = {
  * attached to a device.  The returned object works like a packet socket, it
  * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
  * holding a reference to the file for as long as the socket is in use. */
-struct socket *tap_get_socket(struct file *file)
+struct socket *tap_get_socket(struct file *file, size_t *hlen)
 {
 	struct tap_queue *q;
 	if (file->f_op != &tap_fops)
@@ -1254,6 +1273,9 @@ struct socket *tap_get_socket(struct file *file)
 	q = file->private_data;
 	if (!q)
 		return ERR_PTR(-EBADFD);
+	if (hlen)
+		*hlen = (q->flags & IFF_VNET_HDR) ? q->vnet_hdr_sz : 0;
+
 	return &q->sock;
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4cf38be26dc9..72f8a04f493b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1641,6 +1641,40 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	return NULL;
 }
 
+static int tun_skb_set_protocol(struct tun_struct *tun, struct sk_buff *skb,
+				__be16 pi_proto)
+{
+	switch (tun->flags & TUN_TYPE_MASK) {
+	case IFF_TUN:
+		if (tun->flags & IFF_NO_PI) {
+			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
+
+			switch (ip_version) {
+			case 4:
+				pi_proto = htons(ETH_P_IP);
+				break;
+			case 6:
+				pi_proto = htons(ETH_P_IPV6);
+				break;
+			default:
+				return -EINVAL;
+			}
+		}
+
+		skb_reset_mac_header(skb);
+		skb->protocol = pi_proto;
+		skb->dev = tun->dev;
+		break;
+	case IFF_TAP:
+		if (/* frags && */!pskb_may_pull(skb, ETH_HLEN))
+			return -ENOMEM;
+
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
+	return 0;
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
@@ -1784,37 +1818,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		return -EINVAL;
 	}
 
-	switch (tun->flags & TUN_TYPE_MASK) {
-	case IFF_TUN:
-		if (tun->flags & IFF_NO_PI) {
-			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
-
-			switch (ip_version) {
-			case 4:
-				pi.proto = htons(ETH_P_IP);
-				break;
-			case 6:
-				pi.proto = htons(ETH_P_IPV6);
-				break;
-			default:
-				atomic_long_inc(&tun->dev->rx_dropped);
-				kfree_skb(skb);
-				return -EINVAL;
-			}
-		}
-
-		skb_reset_mac_header(skb);
-		skb->protocol = pi.proto;
-		skb->dev = tun->dev;
-		break;
-	case IFF_TAP:
-		if (frags && !pskb_may_pull(skb, ETH_HLEN)) {
-			err = -ENOMEM;
-			goto drop;
-		}
-		skb->protocol = eth_type_trans(skb, tun->dev);
-		break;
-	}
+	err = tun_skb_set_protocol(tun, skb, pi.proto);
+	if (err)
+		goto drop;
 
 	/* copy skb_ubuf_info for callback when skb has no error */
 	if (zerocopy) {
@@ -2331,18 +2337,48 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	void *tun_hdr = &hdr[1];
+	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
+	__be16 proto = 0;
 	u32 rxhash = 0, act;
 	int buflen = hdr->buflen;
 	int err = 0;
 	bool skb_xdp = false;
 	struct page *page;
 
+	if (!(tun->flags & IFF_NO_PI)) {
+		struct tun_pi *pi = tun_hdr;
+		tun_hdr += sizeof(*pi);
+
+		if (tun_hdr > xdp->data) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return -EINVAL;
+		}
+		proto = pi->proto;
+	}
+
+	if (tun->flags & IFF_VNET_HDR) {
+		gso = tun_hdr;
+		tun_hdr += sizeof(*gso);
+
+		if (tun_hdr > xdp->data) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return -EINVAL;
+		}
+
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		    tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2 > tun16_to_cpu(tun, gso->hdr_len))
+			gso->hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2);
+
+		if (tun16_to_cpu(tun, gso->hdr_len) > datasize)
+			return -EINVAL;
+	}
+
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
-		if (gso->gso_type) {
+		if (gso && gso->gso_type) {
 			skb_xdp = true;
 			goto build;
 		}
@@ -2386,16 +2422,22 @@ static int tun_xdp_one(struct tun_struct *tun,
 	}
 
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
-	skb_put(skb, xdp->data_end - xdp->data);
+	skb_put(skb, datasize);
 
-	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
 		kfree_skb(skb);
 		err = -EINVAL;
 		goto out;
 	}
 
-	skb->protocol = eth_type_trans(skb, tun->dev);
+	err = tun_skb_set_protocol(tun, skb, proto);
+	if (err) {
+		atomic_long_inc(&tun->dev->rx_dropped);
+		kfree_skb(skb);
+		goto out;
+	}
+
 	skb_reset_network_header(skb);
 	skb_probe_transport_header(skb);
 	skb_record_rx_queue(skb, tfile->queue_index);
@@ -3649,7 +3691,7 @@ static void tun_cleanup(void)
  * attached to a device.  The returned object works like a packet socket, it
  * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
  * holding a reference to the file for as long as the socket is in use. */
-struct socket *tun_get_socket(struct file *file)
+struct socket *tun_get_socket(struct file *file, size_t *hlen)
 {
 	struct tun_file *tfile;
 	if (file->f_op != &tun_fops)
@@ -3657,6 +3699,20 @@ struct socket *tun_get_socket(struct file *file)
 	tfile = file->private_data;
 	if (!tfile)
 		return ERR_PTR(-EBADFD);
+
+	if (hlen) {
+		struct tun_struct *tun = tun_get(tfile);
+		size_t len = 0;
+
+		if (!tun)
+			return ERR_PTR(-ENOTCONN);
+		if (tun->flags & IFF_VNET_HDR)
+			len += READ_ONCE(tun->vnet_hdr_sz);
+		if (!(tun->flags & IFF_NO_PI))
+			len += sizeof(struct tun_pi);
+		tun_put(tun);
+		*hlen = len;
+	}
 	return &tfile->socket;
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..d9491c620a9c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -690,7 +690,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
 	struct page_frag *alloc_frag = &net->page_frag;
-	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
 	size_t len = iov_iter_count(from);
@@ -715,29 +714,18 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOMEM;
 
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
-
 	hdr = buf;
-	gso = &hdr->gso;
-
-	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-	    vhost16_to_cpu(vq, gso->csum_start) +
-	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
-	    vhost16_to_cpu(vq, gso->hdr_len)) {
-		gso->hdr_len = cpu_to_vhost16(vq,
-			       vhost16_to_cpu(vq, gso->csum_start) +
-			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
-		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-			return -EINVAL;
+	if (sock_hlen) {
+		copied = copy_page_from_iter(alloc_frag->page,
+					     alloc_frag->offset +
+					     sizeof(struct tun_xdp_hdr),
+					     sock_hlen, from);
+		if (copied != sock_hlen)
+			return -EFAULT;
+
+		len -= sock_hlen;
 	}
 
-	len -= sock_hlen;
 	copied = copy_page_from_iter(alloc_frag->page,
 				     alloc_frag->offset + pad,
 				     len, from);
@@ -1420,7 +1408,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	return 0;
 }
 
-static struct socket *get_raw_socket(int fd)
+static struct socket *get_raw_socket(int fd, size_t *hlen)
 {
 	int r;
 	struct socket *sock = sockfd_lookup(fd, &r);
@@ -1438,6 +1426,7 @@ static struct socket *get_raw_socket(int fd)
 		r = -EPFNOSUPPORT;
 		goto err;
 	}
+	*hlen = 0;
 	return sock;
 err:
 	sockfd_put(sock);
@@ -1463,33 +1452,33 @@ static struct ptr_ring *get_tap_ptr_ring(int fd)
 	return ring;
 }
 
-static struct socket *get_tap_socket(int fd)
+static struct socket *get_tap_socket(int fd, size_t *hlen)
 {
 	struct file *file = fget(fd);
 	struct socket *sock;
 
 	if (!file)
 		return ERR_PTR(-EBADF);
-	sock = tun_get_socket(file);
+	sock = tun_get_socket(file, hlen);
 	if (!IS_ERR(sock))
 		return sock;
-	sock = tap_get_socket(file);
+	sock = tap_get_socket(file, hlen);
 	if (IS_ERR(sock))
 		fput(file);
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_socket(int fd, size_t *hlen)
 {
 	struct socket *sock;
 
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
-	sock = get_raw_socket(fd);
+	sock = get_raw_socket(fd, hlen);
 	if (!IS_ERR(sock))
 		return sock;
-	sock = get_tap_socket(fd);
+	sock = get_tap_socket(fd, hlen);
 	if (!IS_ERR(sock))
 		return sock;
 	return ERR_PTR(-ENOTSOCK);
@@ -1521,7 +1510,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(fd, &nvq->sock_hlen);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
@@ -1621,7 +1610,7 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
-	size_t vhost_hlen, sock_hlen, hdr_len;
+	size_t vhost_hlen, hdr_len;
 	int i;
 
 	hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
@@ -1631,11 +1620,8 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
-		sock_hlen = 0;
 	} else {
-		/* socket provides vnet_hdr */
 		vhost_hlen = 0;
-		sock_hlen = hdr_len;
 	}
 	mutex_lock(&n->dev.mutex);
 	if ((features & (1 << VHOST_F_LOG_ALL)) &&
@@ -1651,7 +1637,6 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 		mutex_lock(&n->vqs[i].vq.mutex);
 		n->vqs[i].vq.acked_features = features;
 		n->vqs[i].vhost_hlen = vhost_hlen;
-		n->vqs[i].sock_hlen = sock_hlen;
 		mutex_unlock(&n->vqs[i].vq.mutex);
 	}
 	mutex_unlock(&n->dev.mutex);
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 915a187cfabd..b460ba98f34e 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,14 +3,14 @@
 #define _LINUX_IF_TAP_H_
 
 #if IS_ENABLED(CONFIG_TAP)
-struct socket *tap_get_socket(struct file *);
+struct socket *tap_get_socket(struct file *, size_t *);
 struct ptr_ring *tap_get_ptr_ring(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
 struct file;
 struct socket;
-static inline struct socket *tap_get_socket(struct file *f)
+static inline struct socket *tap_get_socket(struct file *f, size_t *)
 {
 	return ERR_PTR(-EINVAL);
 }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 2a7660843444..8d78b6bbc228 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,11 +21,10 @@ struct tun_msg_ctl {
 
 struct tun_xdp_hdr {
 	int buflen;
-	struct virtio_net_hdr gso;
 };
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
-struct socket *tun_get_socket(struct file *);
+struct socket *tun_get_socket(struct file *, size_t *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -45,7 +44,7 @@ void tun_ptr_free(void *ptr);
 #include <linux/errno.h>
 struct file;
 struct socket;
-static inline struct socket *tun_get_socket(struct file *f)
+static inline struct socket *tun_get_socket(struct file *f, size_t *)
 {
 	return ERR_PTR(-EINVAL);
 }
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..300c03cfd0c7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@ TARGETS += user
 TARGETS += vDSO
 TARGETS += vm
 TARGETS += x86
+TARGETS += vhost
 TARGETS += zram
 #Please keep the TARGETS list alphabetically sorted
 # Run "make quicktest=1 run_tests" or
diff --git a/tools/testing/selftests/vhost/Makefile b/tools/testing/selftests/vhost/Makefile
new file mode 100644
index 000000000000..f5e565d80733
--- /dev/null
+++ b/tools/testing/selftests/vhost/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+include ../lib.mk
+
+.PHONY: all clean
+
+BINARIES := test_vhost_net
+
+test_vhost_net: test_vhost_net.c ../kselftest.h ../kselftest_harness.h
+	$(CC) $(CFLAGS) -g $< -o $@
+
+TEST_PROGS += $(BINARIES)
+EXTRA_CLEAN := $(BINARIES)
+
+all: $(BINARIES)
diff --git a/tools/testing/selftests/vhost/config b/tools/testing/selftests/vhost/config
new file mode 100644
index 000000000000..6391c1f32c34
--- /dev/null
+++ b/tools/testing/selftests/vhost/config
@@ -0,0 +1,2 @@
+CONFIG_VHOST_NET=y
+CONFIG_TUN=y
diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
new file mode 100644
index 000000000000..747f0e5e4f57
--- /dev/null
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -0,0 +1,530 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+#include "../kselftest_harness.h"
+#include "../../../virtio/asm/barrier.h"
+
+#include <sys/eventfd.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <net/if.h>
+#include <sys/socket.h>
+
+#include <netinet/tcp.h>
+#include <netinet/ip.h>
+#include <netinet/ip_icmp.h>
+#include <netinet/ip6.h>
+#include <netinet/icmp6.h>
+
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+
+static unsigned char hexnybble(char hex)
+{
+	switch (hex) {
+	case '0'...'9':
+		return hex - '0';
+	case 'a'...'f':
+		return 10 + hex - 'a';
+	case 'A'...'F':
+		return 10 + hex - 'A';
+	default:
+		exit (KSFT_SKIP);
+	}
+}
+
+static unsigned char hexchar(char *hex)
+{
+	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
+}
+
+int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr)
+{
+	int tun_fd = open("/dev/net/tun", O_RDWR);
+	if (tun_fd == -1)
+		return -1;
+
+	struct ifreq ifr = { 0 };
+
+	ifr.ifr_flags = IFF_TUN;
+	if (!pi)
+		ifr.ifr_flags |= IFF_NO_PI;
+	if (vnet_hdr_sz)
+		ifr.ifr_flags |= IFF_VNET_HDR;
+
+	if (ioctl(tun_fd, TUNSETIFF, (void *)&ifr) < 0)
+		goto out_tun;
+
+	if (vnet_hdr_sz &&
+	    ioctl(tun_fd, TUNSETVNETHDRSZ, &vnet_hdr_sz) < 0)
+		goto out_tun;
+
+	int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP);
+	if (sockfd == -1)
+		goto out_tun;
+
+	if (ioctl(sockfd, SIOCGIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	ifr.ifr_flags |= IFF_UP;
+	if (ioctl(sockfd, SIOCSIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	close(sockfd);
+
+	FILE *inet6 = fopen("/proc/net/if_inet6", "r");
+	if (!inet6)
+		goto out_tun;
+
+	char buf[80];
+	while (fgets(buf, sizeof(buf), inet6)) {
+		size_t len = strlen(buf), namelen = strlen(ifr.ifr_name);
+		if (!strncmp(buf, "fe80", 4) &&
+		    buf[len - namelen - 2] == ' ' &&
+		    !strncmp(buf + len - namelen - 1, ifr.ifr_name, namelen)) {
+			for (int i = 0; i < 16; i++) {
+				addr->s6_addr[i] = hexchar(buf + i*2);
+			}
+			fclose(inet6);
+			return tun_fd;
+		}
+	}
+	/* Not found */
+	fclose(inet6);
+ out_sock:
+	close(sockfd);
+ out_tun:
+	close(tun_fd);
+	return -1;
+}
+
+#define RING_SIZE 32
+#define RING_MASK(x) ((x) & (RING_SIZE-1))
+
+struct pkt_buf {
+	unsigned char data[2048];
+};
+
+struct test_vring {
+	struct vring_desc desc[RING_SIZE];
+	struct vring_avail avail;
+	__virtio16 avail_ring[RING_SIZE];
+	struct vring_used used;
+	struct vring_used_elem used_ring[RING_SIZE];
+	struct pkt_buf pkts[RING_SIZE];
+} rings[2];
+
+static int setup_vring(int vhost_fd, int tun_fd, int call_fd, int kick_fd, int idx)
+{
+	struct test_vring *vring = &rings[idx];
+	int ret;
+
+	memset(vring, 0, sizeof(*vring));
+
+	struct vhost_vring_state vs = { };
+	vs.index = idx;
+	vs.num = RING_SIZE;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &vs) < 0) {
+		perror("VHOST_SET_VRING_NUM");
+		return -1;
+	}
+
+	vs.num = 0;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_BASE, &vs) < 0) {
+		perror("VHOST_SET_VRING_BASE");
+		return -1;
+	}
+
+	struct vhost_vring_addr va = { };
+	va.index = idx;
+	va.desc_user_addr = (uint64_t)vring->desc;
+	va.avail_user_addr = (uint64_t)&vring->avail;
+	va.used_user_addr  = (uint64_t)&vring->used;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &va) < 0) {
+		perror("VHOST_SET_VRING_ADDR");
+		return -1;
+	}
+
+	struct vhost_vring_file vf = { };
+	vf.index = idx;
+	vf.fd = tun_fd;
+	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf) < 0) {
+		perror("VHOST_NET_SET_BACKEND");
+		return -1;
+	}
+
+	vf.fd = call_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &vf) < 0) {
+		perror("VHOST_SET_VRING_CALL");
+		return -1;
+	}
+
+	vf.fd = kick_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &vf) < 0) {
+		perror("VHOST_SET_VRING_KICK");
+		return -1;
+	}
+
+	return 0;
+}
+
+int setup_vhost(int vhost_fd, int tun_fd, int call_fd, int kick_fd, uint64_t want_features)
+{
+	int ret;
+
+	if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) {
+		perror("VHOST_SET_OWNER");
+		return -1;
+	}
+
+	uint64_t features;
+	if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) {
+		perror("VHOST_GET_FEATURES");
+		return -1;
+	}
+
+	if ((features & want_features) != want_features)
+		return KSFT_SKIP;
+
+	if (ioctl(vhost_fd, VHOST_SET_FEATURES, &want_features) < 0) {
+		perror("VHOST_SET_FEATURES");
+		return -1;
+	}
+
+	struct vhost_memory *vmem = alloca(sizeof(*vmem) + sizeof(vmem->regions[0]));
+
+	memset(vmem, 0, sizeof(*vmem) + sizeof(vmem->regions[0]));
+	vmem->nregions = 1;
+	/*
+	 * I just want to map the *whole* of userspace address space. But
+	 * from userspace I don't know what that is. On x86_64 it would be:
+	 *
+	 * vmem->regions[0].guest_phys_addr = 4096;
+	 * vmem->regions[0].memory_size = 0x7fffffffe000;
+	 * vmem->regions[0].userspace_addr = 4096;
+	 *
+	 * For now, just ensure we put everything inside a single BSS region.
+	 */
+	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
+	vmem->regions[0].userspace_addr = (uint64_t)&rings;
+	vmem->regions[0].memory_size = sizeof(rings);
+
+	if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
+		perror("VHOST_SET_MEM_TABLE");
+		return -1;
+	}
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 0))
+		return -1;
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 1))
+		return -1;
+
+	return 0;
+}
+
+
+static char ping_payload[16] = "VHOST TEST PACKT";
+
+static inline uint32_t csum_partial(uint16_t *buf, int nwords)
+{
+	uint32_t sum = 0;
+	for(sum=0; nwords>0; nwords--)
+		sum += ntohs(*buf++);
+	return sum;
+}
+
+static inline uint16_t csum_finish(uint32_t sum)
+{
+	sum = (sum >> 16) + (sum &0xffff);
+	sum += (sum >> 16);
+	return htons((uint16_t)(~sum));
+}
+
+static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+			    struct in6_addr *src, uint16_t id, uint16_t seq)
+{
+	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
+	const int plen = sizeof(struct ip6_hdr) + icmplen;
+	struct ip6_hdr *iph = (void *)data;
+	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
+
+	/* IPv6 Header */
+	iph->ip6_flow = htonl((6 << 28) + /* version 6 */
+			      (0 << 20) + /* traffic class */
+			      (0 << 0));  /* flow ID  */
+	iph->ip6_nxt = IPPROTO_ICMPV6;
+	iph->ip6_plen = htons(icmplen);
+	iph->ip6_hlim = 128;
+	iph->ip6_src = *src;
+	iph->ip6_dst = *dst;
+
+	/* ICMPv6 echo request */
+	icmph->icmp6_type = ICMP6_ECHO_REQUEST;
+	icmph->icmp6_code = 0;
+	icmph->icmp6_data16[0] = htons(id);	/* ID */
+	icmph->icmp6_data16[1] = htons(seq);	/* sequence */
+
+	/* Some arbitrary payload */
+	memcpy(&icmph[1], ping_payload, sizeof(ping_payload));
+
+	/*
+	 * IPv6 upper-layer checksums include a pseudo-header
+	 * for IPv6 which contains the source address, the
+	 * destination address, the upper-layer packet length
+	 * and next-header field. See RFC8200 §8.1. The
+	 * checksum is as follows:
+	 *
+	 *   checksum 32 bytes of real IPv6 header:
+	 *     src addr (16 bytes)
+	 *     dst addr (16 bytes)
+	 *   8 bytes more:
+	 *     length of ICMPv6 in bytes (be32)
+	 *     3 bytes of 0
+	 *     next header byte (IPPROTO_ICMPV6)
+	 *   Then the actual ICMPv6 bytes
+	 */
+	uint32_t sum = csum_partial((uint16_t *)&iph->ip6_src, 8);      /* 8 uint16_t */
+	sum += csum_partial((uint16_t *)&iph->ip6_dst, 8);              /* 8 uint16_t */
+
+	/* The easiest way to checksum the following 8-byte
+	 * part of the pseudo-header without horridly violating
+	 * C type aliasing rules is *not* to build it in memory
+	 * at all. We know the length fits in 16 bits so the
+	 * partial checksum of 00 00 LL LL 00 00 00 NH ends up
+	 * being just LLLL + NH.
+	 */
+	sum += IPPROTO_ICMPV6;
+	sum += ICMP_MINLEN + sizeof(ping_payload);
+
+	sum += csum_partial((uint16_t *)icmph, icmplen / 2);
+	icmph->icmp6_cksum = csum_finish(sum);
+	return plen;
+}
+
+
+static int check_icmp_response(unsigned char *data, uint32_t len,
+			       struct in6_addr *dst, struct in6_addr *src)
+{
+	struct ip6_hdr *iph = (void *)data;
+	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
+		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
+		 && !memcmp(&iph->ip6_src, src, 16) /* source == tun's address */
+		 && !memcmp(&iph->ip6_dst, dst, 16) /* destination == our fake LL address */
+		 && len >= 40 + ICMP_MINLEN + sizeof(ping_payload) /* No short-packet segfaults */
+		 && data[40] == ICMP6_ECHO_REPLY /* ICMPv6 reply */
+		 && !memcmp(&data[40 + ICMP_MINLEN], ping_payload, sizeof(ping_payload)) /* Same payload in response */
+		 );
+
+}
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define vio16(x) (x)
+#define vio32(x) (x)
+#define vio64(x) (x)
+#else
+#define vio16(x) __builtin_bswap16(x)
+#define vio32(x) __builtin_bswap32(x)
+#define vio64(x) __builtin_bswap64(x)
+#endif
+
+
+int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features)
+{
+	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int vhost_fd = open("/dev/vhost-net", O_RDWR);
+	int tun_fd = -1;
+	int ret = KSFT_SKIP;
+
+	if (call_fd < 0 || kick_fd < 0 || vhost_fd < 0)
+		goto err;
+
+	memset(rings, 0, sizeof(rings));
+
+	/* Pick up the link-local address that the kernel
+	 * assigns to the tun device. */
+	struct in6_addr tun_addr;
+	tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr);
+	if (tun_fd < 0)
+		goto err;
+
+	int pi_offset = -1;
+	int data_offset = vnet_hdr_sz;
+
+	/* The tun device puts PI *first*, before the vnet hdr */
+	if (pi) {
+		pi_offset = 0;
+		data_offset += sizeof(struct tun_pi);
+	}
+
+	/* If vhost is doing a vnet hdr, it comes before all else */
+	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
+		int vhost_hdr_sz = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+						(1ULL << VIRTIO_F_VERSION_1))) ?
+			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
+			sizeof(struct virtio_net_hdr);
+
+		data_offset += vhost_hdr_sz;
+		if (pi_offset != -1)
+			pi_offset += vhost_hdr_sz;
+	}
+
+	if (!xdp) {
+		int sndbuf = RING_SIZE * 2048;
+		if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0) {
+			perror("TUNSETSNDBUF");
+			ret = -1;
+			goto err;
+		}
+	}
+
+	ret = setup_vhost(vhost_fd, tun_fd, call_fd, kick_fd, features);
+	if (ret)
+		goto err;
+
+	/* A fake link-local address for the userspace end */
+	struct in6_addr local_addr = { 0 };
+	local_addr.s6_addr16[0] = htons(0xfe80);
+	local_addr.s6_addr16[7] = htons(1);
+
+
+	/* Set up RX and TX descriptors; the latter with ping packets ready to
+	 * send to the kernel, but don't actually send them yet. */
+	for (int i = 0; i < RING_SIZE; i++) {
+		struct pkt_buf *pkt = &rings[1].pkts[i];
+		if (pi_offset != -1) {
+			struct tun_pi *pi = (void *)&pkt->data[pi_offset];
+			pi->proto = htons(ETH_P_IPV6);
+		}
+		int plen = create_icmp_echo(&pkt->data[data_offset], &tun_addr,
+					    &local_addr, 0x4747, i);
+
+		rings[1].desc[i].addr = vio64((uint64_t)pkt);
+		rings[1].desc[i].len = vio32(plen + data_offset);
+		rings[1].avail_ring[i] = vio16(i);
+
+		pkt = &rings[0].pkts[i];
+		rings[0].desc[i].addr = vio64((uint64_t)pkt);
+		rings[0].desc[i].len = vio32(sizeof(*pkt));
+		rings[0].desc[i].flags = vio16(VRING_DESC_F_WRITE);
+		rings[0].avail_ring[i] = vio16(i);
+	}
+	barrier();
+	rings[1].avail.idx = vio16(1);
+
+	uint16_t rx_seen_used = 0;
+	struct timeval tv = { 1, 0 };
+	while (1) {
+		fd_set rfds = { 0 };
+		FD_SET(call_fd, &rfds);
+
+		rings[0].avail.idx = vio16(rx_seen_used + RING_SIZE);
+		barrier();
+		eventfd_write(kick_fd, 1);
+
+		if (select(call_fd + 1, &rfds, NULL, NULL, &tv) <= 0) {
+			ret = -1;
+			goto err;
+		}
+
+		uint16_t rx_used_idx = vio16(rings[0].used.idx);
+		barrier();
+
+		while(rx_used_idx != rx_seen_used) {
+			uint32_t desc = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].id);
+			uint32_t len  = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].len);
+
+			if (desc >= RING_SIZE || len < data_offset)
+				return -1;
+
+			uint64_t addr = vio64(rings[0].desc[desc].addr);
+			if (!addr)
+				return -1;
+
+			if (len > data_offset &&
+			    (pi_offset == -1 ||
+			     ((struct tun_pi *)(addr + pi_offset))->proto == htons(ETH_P_IPV6)) &&
+			    check_icmp_response((void *)(addr + data_offset), len - data_offset,
+						&local_addr, &tun_addr)) {
+				ret = 0;
+				goto err;
+			}
+
+			/* Give the same buffer back */
+			rings[0].avail_ring[RING_MASK(rx_seen_used++)] = vio16(desc);
+		}
+		barrier();
+
+		uint64_t ev_val;
+		eventfd_read(call_fd, &ev_val);
+	}
+
+ err:
+	if (call_fd != -1)
+		close(call_fd);
+	if (kick_fd != -1)
+		close(kick_fd);
+	if (vhost_fd != -1)
+		close(vhost_fd);
+	if (tun_fd != -1)
+		close(tun_fd);
+
+	printf("TEST: (hdr %d, xdp %d, pi %d, features %llx) RESULT: %d\n",
+	       vnet_hdr_sz, xdp, pi, (unsigned long long)features, ret);
+	return ret;
+}
+
+/* For iterating over all permutations. */
+#define VHDR_LEN_BITS	3	/* Tun vhdr length selection */
+#define XDP_BIT		4	/* Don't TUNSETSNDBUF, so we use XDP */
+#define PI_BIT		8	/* Don't set IFF_NO_PI */
+#define VIRTIO_V1_BIT	16	/* Use VIRTIO_F_VERSION_1 feature */
+#define VHOST_HDR_BIT	32	/* Use VHOST_NET_F_VIRTIO_NET_HDR */
+
+unsigned int tun_vhdr_lens[] = { 0, 10, 12, 20 };
+
+int main(void)
+{
+	int result = KSFT_SKIP;
+	int i, ret;
+
+	for (i = 0; i < 64; i++) {
+		uint64_t features = 0;
+
+		if (i & VIRTIO_V1_BIT)
+			features |= (1ULL << VIRTIO_F_VERSION_1);
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+		else
+			continue; /* We'd need vio16 et al not to byteswap */
+#endif
+
+		if (i & VHOST_HDR_BIT) {
+			features |= (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);
+
+			/* Even though the test actually passes at the time of
+			 * writing, don't bother to try asking tun *and* vhost
+			 * both to handle a virtio_net_hdr at the same time.
+			 * That's just silly.  */
+			if (i & VHDR_LEN_BITS)
+				continue;
+		}
+
+		ret = test_vhost(tun_vhdr_lens[i & VHDR_LEN_BITS],
+				 !!(i & PI_BIT), !!(i & XDP_BIT), features);
+		if (ret < result)
+			result = ret;
+	}
+
+	return result;
+}





* Re: [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-23 16:12       ` David Woodhouse
@ 2021-06-24  6:12         ` Jason Wang
  2021-06-24 10:42           ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-24  6:12 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/24 00:12, David Woodhouse wrote:
> On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
>> On 2021/6/23 00:15, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This creates a tun device and brings it up, then finds out the link-local
>>> address the kernel automatically assigns to it.
>>>
>>> It sends a ping to that address, from a fake LL address of its own, and
>>> then waits for a response.
>>>
>>> If the virtio_net_hdr stuff is all working correctly, it gets a response
>>> and manages to understand it.
>>
>> I wonder whether it worth to bother the dependency like ipv6 or kernel
>> networking stack.
>>
>> How about simply use packet socket that is bound to tun to receive and
>> send packets?
>>
> I pondered that but figured that using the kernel's network stack
> wasn't too much of an additional dependency. We *could* use an
> AF_PACKET socket on the tun device and then drive both ends, but given
> that the kernel *automatically* assigns a link-local address when we
> bring the device up anyway, it seemed simple enough just to use ICMP.
> I also happened to have the ICMP generation/checking code lying around
> anyway in the same emacs instance, so it was reduced to a previously
> solved problem.


Ok.


>
> We *should* eventually expand this test case to attach an AF_PACKET
> device to the vhost-net, instead of using a tun device as the back end.
> (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)


It's just because nobody wrote the code. And we're lacking a real use
case.

Vhost_net is basically used for accepting packets from userspace into
the kernel networking stack.

Using AF_UNIX makes it look more like a case of inter-process
communication (without the vnet header it won't be used efficiently by a
VM). In that case, using io_uring is much more suitable.

Or, thinking about it another way: instead of depending on vhost_net, we
could expose the TUN/TAP socket to userspace, so that io_uring could be
used for the OpenConnect case as well?


>
>
>>> +	/*
>>> +	 * I just want to map the *whole* of userspace address space. But
>>> +	 * from userspace I don't know what that is. On x86_64 it would be:
>>> +	 *
>>> +	 * vmem->regions[0].guest_phys_addr = 4096;
>>> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
>>> +	 * vmem->regions[0].userspace_addr = 4096;
>>> +	 *
>>> +	 * For now, just ensure we put everything inside a single BSS region.
>>> +	 */
>>> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
>>> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
>>> +	vmem->regions[0].memory_size = sizeof(rings);
>>
>> Instead of doing tricks like this, we can do it in another way:
>>
>> 1) enable device IOTLB
>> 2) wait for the IOTLB miss request (iova, len) and update identity
>> mapping accordingly
>>
>> This should work for all the archs (with some performance hit).
> Ick. For my actual application (OpenConnect) I'm either going to suck
> it up and put in the arch-specific limits like in the comment above, or
> I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> about elsewhere.


The feature could be useful for the case of vhost-vDPA as well.


>   (Probably the former, since if I'm requiring kernel
> changes then I have grander plans around extending AF_TLS to do DTLS,
> then hooking that directly up to the tun socket via BPF and a sockmap
> without the data frames ever going to userspace at all.)


Ok, I guess we need to make sockmap work for the tun socket.


>
> For this test case, a hard-coded single address range in BSS is fine.
>
> I've now added !IFF_NO_PI support to the test case, but as noted it
> fails just like the other ones I'd already marked with #if 0, which is
> because vhost-net pulls some value for 'sock_hlen' out of its posterior
> based on some assumption around the vhost features. And then expects
> sock_recvmsg() to return precisely that number of bytes more than the
> value it peeks in the skb at the head of the sock's queue.
>
> I think I can fix *all* those test cases by making tun_get_socket()
> take an extra 'int *' argument, and use that to return the *actual*
> value of sock_hlen. Here's the updated test case in the meantime:


It would be better if you could post a new version of the whole series
to ease reviewing.

Thanks




* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23 13:52     ` David Woodhouse
  2021-06-23 17:31       ` David Woodhouse
@ 2021-06-24  6:18       ` Jason Wang
  2021-06-24  7:05         ` David Woodhouse
  1 sibling, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-24  6:18 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/23 21:52, David Woodhouse wrote:
> On Wed, 2021-06-23 at 11:45 +0800, Jason Wang wrote:
>> As I replied to the previous version, it would be better if we could
>> unify this with the similar logic in tun_get_user().
> So that ends up looking something like this (incremental).
>
> Note the '/* XXX: frags && */' part in tun_skb_set_protocol(), because
> the 'frags &&' was there in tun_get_user() and it wasn't obvious to me
> whether I should be lifting that out as a separate argument to
> tun_skb_set_protocol() or if there's a better way.
>
> Either way, in my judgement this is less suitable for a stable fix and
> more appropriate for a follow-on cleanup. But I don't feel that
> strongly; I'm more than happy for you to overrule me on that.
> Especially if you fix the above XXX part while you're at it :)


By simply adding a boolean "pull" argument to tun_skb_set_protocol()?

Thanks


>
> I tested this with vhost-net and !IFF_NO_PI, and TX works. RX is still
> hosed on the vhost-net side, for the same reason that a bunch of test
> cases were already listed in #if 0, but I'll address that in a separate
> email. It's not part of *this* patch.
>
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1641,6 +1641,40 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>   	return NULL;
>   }
>   
> +static int tun_skb_set_protocol(struct tun_struct *tun, struct sk_buff *skb,
> +				__be16 pi_proto)
> +{
> +	switch (tun->flags & TUN_TYPE_MASK) {
> +	case IFF_TUN:
> +		if (tun->flags & IFF_NO_PI) {
> +			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
> +
> +			switch (ip_version) {
> +			case 4:
> +				pi_proto = htons(ETH_P_IP);
> +				break;
> +			case 6:
> +				pi_proto = htons(ETH_P_IPV6);
> +				break;
> +			default:
> +				return -EINVAL;
> +			}
> +		}
> +
> +		skb_reset_mac_header(skb);
> +		skb->protocol = pi_proto;
> +		skb->dev = tun->dev;
> +		break;
> +	case IFF_TAP:
> +		if (/* XXX frags && */!pskb_may_pull(skb, ETH_HLEN))
> +			return -ENOMEM;
> +
> +		skb->protocol = eth_type_trans(skb, tun->dev);
> +		break;
> +	}
> +	return 0;
> +}
> +
>   /* Get packet from user space buffer */
>   static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>   			    void *msg_control, struct iov_iter *from,
> @@ -1784,37 +1818,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>   		return -EINVAL;
>   	}
>   
> -	switch (tun->flags & TUN_TYPE_MASK) {
> -	case IFF_TUN:
> -		if (tun->flags & IFF_NO_PI) {
> -			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
> -
> -			switch (ip_version) {
> -			case 4:
> -				pi.proto = htons(ETH_P_IP);
> -				break;
> -			case 6:
> -				pi.proto = htons(ETH_P_IPV6);
> -				break;
> -			default:
> -				atomic_long_inc(&tun->dev->rx_dropped);
> -				kfree_skb(skb);
> -				return -EINVAL;
> -			}
> -		}
> -
> -		skb_reset_mac_header(skb);
> -		skb->protocol = pi.proto;
> -		skb->dev = tun->dev;
> -		break;
> -	case IFF_TAP:
> -		if (frags && !pskb_may_pull(skb, ETH_HLEN)) {
> -			err = -ENOMEM;
> -			goto drop;
> -		}
> -		skb->protocol = eth_type_trans(skb, tun->dev);
> -		break;
> -	}
> +	err = tun_skb_set_protocol(tun, skb, pi.proto);
> +	if (err)
> +		goto drop;
>   
>   	/* copy skb_ubuf_info for callback when skb has no error */
>   	if (zerocopy) {
> @@ -2334,8 +2340,10 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	struct virtio_net_hdr *gso = NULL;
>   	struct bpf_prog *xdp_prog;
>   	struct sk_buff *skb = NULL;
> +	__be16 proto = 0;
>   	u32 rxhash = 0, act;
>   	int buflen = hdr->buflen;
> +	int reservelen = xdp->data - xdp->data_hard_start;
>   	int err = 0;
>   	bool skb_xdp = false;
>   	struct page *page;
> @@ -2343,6 +2351,17 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	if (tun->flags & IFF_VNET_HDR)
>   		gso = &hdr->gso;
>   
> +	if (!(tun->flags & IFF_NO_PI)) {
> +		struct tun_pi *pi = xdp->data;
> +		if (datasize < sizeof(*pi)) {
> +			atomic_long_inc(&tun->rx_frame_errors);
> +			return  -EINVAL;
> +		}
> +		proto = pi->proto;
> +		reservelen += sizeof(*pi);
> +		datasize -= sizeof(*pi);
> +	}
> +
>   	xdp_prog = rcu_dereference(tun->xdp_prog);
>   	if (xdp_prog) {
>   		if (gso && gso->gso_type) {
> @@ -2388,8 +2407,8 @@ static int tun_xdp_one(struct tun_struct *tun,
>   		goto out;
>   	}
>   
> -	skb_reserve(skb, xdp->data - xdp->data_hard_start);
> -	skb_put(skb, xdp->data_end - xdp->data);
> +	skb_reserve(skb, reservelen);
> +	skb_put(skb, datasize);
>   
>   	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
>   		atomic_long_inc(&tun->rx_frame_errors);
> @@ -2397,48 +2416,12 @@ static int tun_xdp_one(struct tun_struct *tun,
>   		err = -EINVAL;
>   		goto out;
>   	}
> -	switch (tun->flags & TUN_TYPE_MASK) {
> -	case IFF_TUN:
> -		if (tun->flags & IFF_NO_PI) {
> -			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
>   
> -			switch (ip_version) {
> -			case 4:
> -				skb->protocol = htons(ETH_P_IP);
> -				break;
> -			case 6:
> -				skb->protocol = htons(ETH_P_IPV6);
> -				break;
> -			default:
> -				atomic_long_inc(&tun->dev->rx_dropped);
> -				kfree_skb(skb);
> -				err = -EINVAL;
> -				goto out;
> -			}
> -		} else {
> -			struct tun_pi *pi = (struct tun_pi *)skb->data;
> -			if (!pskb_may_pull(skb, sizeof(*pi))) {
> -				atomic_long_inc(&tun->dev->rx_dropped);
> -				kfree_skb(skb);
> -				err = -ENOMEM;
> -				goto out;
> -			}
> -			skb_pull_inline(skb, sizeof(*pi));
> -			skb->protocol = pi->proto;
> -		}
> -
> -		skb_reset_mac_header(skb);
> -		skb->dev = tun->dev;
> -		break;
> -	case IFF_TAP:
> -		if (!pskb_may_pull(skb, ETH_HLEN)) {
> -			atomic_long_inc(&tun->dev->rx_dropped);
> -			kfree_skb(skb);
> -			err = -ENOMEM;
> -			goto out;
> -		}
> -		skb->protocol = eth_type_trans(skb, tun->dev);
> -		break;
> +	err = tun_skb_set_protocol(tun, skb, proto);
> +	if (err) {
> +		atomic_long_inc(&tun->dev->rx_dropped);
> +		kfree_skb(skb);
> +		goto out;
>   	}
>   
>   	skb_reset_network_header(skb);
>



* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23 22:52         ` David Woodhouse
@ 2021-06-24  6:37           ` Jason Wang
  2021-06-24  7:23             ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-24  6:37 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/24 06:52, David Woodhouse wrote:
> On Wed, 2021-06-23 at 18:31 +0100, David Woodhouse wrote:
>> Joy... that's wrong because when tun does both the PI and the vnet
>> headers, the PI header comes *first*. When tun does only PI and vhost
>> does the vnet headers, they come in the other order.
>>
>> Will fix (and adjust the test cases to cope).
>
> I got this far, pushed to
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vhost-net
>
> All the test cases are now passing. I don't guarantee I haven't
> actually broken qemu and IFF_TAP mode though, mind you :)


No problem, but it would be easier for me if you could post another
version of the series.


>
> I'll need to refactor the intermediate commits a little so I won't
> repost the series quite yet, but figured I should at least show what I
> have for comments, as my day ends and yours begins.
>
>
> As discussed, I expanded tun_get_socket()/tap_get_socket() to return
> the actual header length instead of letting vhost make wild guesses.


This probably won't work, since we have TUNSETVNETHDRSZ.

I agree the vhost code is tricky, since it assumes only two kinds of
hdr length.

But that is basically how it has worked for the past 10 years. It
depends on userspace (Qemu) coordinating with the TUN/TAP device through
TUNSETVNETHDRSZ during feature negotiation.


> Note that in doing so, I have made tun_get_socket() return -ENOTCONN if
> the tun fd *isn't* actually attached (TUNSETIFF) to a real device yet.


Any reason for doing this? Note that the socket is loosely coupled with 
the networking device.


>
> I moved the sanity check back to tun/tap instead of doing it in
> vhost_net_build_xdp(), because the latter has no clue about the tun PI
> header and doesn't know *where* the virtio header is.


Right, that deserves a separate patch.


>
>

[...]


>   	mutex_unlock(&n->dev.mutex);
> diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
> index 915a187cfabd..b460ba98f34e 100644
> --- a/include/linux/if_tap.h
> +++ b/include/linux/if_tap.h
> @@ -3,14 +3,14 @@
>   #define _LINUX_IF_TAP_H_
>   
>   #if IS_ENABLED(CONFIG_TAP)
> -struct socket *tap_get_socket(struct file *);
> +struct socket *tap_get_socket(struct file *, size_t *);
>   struct ptr_ring *tap_get_ptr_ring(struct file *file);
>   #else
>   #include <linux/err.h>
>   #include <linux/errno.h>
>   struct file;
>   struct socket;
> -static inline struct socket *tap_get_socket(struct file *f)
> +static inline struct socket *tap_get_socket(struct file *f, size_t *)
>   {
>   	return ERR_PTR(-EINVAL);
>   }
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 2a7660843444..8d78b6bbc228 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -21,11 +21,10 @@ struct tun_msg_ctl {
>   
>   struct tun_xdp_hdr {
>   	int buflen;
> -	struct virtio_net_hdr gso;


Any reason for doing this? I mean, it can work, but we need to limit the
changes that are unrelated to the fixes.

Thanks




* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-24  6:18       ` Jason Wang
@ 2021-06-24  7:05         ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-24  7:05 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Thu, 2021-06-24 at 14:18 +0800, Jason Wang wrote:
> On 2021/6/23 21:52, David Woodhouse wrote:
> > On Wed, 2021-06-23 at 11:45 +0800, Jason Wang wrote:
> > > As replied in previous version, it would be better if we can unify
> > > similar logic in tun_get_user().
> > 
> > So that ends up looking something like this (incremental).
> > 
> > Note the '/* XXX: frags && */' part in tun_skb_set_protocol(), because
> > the 'frags &&' was there in tun_get_user() and it wasn't obvious to me
> > whether I should be lifting that out as a separate argument to
> > tun_skb_set_protocol() or if there's a better way.
> > 
> > Either way, in my judgement this is less suitable for a stable fix and
> > more appropriate for a follow-on cleanup. But I don't feel that
> > strongly; I'm more than happy for you to overrule me on that.
> > Especially if you fix the above XXX part while you're at it :)
> 
> 
> By simply adding a boolean "pull" argument to tun_skb_set_protocol()?

Sure; thanks. It's been a few years since I really played with skb
handling; I was half hoping for a simpler "you don't need to
because..." answer, but that works :)




* Re: [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-24  6:37           ` Jason Wang
@ 2021-06-24  7:23             ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-24  7:23 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Thu, 2021-06-24 at 14:37 +0800, Jason Wang wrote:
> On 2021/6/24 06:52, David Woodhouse wrote:
> > On Wed, 2021-06-23 at 18:31 +0100, David Woodhouse wrote:
> > > Joy... that's wrong because when tun does both the PI and the vnet
> > > headers, the PI header comes *first*. When tun does only PI and vhost
> > > does the vnet headers, they come in the other order.
> > > 
> > > Will fix (and adjust the test cases to cope).
> > 
> > I got this far, pushed to
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vhost-net
> > 
> > All the test cases are now passing. I don't guarantee I haven't
> > actually broken qemu and IFF_TAP mode though, mind you :)
> 
> 
> No problem, but it would be easier for me if you can post another 
> version of the series.

Ack; I'm reworking it now into a saner series. All three of my initial
simple fixes ended up with more changes once I expanded the test cases
to cover more permutations of PI/XDP/headers :)

> > As discussed, I expanded tun_get_socket()/tap_get_socket() to return
> > the actual header length instead of letting vhost make wild guesses.
> 
> 
> This probably won't work since we had TUNSETVNETHDRSZ.

Or indeed IFF_NO_PI.

> I agree the vhost codes is tricky since it assumes only two kinds of the 
> hdr length.
> 
> But it was basically how it works for the past 10 years. It depends on 
> the userspace (Qemu) to coordinate it with the TUN/TAP through 
> TUNSETVNETHDRSZ during the feature negotiation.

I think that in any given situation, the kernel should either work
correctly, or gracefully refuse to set it up.

My patch set will make it work correctly for all the permutations I've
looked at. I would accept an answer of "screw that, just make
tun_get_socket() return failure if IFF_NO_PI isn't set", for example.

> > Note that in doing so, I have made tun_get_socket() return -ENOTCONN if
> > the tun fd *isn't* actually attached (TUNSETIFF) to a real device yet.
> 
> Any reason for doing this? Note that the socket is loosely coupled with 
> the networking device.

Because to determine the sock_hlen to return, it needs to look at the
tun->flags and tun->vnet_hdr_sz fields. And if there isn't an actual tun
device attached, it can't.

> 
> > 
> > I moved the sanity check back to tun/tap instead of doing it in
> > vhost_net_build_xdp(), because the latter has no clue about the tun PI
> > header and doesn't know *where* the virtio header is.
> 
> 
> Right, the deserves a separate patch.

Yep, in my tree it has one, but it's a bit mixed in with other fixes
until I do that refactoring. 

> > diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> > index 2a7660843444..8d78b6bbc228 100644
> > --- a/include/linux/if_tun.h
> > +++ b/include/linux/if_tun.h
> > @@ -21,11 +21,10 @@ struct tun_msg_ctl {
> >   
> >   struct tun_xdp_hdr {
> >   	int buflen;
> > -	struct virtio_net_hdr gso;
> 
> 
> Any reason for doing this? I meant it can work but we need limit the 
> changes that is unrelated to the fixes.

That's part of the patch that moves the sanity check back to tun/tap.
As I said it needs a little reworking, so it currently contains a
little bit of cleanup to previous code in tun_xdp_one(), but it looks
like this. The bit in drivers/vhost/net.c is obviously removing code
that I'd made conditional in a previous patch, so that will change
somewhat as I rework the series and drop the original patch.

From 2a0080f37244ec6dac8fb3e8330f9153a4373cfd Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed, 23 Jun 2021 23:32:00 +0100
Subject: [PATCH 10/10] net: remove virtio_net_hdr from struct tun_xdp_hdr

The tun device puts its struct tun_pi *before* the virtio_net_hdr, which
significantly complicates letting vhost validate it. Just let tap and
tun validate it for themselves, as they do in the non-XDP case anyway.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tap.c      | 25 ++++++++++++++++++++++---
 drivers/net/tun.c      | 34 ++++++++++++++++++++++++----------
 drivers/vhost/net.c    | 15 +--------------
 include/linux/if_tun.h |  1 -
 4 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 2170a0d3d34c..d1b1f1de374e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1132,16 +1132,35 @@ static const struct file_operations tap_fops = {
 static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 {
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr *gso = NULL;
 	int buflen = hdr->buflen;
 	int vnet_hdr_len = 0;
 	struct tap_dev *tap;
 	struct sk_buff *skb;
 	int err, depth;
 
-	if (q->flags & IFF_VNET_HDR)
+	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
+		if (xdp->data != xdp->data_hard_start + sizeof(*hdr) + vnet_hdr_len) {
+			err = -EINVAL;
+			goto err;
+		}
+
+		gso = (void *)&hdr[1];
 
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		     tap16_to_cpu(q, gso->csum_start) +
+		     tap16_to_cpu(q, gso->csum_offset) + 2 >
+			     tap16_to_cpu(q, gso->hdr_len))
+			gso->hdr_len = cpu_to_tap16(q,
+				 tap16_to_cpu(q, gso->csum_start) +
+				 tap16_to_cpu(q, gso->csum_offset) + 2);
+
+		if (tap16_to_cpu(q, gso->hdr_len) > xdp->data_end - xdp->data) {
+			err = -EINVAL;
+			goto err;
+		}
+	}
 	skb = build_skb(xdp->data_hard_start, buflen);
 	if (!skb) {
 		err = -ENOMEM;
@@ -1155,7 +1174,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 	skb_reset_mac_header(skb);
 	skb->protocol = eth_hdr(skb)->h_proto;
 
-	if (vnet_hdr_len) {
+	if (gso) {
 		err = virtio_net_hdr_to_skb(skb, gso, tap_is_little_endian(q));
 		if (err)
 			goto err_kfree;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 69f6ce87b109..72f8a04f493b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2337,29 +2337,43 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
+	void *tun_hdr = &hdr[1];
 	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
 	__be16 proto = 0;
 	u32 rxhash = 0, act;
 	int buflen = hdr->buflen;
-	int reservelen = xdp->data - xdp->data_hard_start;
 	int err = 0;
 	bool skb_xdp = false;
 	struct page *page;
 
-	if (tun->flags & IFF_VNET_HDR)
-		gso = &hdr->gso;
-
 	if (!(tun->flags & IFF_NO_PI)) {
-		struct tun_pi *pi = xdp->data;
-		if (datasize < sizeof(*pi)) {
+		struct tun_pi *pi = tun_hdr;
+		tun_hdr += sizeof(*pi);
+
+		if (tun_hdr > xdp->data) {
 			atomic_long_inc(&tun->rx_frame_errors);
-			return  -EINVAL;
+			return -EINVAL;
 		}
 		proto = pi->proto;
-		reservelen += sizeof(*pi);
-		datasize -= sizeof(*pi);
+	}
+
+	if (tun->flags & IFF_VNET_HDR) {
+		gso = tun_hdr;
+		tun_hdr += sizeof(*gso);
+
+		if (tun_hdr > xdp->data) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return -EINVAL;
+		}
+
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		    tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2 > tun16_to_cpu(tun, gso->hdr_len))
+			gso->hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2);
+
+		if (tun16_to_cpu(tun, gso->hdr_len) > datasize)
+			return -EINVAL;
 	}
 
 	xdp_prog = rcu_dereference(tun->xdp_prog);
@@ -2407,7 +2421,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 		goto out;
 	}
 
-	skb_reserve(skb, reservelen);
+	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, datasize);
 
 	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e88cc18d079f..d9491c620a9c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -716,26 +716,13 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
 	hdr = buf;
 	if (sock_hlen) {
-		struct virtio_net_hdr *gso = &hdr->gso;
-
 		copied = copy_page_from_iter(alloc_frag->page,
 					     alloc_frag->offset +
-					     offsetof(struct tun_xdp_hdr, gso),
+					     sizeof(struct tun_xdp_hdr),
 					     sock_hlen, from);
 		if (copied != sock_hlen)
 			return -EFAULT;
 
-		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-		    vhost16_to_cpu(vq, gso->csum_start) +
-		    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
-		    vhost16_to_cpu(vq, gso->hdr_len)) {
-			gso->hdr_len = cpu_to_vhost16(vq,
-						      vhost16_to_cpu(vq, gso->csum_start) +
-						      vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
-			if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-				return -EINVAL;
-		}
 		len -= sock_hlen;
 	}
 
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 8a7debd3f663..8d78b6bbc228 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,7 +21,6 @@ struct tun_msg_ctl {
 
 struct tun_xdp_hdr {
 	int buflen;
-	struct virtio_net_hdr gso;
 };
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
-- 
2.17.1





* Re: [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-24  6:12         ` Jason Wang
@ 2021-06-24 10:42           ` David Woodhouse
  2021-06-25  2:55             ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 10:42 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez


On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> On 2021/6/24 00:12, David Woodhouse wrote:
> > We *should* eventually expand this test case to attach an AF_PACKET
> > device to the vhost-net, instead of using a tun device as the back end.
> > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
> 
> 
> It's just because nobody wrote the code. And we're lacking a real use
> case.

Hm, what code? For AF_PACKET I haven't actually spotted that there *is*
any.

As I've been refactoring the interaction between vhost and tun/tap, and
fixing it up for different vhdr lengths, PI, and (just now) frowning in
horror at the concept that tun and vhost can have *different*
endiannesses, I hadn't spotted that there was anything special on the
packet socket. For that case, sock_hlen is just zero and we
send/receive plain packets... or so I thought? Did I miss something?

As far as I was aware, that ought to have worked with any datagram
socket. I was pondering not just AF_UNIX but also UDP (since that's my
main transport for VPN data, at least in the case where I care about
performance).

An interesting use case for a non-packet socket might be to establish a
tunnel. A guest's virtio-net device is just connected to a UDP socket
on the host, and that tunnels all their packets to/from a remote
endpoint which is where that guest is logically connected to the
network. It might be useful for live migration cases, perhaps?

I don't have an overriding desire to *make* it work, mind you; I just
wanted to make sure I understand *why* it doesn't, if indeed it
doesn't. As far as I could tell, it *should* work if we just dropped
the check?

> Vhost_net is basically used for accepting packets from userspace into
> the kernel networking stack.
> 
> Using AF_UNIX makes it look more like a case of inter-process
> communication (without the vnet header it won't be used efficiently by a
> VM). In that case, using io_uring is much more suitable.
> 
> Or, thinking about it another way: instead of depending on vhost_net, we
> could expose the TUN/TAP socket to userspace, so that io_uring could be
> used for the OpenConnect case as well?

That would work, I suppose. Although as noted, I *can* use vhost_net
with tun today from userspace as long as I disable XDP and PI, and use
a virtio_net_hdr that I don't really want. (And pull a value for
TASK_SIZE out of my posterior; qv.)

So I *can* ship a version of OpenConnect that works on existing
production kernels with those workarounds, and I'm fixing up the other
permutations of vhost/tun stuff in the kernel just because I figured we
*should*.

If I'm going to *require* new kernel support for OpenConnect then I
might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
have the data packets never go to userspace in the first place.


> 
> > 
> > 
> > > > +	/*
> > > > +	 * I just want to map the *whole* of userspace address space. But
> > > > +	 * from userspace I don't know what that is. On x86_64 it would be:
> > > > +	 *
> > > > +	 * vmem->regions[0].guest_phys_addr = 4096;
> > > > +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> > > > +	 * vmem->regions[0].userspace_addr = 4096;
> > > > +	 *
> > > > +	 * For now, just ensure we put everything inside a single BSS region.
> > > > +	 */
> > > > +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > > > +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > > > +	vmem->regions[0].memory_size = sizeof(rings);
> > > 
> > > Instead of doing tricks like this, we can do it in another way:
> > > 
> > > 1) enable device IOTLB
> > > 2) wait for the IOTLB miss request (iova, len) and update identity
> > > mapping accordingly
> > > 
> > > This should work for all the archs (with some performance hit).
> > 
> > Ick. For my actual application (OpenConnect) I'm either going to suck
> > it up and put in the arch-specific limits like in the comment above, or
> > I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> > about elsewhere.
> 
> 
> The feature could be useful for the case of vhost-vDPA as well.
> 
> 
> >   (Probably the former, since if I'm requiring kernel
> > changes then I have grander plans around extending AF_TLS to do DTLS,
> > then hooking that directly up to the tun socket via BPF and a sockmap
> > without the data frames ever going to userspace at all.)
> 
> 
> Ok, I guess we need to make sockmap works for tun socket.

Hm, I need to work out the ideal data flow here. I don't know if
sendmsg() / recvmsg() on the tun socket are even what I want, for full
zero-copy.

In the case where the NIC supports encryption, we want true zero-copy
from the moment the "encrypted" packet arrives over UDP on the public
network, through the DTLS processing and seqno checking, to being
processed as netif_receive_skb() on the tun device.

Likewise skbs from tun_net_xmit() should have the appropriate DTLS and
IP/UDP headers prepended to them and that *same* skb (or at least the
same frags) should be handed to the NIC to encrypt and send.

In the case where we have software crypto in the kernel, we can
tolerate precisely *one* copy because the crypto doesn't have to be
done in-place, so moving from the input to the output crypto buffers
can be that one "copy", and we can use it to move data around (from the
incoming skb to the outgoing skb) if we need to.

Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
being *given* ownership of the buffers, rather than copying from them.
Or just being given a skb and pushing/pulling their headers.

I'm looking at skb_send_sock() and it *doesn't* seem to support "just
steal the frags from the initial skb and give them to the new one", but
there may be ways to make that work.

> > I think I can fix *all* those test cases by making tun_get_socket()
> > take an extra 'int *' argument, and use that to return the *actual*
> > value of sock_hlen. Here's the updated test case in the meantime:
> 
> 
> It would be better if you could post a new version of the whole series
> to ease reviewing.

Yep. I was working on that... until I got even more distracted by
looking at how we can do the true in-kernel zero-copy option ;)




* [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
  2021-06-19 13:33 [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
  2021-06-21  7:00 ` Jason Wang
  2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
@ 2021-06-24 12:30 ` David Woodhouse
  2021-06-24 12:30   ` [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
                     ` (5 more replies)
  2 siblings, 6 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez, Willem de Bruijn

From: David Woodhouse <dwmw@amazon.co.uk>

The vhost-net driver was making wild assumptions about the header length
of the underlying tun/tap socket. Then it was discarding packets if
the number of bytes it got from sock_recvmsg() didn't precisely match
its guess.

Fix it to get the correct information along with the socket itself.
As a side-effect, this means that tun_get_socket() won't work if the
tun file isn't actually connected to a device, since there's no 'tun'
yet in that case to get the information from.

On the receive side, where the tun device generates the virtio_net_hdr
but VIRTIO_NET_F_MRG_RXBUF was negotiated and vhost-net needs to fill
in the 'num_buffers' field on top of the existing virtio_net_hdr, fix
that to use 'sock_hlen - 2' as the location, which means that it goes
in the right place regardless of whether the tun device is using an
additional tun_pi header or not. In this case, the user should have
configured the tun device with a vnet hdr size of 12, to make room.

Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tap.c      |  5 ++++-
 drivers/net/tun.c      | 16 +++++++++++++++-
 drivers/vhost/net.c    | 31 +++++++++++++++----------------
 include/linux/if_tap.h |  4 ++--
 include/linux/if_tun.h |  4 ++--
 5 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 8e3a28ba6b28..2170a0d3d34c 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1246,7 +1246,7 @@ static const struct proto_ops tap_socket_ops = {
  * attached to a device.  The returned object works like a packet socket, it
  * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
  * holding a reference to the file for as long as the socket is in use. */
-struct socket *tap_get_socket(struct file *file)
+struct socket *tap_get_socket(struct file *file, size_t *hlen)
 {
 	struct tap_queue *q;
 	if (file->f_op != &tap_fops)
@@ -1254,6 +1254,9 @@ struct socket *tap_get_socket(struct file *file)
 	q = file->private_data;
 	if (!q)
 		return ERR_PTR(-EBADFD);
+	if (hlen)
+		*hlen = (q->flags & IFF_VNET_HDR) ? q->vnet_hdr_sz : 0;
+
 	return &q->sock;
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4cf38be26dc9..67b406fa0881 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3649,7 +3649,7 @@ static void tun_cleanup(void)
  * attached to a device.  The returned object works like a packet socket, it
  * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
  * holding a reference to the file for as long as the socket is in use. */
-struct socket *tun_get_socket(struct file *file)
+struct socket *tun_get_socket(struct file *file, size_t *hlen)
 {
 	struct tun_file *tfile;
 	if (file->f_op != &tun_fops)
@@ -3657,6 +3657,20 @@ struct socket *tun_get_socket(struct file *file)
 	tfile = file->private_data;
 	if (!tfile)
 		return ERR_PTR(-EBADFD);
+
+	if (hlen) {
+		struct tun_struct *tun = tun_get(tfile);
+		size_t len = 0;
+
+		if (!tun)
+			return ERR_PTR(-ENOTCONN);
+		if (tun->flags & IFF_VNET_HDR)
+			len += READ_ONCE(tun->vnet_hdr_sz);
+		if (!(tun->flags & IFF_NO_PI))
+			len += sizeof(struct tun_pi);
+		tun_put(tun);
+		*hlen = len;
+	}
 	return &tfile->socket;
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..b92a7144ed90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)
 
 	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
-	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
+	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&
+		(vhost_hlen || sock_hlen >= sizeof(num_buffers));
 
 	do {
 		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
@@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)
 			}
 		} else {
 			/* Header came from socket; we'll need to patch
-			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
+			 * ->num_buffers over the last two bytes if
+			 * VIRTIO_NET_F_MRG_RXBUF is enabled.
 			 */
-			iov_iter_advance(&fixup, sizeof(hdr));
+			iov_iter_advance(&fixup, sock_hlen - 2);
 		}
 		/* TODO: Should check and handle checksum. */
 
@@ -1420,7 +1422,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	return 0;
 }
 
-static struct socket *get_raw_socket(int fd)
+static struct socket *get_raw_socket(int fd, size_t *hlen)
 {
 	int r;
 	struct socket *sock = sockfd_lookup(fd, &r);
@@ -1438,6 +1440,7 @@ static struct socket *get_raw_socket(int fd)
 		r = -EPFNOSUPPORT;
 		goto err;
 	}
+	*hlen = 0;
 	return sock;
 err:
 	sockfd_put(sock);
@@ -1463,33 +1466,33 @@ static struct ptr_ring *get_tap_ptr_ring(int fd)
 	return ring;
 }
 
-static struct socket *get_tap_socket(int fd)
+static struct socket *get_tap_socket(int fd, size_t *hlen)
 {
 	struct file *file = fget(fd);
 	struct socket *sock;
 
 	if (!file)
 		return ERR_PTR(-EBADF);
-	sock = tun_get_socket(file);
+	sock = tun_get_socket(file, hlen);
 	if (!IS_ERR(sock))
 		return sock;
-	sock = tap_get_socket(file);
+	sock = tap_get_socket(file, hlen);
 	if (IS_ERR(sock))
 		fput(file);
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_socket(int fd, size_t *hlen)
 {
 	struct socket *sock;
 
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
-	sock = get_raw_socket(fd);
+	sock = get_raw_socket(fd, hlen);
 	if (!IS_ERR(sock))
 		return sock;
-	sock = get_tap_socket(fd);
+	sock = get_tap_socket(fd, hlen);
 	if (!IS_ERR(sock))
 		return sock;
 	return ERR_PTR(-ENOTSOCK);
@@ -1521,7 +1524,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(fd, &nvq->sock_hlen);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
@@ -1621,7 +1624,7 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
-	size_t vhost_hlen, sock_hlen, hdr_len;
+	size_t vhost_hlen, hdr_len;
 	int i;
 
 	hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
@@ -1631,11 +1634,8 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
-		sock_hlen = 0;
 	} else {
-		/* socket provides vnet_hdr */
 		vhost_hlen = 0;
-		sock_hlen = hdr_len;
 	}
 	mutex_lock(&n->dev.mutex);
 	if ((features & (1 << VHOST_F_LOG_ALL)) &&
@@ -1651,7 +1651,6 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 		mutex_lock(&n->vqs[i].vq.mutex);
 		n->vqs[i].vq.acked_features = features;
 		n->vqs[i].vhost_hlen = vhost_hlen;
-		n->vqs[i].sock_hlen = sock_hlen;
 		mutex_unlock(&n->vqs[i].vq.mutex);
 	}
 	mutex_unlock(&n->dev.mutex);
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 915a187cfabd..b460ba98f34e 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,14 +3,14 @@
 #define _LINUX_IF_TAP_H_
 
 #if IS_ENABLED(CONFIG_TAP)
-struct socket *tap_get_socket(struct file *);
+struct socket *tap_get_socket(struct file *, size_t *);
 struct ptr_ring *tap_get_ptr_ring(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
 struct file;
 struct socket;
-static inline struct socket *tap_get_socket(struct file *f)
+static inline struct socket *tap_get_socket(struct file *f, size_t *hlen)
 {
 	return ERR_PTR(-EINVAL);
 }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 2a7660843444..8a7debd3f663 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -25,7 +25,7 @@ struct tun_xdp_hdr {
 };
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
-struct socket *tun_get_socket(struct file *);
+struct socket *tun_get_socket(struct file *, size_t *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -45,7 +45,7 @@ void tun_ptr_free(void *ptr);
 #include <linux/errno.h>
 struct file;
 struct socket;
-static inline struct socket *tun_get_socket(struct file *f)
+static inline struct socket *tun_get_socket(struct file *f, size_t *hlen)
 {
 	return ERR_PTR(-EINVAL);
 }
-- 
2.31.1



* [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
@ 2021-06-24 12:30   ` David Woodhouse
  2021-06-25  6:58     ` Jason Wang
  2021-06-24 12:30   ` [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves David Woodhouse
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez, Willem de Bruijn

From: David Woodhouse <dwmw@amazon.co.uk>

Sometimes it's just a data packet. The virtio_net_hdr processing should be
conditional on IFF_VNET_HDR, just as it is in tun_get_user().

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tun.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 67b406fa0881..9acd448e6dfc 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2331,7 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
 	u32 rxhash = 0, act;
@@ -2340,9 +2340,12 @@ static int tun_xdp_one(struct tun_struct *tun,
 	bool skb_xdp = false;
 	struct page *page;
 
+	if (tun->flags & IFF_VNET_HDR)
+		gso = &hdr->gso;
+
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
-		if (gso->gso_type) {
+		if (gso && gso->gso_type) {
 			skb_xdp = true;
 			goto build;
 		}
@@ -2388,7 +2391,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
 
-	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
 		kfree_skb(skb);
 		err = -EINVAL;
-- 
2.31.1



* [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
  2021-06-24 12:30   ` [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
@ 2021-06-24 12:30   ` David Woodhouse
  2021-06-25  7:33     ` Jason Wang
  2021-06-24 12:30   ` [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez, Willem de Bruijn

From: David Woodhouse <dwmw@amazon.co.uk>

When the underlying socket isn't configured with a virtio_net_hdr, the
existing code in vhost_net_build_xdp() would attempt to validate
uninitialised data by copying zero bytes (sock_hlen) into the local
copy of the header and then trying to validate that.

Fixing it is somewhat non-trivial because the tun device might put a
struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
So just stop messing with someone else's data in vhost_net_build_xdp(),
and let tap and tun validate it for themselves, as they do in the
non-XDP case anyway.

This means that the 'gso' member of struct tun_xdp_hdr can die, leaving
only 'int buflen'.

The socket header of sock_hlen is still copied separately from the
data payload because there may be a gap between them to ensure suitable
alignment of the latter.

Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tap.c      | 25 ++++++++++++++++++++++---
 drivers/net/tun.c      | 21 ++++++++++++++++++---
 drivers/vhost/net.c    | 30 +++++++++---------------------
 include/linux/if_tun.h |  1 -
 4 files changed, 49 insertions(+), 28 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 2170a0d3d34c..d1b1f1de374e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1132,16 +1132,35 @@ static const struct file_operations tap_fops = {
 static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 {
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct virtio_net_hdr *gso = NULL;
 	int buflen = hdr->buflen;
 	int vnet_hdr_len = 0;
 	struct tap_dev *tap;
 	struct sk_buff *skb;
 	int err, depth;
 
-	if (q->flags & IFF_VNET_HDR)
+	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
+		if (xdp->data != xdp->data_hard_start + sizeof(*hdr) + vnet_hdr_len) {
+			err = -EINVAL;
+			goto err;
+		}
+
+		gso = (void *)&hdr[1];
 
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		     tap16_to_cpu(q, gso->csum_start) +
+		     tap16_to_cpu(q, gso->csum_offset) + 2 >
+			     tap16_to_cpu(q, gso->hdr_len))
+			gso->hdr_len = cpu_to_tap16(q,
+				 tap16_to_cpu(q, gso->csum_start) +
+				 tap16_to_cpu(q, gso->csum_offset) + 2);
+
+		if (tap16_to_cpu(q, gso->hdr_len) > xdp->data_end - xdp->data) {
+			err = -EINVAL;
+			goto err;
+		}
+	}
 	skb = build_skb(xdp->data_hard_start, buflen);
 	if (!skb) {
 		err = -ENOMEM;
@@ -1155,7 +1174,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 	skb_reset_mac_header(skb);
 	skb->protocol = eth_hdr(skb)->h_proto;
 
-	if (vnet_hdr_len) {
+	if (gso) {
 		err = virtio_net_hdr_to_skb(skb, gso, tap_is_little_endian(q));
 		if (err)
 			goto err_kfree;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9acd448e6dfc..1b553f79adb0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2331,6 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
 	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
+	void *tun_hdr = &hdr[1];
 	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
@@ -2340,8 +2341,22 @@ static int tun_xdp_one(struct tun_struct *tun,
 	bool skb_xdp = false;
 	struct page *page;
 
-	if (tun->flags & IFF_VNET_HDR)
-		gso = &hdr->gso;
+	if (tun->flags & IFF_VNET_HDR) {
+		gso = tun_hdr;
+		tun_hdr += sizeof(*gso);
+
+		if (tun_hdr > xdp->data) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return -EINVAL;
+		}
+
+		if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		    tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2 > tun16_to_cpu(tun, gso->hdr_len))
+			gso->hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2);
+
+		if (tun16_to_cpu(tun, gso->hdr_len) > datasize)
+			return -EINVAL;
+	}
 
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
@@ -2389,7 +2404,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 	}
 
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
-	skb_put(skb, xdp->data_end - xdp->data);
+	skb_put(skb, datasize);
 
 	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		atomic_long_inc(&tun->rx_frame_errors);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b92a7144ed90..7cae18151c60 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -690,7 +690,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
 	struct page_frag *alloc_frag = &net->page_frag;
-	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
 	size_t len = iov_iter_count(from);
@@ -715,29 +714,18 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOMEM;
 
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
-
 	hdr = buf;
-	gso = &hdr->gso;
-
-	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-	    vhost16_to_cpu(vq, gso->csum_start) +
-	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
-	    vhost16_to_cpu(vq, gso->hdr_len)) {
-		gso->hdr_len = cpu_to_vhost16(vq,
-			       vhost16_to_cpu(vq, gso->csum_start) +
-			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
-		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-			return -EINVAL;
+	if (sock_hlen) {
+		copied = copy_page_from_iter(alloc_frag->page,
+					     alloc_frag->offset +
+					     sizeof(struct tun_xdp_hdr),
+					     sock_hlen, from);
+		if (copied != sock_hlen)
+			return -EFAULT;
+
+		len -= sock_hlen;
 	}
 
-	len -= sock_hlen;
 	copied = copy_page_from_iter(alloc_frag->page,
 				     alloc_frag->offset + pad,
 				     len, from);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 8a7debd3f663..8d78b6bbc228 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,7 +21,6 @@ struct tun_msg_ctl {
 
 struct tun_xdp_hdr {
 	int buflen;
-	struct virtio_net_hdr gso;
 };
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
-- 
2.31.1



* [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
  2021-06-24 12:30   ` [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
  2021-06-24 12:30   ` [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves David Woodhouse
@ 2021-06-24 12:30   ` David Woodhouse
  2021-06-25  7:41     ` Jason Wang
  2021-06-25 18:43     ` Willem de Bruijn
  2021-06-24 12:30   ` [PATCH v3 5/5] vhost_net: Add self test with tun device David Woodhouse
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez, Willem de Bruijn

From: David Woodhouse <dwmw@amazon.co.uk>

In tun_get_user(), skb->protocol is either taken from the tun_pi header
or inferred from the first byte of the packet in IFF_TUN mode, while
eth_type_trans() is called only in the IFF_TAP mode where the payload
is expected to be an Ethernet frame.

The equivalent code path in tun_xdp_one() was unconditionally using
eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and
corrupts packets.

Pull the logic out to a separate tun_skb_set_protocol() function, and
call it from both tun_get_user() and tun_xdp_one().

XX: It is not entirely clear to me why it's OK to call eth_type_trans()
in some cases without first checking that enough of the Ethernet header
is linearly present by calling pskb_may_pull(). Such a check was never
present in the tun_xdp_one() code path, and commit 96aa1b22bd6bb ("tun:
correct header offsets in napi frags mode") deliberately added it *only*
for the IFF_NAPI_FRAGS mode.

I would like to see specific explanations of *why* it's ever valid and
necessary (is it so much faster?) to skip the pskb_may_pull() by setting
the 'no_pull_check' flag to tun_skb_set_protocol(), but for now I'll
settle for faithfully preserving the existing behaviour and pretending
it's someone else's problem.

Cc: Willem de Bruijn <willemb@google.com>
Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/net/tun.c | 95 +++++++++++++++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 32 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1b553f79adb0..9379fa86fae9 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1641,6 +1641,47 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	return NULL;
 }
 
+static int tun_skb_set_protocol(struct tun_struct *tun, struct sk_buff *skb,
+				__be16 pi_proto, bool no_pull_check)
+{
+	switch (tun->flags & TUN_TYPE_MASK) {
+	case IFF_TUN:
+		if (tun->flags & IFF_NO_PI) {
+			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
+
+			switch (ip_version) {
+			case 4:
+				pi_proto = htons(ETH_P_IP);
+				break;
+			case 6:
+				pi_proto = htons(ETH_P_IPV6);
+				break;
+			default:
+				return -EINVAL;
+			}
+		}
+
+		skb_reset_mac_header(skb);
+		skb->protocol = pi_proto;
+		skb->dev = tun->dev;
+		break;
+	case IFF_TAP:
+		/* As an optimisation, no_pull_check can be set in the cases
+		 * where the caller *knows* that eth_type_trans() will be OK
+		 * because the Ethernet header is linearised at skb->data.
+		 *
+		 * XX: Or so I was reliably assured when I moved this code
+		 * and didn't make it unconditional. dwmw2.
+		 */
+		if (!no_pull_check && !pskb_may_pull(skb, ETH_HLEN))
+			return -ENOMEM;
+
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
+	return 0;
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
@@ -1784,37 +1825,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		return -EINVAL;
 	}
 
-	switch (tun->flags & TUN_TYPE_MASK) {
-	case IFF_TUN:
-		if (tun->flags & IFF_NO_PI) {
-			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
-
-			switch (ip_version) {
-			case 4:
-				pi.proto = htons(ETH_P_IP);
-				break;
-			case 6:
-				pi.proto = htons(ETH_P_IPV6);
-				break;
-			default:
-				atomic_long_inc(&tun->dev->rx_dropped);
-				kfree_skb(skb);
-				return -EINVAL;
-			}
-		}
-
-		skb_reset_mac_header(skb);
-		skb->protocol = pi.proto;
-		skb->dev = tun->dev;
-		break;
-	case IFF_TAP:
-		if (frags && !pskb_may_pull(skb, ETH_HLEN)) {
-			err = -ENOMEM;
-			goto drop;
-		}
-		skb->protocol = eth_type_trans(skb, tun->dev);
-		break;
-	}
+	err = tun_skb_set_protocol(tun, skb, pi.proto, !frags);
+	if (err)
+		goto drop;
 
 	/* copy skb_ubuf_info for callback when skb has no error */
 	if (zerocopy) {
@@ -2335,12 +2348,24 @@ static int tun_xdp_one(struct tun_struct *tun,
 	struct virtio_net_hdr *gso = NULL;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
+	__be16 proto = 0;
 	u32 rxhash = 0, act;
 	int buflen = hdr->buflen;
 	int err = 0;
 	bool skb_xdp = false;
 	struct page *page;
 
+	if (!(tun->flags & IFF_NO_PI)) {
+		struct tun_pi *pi = tun_hdr;
+		tun_hdr += sizeof(*pi);
+
+		if (tun_hdr > xdp->data) {
+			atomic_long_inc(&tun->rx_frame_errors);
+			return -EINVAL;
+		}
+		proto = pi->proto;
+	}
+
 	if (tun->flags & IFF_VNET_HDR) {
 		gso = tun_hdr;
 		tun_hdr += sizeof(*gso);
@@ -2413,7 +2438,13 @@ static int tun_xdp_one(struct tun_struct *tun,
 		goto out;
 	}
 
-	skb->protocol = eth_type_trans(skb, tun->dev);
+	err = tun_skb_set_protocol(tun, skb, proto, true);
+	if (err) {
+		atomic_long_inc(&tun->dev->rx_dropped);
+		kfree_skb(skb);
+		goto out;
+	}
+
 	skb_reset_network_header(skb);
 	skb_probe_transport_header(skb);
 	skb_record_rx_queue(skb, tfile->queue_index);
-- 
2.31.1



* [PATCH v3 5/5] vhost_net: Add self test with tun device
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
                     ` (2 preceding siblings ...)
  2021-06-24 12:30   ` [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
@ 2021-06-24 12:30   ` David Woodhouse
  2021-06-25  5:00   ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() Jason Wang
  2021-06-25 18:13   ` Willem de Bruijn
  5 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: Jason Wang, Eugenio Pérez, Willem de Bruijn

From: David Woodhouse <dwmw@amazon.co.uk>

This creates a tun device and brings it up, and finds out the link-local
address that the kernel automatically assigns to it. It then sets up
vhost-net on the tun device, uses that to send a ping to the kernel's
assigned link-local address, and waits for a reply.

If everything is working correctly, it will get a response and manage to
understand it. If the virtio_net_hdr and other pieces are not working as
expected, then it fails (and times out after 1 second).

The test is repeated in various combinations of vhost-net feature flags,
tun vhdr length, PI enabled, and XDP/non-XDP code paths, most of which
didn't work before the patch series that added this test, but do now.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/vhost/Makefile        |  16 +
 tools/testing/selftests/vhost/config          |   2 +
 .../testing/selftests/vhost/test_vhost_net.c  | 530 ++++++++++++++++++
 4 files changed, 549 insertions(+)
 create mode 100644 tools/testing/selftests/vhost/Makefile
 create mode 100644 tools/testing/selftests/vhost/config
 create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..300c03cfd0c7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@ TARGETS += user
 TARGETS += vDSO
+TARGETS += vhost
 TARGETS += vm
 TARGETS += x86
 TARGETS += zram
 #Please keep the TARGETS list alphabetically sorted
 # Run "make quicktest=1 run_tests" or
diff --git a/tools/testing/selftests/vhost/Makefile b/tools/testing/selftests/vhost/Makefile
new file mode 100644
index 000000000000..f5e565d80733
--- /dev/null
+++ b/tools/testing/selftests/vhost/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+include ../lib.mk
+
+.PHONY: all clean
+
+BINARIES := test_vhost_net
+
+test_vhost_net: test_vhost_net.c ../kselftest.h ../kselftest_harness.h
+	$(CC) $(CFLAGS) -g $< -o $@
+
+TEST_PROGS += $(BINARIES)
+EXTRA_CLEAN := $(BINARIES)
+
+all: $(BINARIES)
diff --git a/tools/testing/selftests/vhost/config b/tools/testing/selftests/vhost/config
new file mode 100644
index 000000000000..6391c1f32c34
--- /dev/null
+++ b/tools/testing/selftests/vhost/config
@@ -0,0 +1,2 @@
+CONFIG_VHOST_NET=y
+CONFIG_TUN=y
diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
new file mode 100644
index 000000000000..747f0e5e4f57
--- /dev/null
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -0,0 +1,530 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+#include "../kselftest_harness.h"
+#include "../../../virtio/asm/barrier.h"
+
+#include <sys/eventfd.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <net/if.h>
+#include <sys/socket.h>
+
+#include <netinet/tcp.h>
+#include <netinet/ip.h>
+#include <netinet/ip_icmp.h>
+#include <netinet/ip6.h>
+#include <netinet/icmp6.h>
+
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+
+static unsigned char hexnybble(char hex)
+{
+	switch (hex) {
+	case '0'...'9':
+		return hex - '0';
+	case 'a'...'f':
+		return 10 + hex - 'a';
+	case 'A'...'F':
+		return 10 + hex - 'A';
+	default:
+		exit (KSFT_SKIP);
+	}
+}
+
+static unsigned char hexchar(char *hex)
+{
+	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
+}
+
+int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr)
+{
+	int tun_fd = open("/dev/net/tun", O_RDWR);
+	if (tun_fd == -1)
+		return -1;
+
+	struct ifreq ifr = { 0 };
+
+	ifr.ifr_flags = IFF_TUN;
+	if (!pi)
+		ifr.ifr_flags |= IFF_NO_PI;
+	if (vnet_hdr_sz)
+		ifr.ifr_flags |= IFF_VNET_HDR;
+
+	if (ioctl(tun_fd, TUNSETIFF, (void *)&ifr) < 0)
+		goto out_tun;
+
+	if (vnet_hdr_sz &&
+	    ioctl(tun_fd, TUNSETVNETHDRSZ, &vnet_hdr_sz) < 0)
+		goto out_tun;
+
+	int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP);
+	if (sockfd == -1)
+		goto out_tun;
+
+	if (ioctl(sockfd, SIOCGIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	ifr.ifr_flags |= IFF_UP;
+	if (ioctl(sockfd, SIOCSIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	close(sockfd);
+
+	FILE *inet6 = fopen("/proc/net/if_inet6", "r");
+	if (!inet6)
+		goto out_tun;
+
+	char buf[80];
+	while (fgets(buf, sizeof(buf), inet6)) {
+		size_t len = strlen(buf), namelen = strlen(ifr.ifr_name);
+		if (!strncmp(buf, "fe80", 4) &&
+		    buf[len - namelen - 2] == ' ' &&
+		    !strncmp(buf + len - namelen - 1, ifr.ifr_name, namelen)) {
+			for (int i = 0; i < 16; i++) {
+				addr->s6_addr[i] = hexchar(buf + i*2);
+			}
+			fclose(inet6);
+			return tun_fd;
+		}
+	}
+	/* Not found */
+	fclose(inet6);
+ out_sock:
+	close(sockfd);
+ out_tun:
+	close(tun_fd);
+	return -1;
+}
+
+#define RING_SIZE 32
+#define RING_MASK(x) ((x) & (RING_SIZE-1))
+
+struct pkt_buf {
+	unsigned char data[2048];
+};
+
+struct test_vring {
+	struct vring_desc desc[RING_SIZE];
+	struct vring_avail avail;
+	__virtio16 avail_ring[RING_SIZE];
+	struct vring_used used;
+	struct vring_used_elem used_ring[RING_SIZE];
+	struct pkt_buf pkts[RING_SIZE];
+} rings[2];
+
+static int setup_vring(int vhost_fd, int tun_fd, int call_fd, int kick_fd, int idx)
+{
+	struct test_vring *vring = &rings[idx];
+	int ret;
+
+	memset(vring, 0, sizeof(*vring));
+
+	struct vhost_vring_state vs = { };
+	vs.index = idx;
+	vs.num = RING_SIZE;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &vs) < 0) {
+		perror("VHOST_SET_VRING_NUM");
+		return -1;
+	}
+
+	vs.num = 0;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_BASE, &vs) < 0) {
+		perror("VHOST_SET_VRING_BASE");
+		return -1;
+	}
+
+	struct vhost_vring_addr va = { };
+	va.index = idx;
+	va.desc_user_addr = (uint64_t)vring->desc;
+	va.avail_user_addr = (uint64_t)&vring->avail;
+	va.used_user_addr  = (uint64_t)&vring->used;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &va) < 0) {
+		perror("VHOST_SET_VRING_ADDR");
+		return -1;
+	}
+
+	struct vhost_vring_file vf = { };
+	vf.index = idx;
+	vf.fd = tun_fd;
+	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf) < 0) {
+		perror("VHOST_NET_SET_BACKEND");
+		return -1;
+	}
+
+	vf.fd = call_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &vf) < 0) {
+		perror("VHOST_SET_VRING_CALL");
+		return -1;
+	}
+
+	vf.fd = kick_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &vf) < 0) {
+		perror("VHOST_SET_VRING_KICK");
+		return -1;
+	}
+
+	return 0;
+}
+
+int setup_vhost(int vhost_fd, int tun_fd, int call_fd, int kick_fd, uint64_t want_features)
+{
+	int ret;
+
+	if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) {
+		perror("VHOST_SET_OWNER");
+		return -1;
+	}
+
+	uint64_t features;
+	if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) {
+		perror("VHOST_GET_FEATURES");
+		return -1;
+	}
+
+	if ((features & want_features) != want_features)
+		return KSFT_SKIP;
+
+	if (ioctl(vhost_fd, VHOST_SET_FEATURES, &want_features) < 0) {
+		perror("VHOST_SET_FEATURES");
+		return -1;
+	}
+
+	struct vhost_memory *vmem = alloca(sizeof(*vmem) + sizeof(vmem->regions[0]));
+
+	memset(vmem, 0, sizeof(*vmem) + sizeof(vmem->regions[0]));
+	vmem->nregions = 1;
+	/*
+	 * I just want to map the *whole* of userspace address space. But
+	 * from userspace I don't know what that is. On x86_64 it would be:
+	 *
+	 * vmem->regions[0].guest_phys_addr = 4096;
+	 * vmem->regions[0].memory_size = 0x7fffffffe000;
+	 * vmem->regions[0].userspace_addr = 4096;
+	 *
+	 * For now, just ensure we put everything inside a single BSS region.
+	 */
+	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
+	vmem->regions[0].userspace_addr = (uint64_t)&rings;
+	vmem->regions[0].memory_size = sizeof(rings);
+
+	if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
+		perror("VHOST_SET_MEM_TABLE");
+		return -1;
+	}
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 0))
+		return -1;
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 1))
+		return -1;
+
+	return 0;
+}
+
+
+static char ping_payload[16] = "VHOST TEST PACKT";
+
+static inline uint32_t csum_partial(uint16_t *buf, int nwords)
+{
+	uint32_t sum = 0;
+
+	for (; nwords > 0; nwords--)
+		sum += ntohs(*buf++);
+	return sum;
+}
+
+static inline uint16_t csum_finish(uint32_t sum)
+{
+	sum = (sum >> 16) + (sum & 0xffff);
+	sum += (sum >> 16);
+	return htons((uint16_t)(~sum));
+}
+
+static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+			    struct in6_addr *src, uint16_t id, uint16_t seq)
+{
+	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
+	const int plen = sizeof(struct ip6_hdr) + icmplen;
+	struct ip6_hdr *iph = (void *)data;
+	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
+
+	/* IPv6 Header */
+	iph->ip6_flow = htonl((6 << 28) + /* version 6 */
+			      (0 << 20) + /* traffic class */
+			      (0 << 0));  /* flow ID  */
+	iph->ip6_nxt = IPPROTO_ICMPV6;
+	iph->ip6_plen = htons(icmplen);
+	iph->ip6_hlim = 128;
+	iph->ip6_src = *src;
+	iph->ip6_dst = *dst;
+
+	/* ICMPv6 echo request */
+	icmph->icmp6_type = ICMP6_ECHO_REQUEST;
+	icmph->icmp6_code = 0;
+	icmph->icmp6_data16[0] = htons(id);	/* ID */
+	icmph->icmp6_data16[1] = htons(seq);	/* sequence */
+
+	/* Some arbitrary payload */
+	memcpy(&icmph[1], ping_payload, sizeof(ping_payload));
+
+	/*
+	 * IPv6 upper-layer checksums include a pseudo-header
+	 * for IPv6 which contains the source address, the
+	 * destination address, the upper-layer packet length
+	 * and next-header field. See RFC8200 §8.1. The
+	 * checksum is as follows:
+	 *
+	 *   checksum 32 bytes of real IPv6 header:
+	 *     src addr (16 bytes)
+	 *     dst addr (16 bytes)
+	 *   8 bytes more:
+	 *     length of ICMPv6 in bytes (be32)
+	 *     3 bytes of 0
+	 *     next header byte (IPPROTO_ICMPV6)
+	 *   Then the actual ICMPv6 bytes
+	 */
+	uint32_t sum = csum_partial((uint16_t *)&iph->ip6_src, 8);      /* 8 uint16_t */
+	sum += csum_partial((uint16_t *)&iph->ip6_dst, 8);              /* 8 uint16_t */
+
+	/* The easiest way to checksum the following 8-byte
+	 * part of the pseudo-header without horridly violating
+	 * C type aliasing rules is *not* to build it in memory
+	 * at all. We know the length fits in 16 bits so the
+	 * partial checksum of 00 00 LL LL 00 00 00 NH ends up
+	 * being just LLLL + NH.
+	 */
+	sum += IPPROTO_ICMPV6;
+	sum += ICMP_MINLEN + sizeof(ping_payload);
+
+	sum += csum_partial((uint16_t *)icmph, icmplen / 2);
+	icmph->icmp6_cksum = csum_finish(sum);
+	return plen;
+}
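As a sanity check on the pseudo-header shortcut described in the comment above (summing just LLLL + NH instead of building the 8-byte block in memory), here is a minimal standalone sketch. The toy payload and the csum_verify() helper are invented for illustration; a receiver folding the sum of payload, pseudo-header and checksum should get 0xffff:

```c
#include <arpa/inet.h>
#include <stdint.h>

static uint32_t csum_partial(const uint16_t *buf, int nwords)
{
	uint32_t sum = 0;

	while (nwords-- > 0)
		sum += ntohs(*buf++);
	return sum;
}

static uint16_t csum_finish(uint32_t sum)
{
	sum = (sum >> 16) + (sum & 0xffff);
	sum += (sum >> 16);
	return htons((uint16_t)~sum);
}

/* Fold payload + pseudo-header + checksum; a valid packet yields 0xffff */
static uint16_t csum_verify(const uint16_t *payload, int nwords,
			    uint32_t ulp_len, uint8_t nexthdr, uint16_t cksum)
{
	uint32_t sum = csum_partial(payload, nwords) + ulp_len + nexthdr +
		       ntohs(cksum);

	sum = (sum >> 16) + (sum & 0xffff);
	sum += (sum >> 16);
	return (uint16_t)sum;
}
```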
+
+
+static int check_icmp_response(unsigned char *data, uint32_t len,
+			       struct in6_addr *dst, struct in6_addr *src)
+{
+	struct ip6_hdr *iph = (void *)data;
+
+	return (len >= 40 + ICMP_MINLEN + sizeof(ping_payload) /* No short-packet segfaults */
+		&& (ntohl(iph->ip6_flow) >> 28) == 6 /* IPv6 header */
+		&& iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
+		&& !memcmp(&iph->ip6_src, src, 16) /* source == magic address */
+		&& !memcmp(&iph->ip6_dst, dst, 16) /* dest == magic address */
+		&& data[40] == ICMP6_ECHO_REPLY /* ICMPv6 reply */
+		&& !memcmp(&data[40 + ICMP_MINLEN], ping_payload, sizeof(ping_payload)) /* Same payload in response */
+		);
+}
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define vio16(x) (x)
+#define vio32(x) (x)
+#define vio64(x) (x)
+#else
+#define vio16(x) __builtin_bswap16(x)
+#define vio32(x) __builtin_bswap32(x)
+#define vio64(x) __builtin_bswap64(x)
+#endif
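The vio16/vio32/vio64 helpers above encode the fact that VIRTIO_F_VERSION_1 ring fields are little-endian on the "wire" regardless of host byte order, so they are a no-op on LE hosts and a byte-swap on BE hosts. A small sketch (the vio16_wire_bytes() helper is invented for illustration) showing that the encoded bytes come out little-endian either way:

```c
#include <stdint.h>

/* Mirror of the test's byte-order helper for 16-bit virtio fields */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define vio16(x) (x)
#else
#define vio16(x) __builtin_bswap16(x)
#endif

/* Decompose a vio16-encoded value into its in-memory (wire) bytes */
static void vio16_wire_bytes(uint16_t host_val, uint8_t out[2])
{
	uint16_t wire = vio16(host_val);
	const uint8_t *p = (const uint8_t *)&wire;

	out[0] = p[0];
	out[1] = p[1];
}
```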
+
+
+int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features)
+{
+	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int vhost_fd = open("/dev/vhost-net", O_RDWR);
+	int tun_fd = -1;
+	int ret = KSFT_SKIP;
+
+	if (call_fd < 0 || kick_fd < 0 || vhost_fd < 0)
+		goto err;
+
+	memset(rings, 0, sizeof(rings));
+
+	/* Pick up the link-local address that the kernel
+	 * assigns to the tun device. */
+	struct in6_addr tun_addr;
+	tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr);
+	if (tun_fd < 0)
+		goto err;
+
+	int pi_offset = -1;
+	int data_offset = vnet_hdr_sz;
+
+	/* The tun device puts PI *first*, before the vnet hdr */
+	if (pi) {
+		pi_offset = 0;
+		data_offset += sizeof(struct tun_pi);
+	}
+
+	/* If vhost is doing a vnet hdr it comes before all else */
+	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
+		int vhost_hdr_sz = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+						(1ULL << VIRTIO_F_VERSION_1))) ?
+			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
+			sizeof(struct virtio_net_hdr);
+
+		data_offset += vhost_hdr_sz;
+		if (pi_offset != -1)
+			pi_offset += vhost_hdr_sz;
+	}
+
+	if (!xdp) {
+		int sndbuf = RING_SIZE * 2048;
+		if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0) {
+			perror("TUNSETSNDBUF");
+			ret = -1;
+			goto err;
+		}
+	}
+
+	ret = setup_vhost(vhost_fd, tun_fd, call_fd, kick_fd, features);
+	if (ret)
+		goto err;
+
+	/* A fake link-local address for the userspace end */
+	struct in6_addr local_addr = { 0 };
+	local_addr.s6_addr16[0] = htons(0xfe80);
+	local_addr.s6_addr16[7] = htons(1);
+
+
+	/* Set up RX and TX descriptors; the latter with ping packets ready to
+	 * send to the kernel, but don't actually send them yet. */
+	for (int i = 0; i < RING_SIZE; i++) {
+		struct pkt_buf *pkt = &rings[1].pkts[i];
+		if (pi_offset != -1) {
+			struct tun_pi *pi = (void *)&pkt->data[pi_offset];
+			pi->proto = htons(ETH_P_IPV6);
+		}
+		int plen = create_icmp_echo(&pkt->data[data_offset], &tun_addr,
+					    &local_addr, 0x4747, i);
+
+		rings[1].desc[i].addr = vio64((uint64_t)pkt);
+		rings[1].desc[i].len = vio32(plen + data_offset);
+		rings[1].avail_ring[i] = vio16(i);
+
+		pkt = &rings[0].pkts[i];
+		rings[0].desc[i].addr = vio64((uint64_t)pkt);
+		rings[0].desc[i].len = vio32(sizeof(*pkt));
+		rings[0].desc[i].flags = vio16(VRING_DESC_F_WRITE);
+		rings[0].avail_ring[i] = vio16(i);
+	}
+	barrier();
+	rings[1].avail.idx = vio16(1);
+
+	uint16_t rx_seen_used = 0;
+	struct timeval tv = { 1, 0 };
+	while (1) {
+		fd_set rfds = { 0 };
+		FD_SET(call_fd, &rfds);
+
+		rings[0].avail.idx = vio16(rx_seen_used + RING_SIZE);
+		barrier();
+		eventfd_write(kick_fd, 1);
+
+		if (select(call_fd + 1, &rfds, NULL, NULL, &tv) <= 0) {
+			ret = -1;
+			goto err;
+		}
+
+		uint16_t rx_used_idx = vio16(rings[0].used.idx);
+		barrier();
+
+		while (rx_used_idx != rx_seen_used) {
+			uint32_t desc = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].id);
+			uint32_t len  = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].len);
+
+			if (desc >= RING_SIZE || len < data_offset)
+				return -1;
+
+			uint64_t addr = vio64(rings[0].desc[desc].addr);
+			if (!addr)
+				return -1;
+
+			if (len > data_offset &&
+			    (pi_offset == -1 ||
+			     ((struct tun_pi *)(addr + pi_offset))->proto == htons(ETH_P_IPV6)) &&
+			    check_icmp_response((void *)(addr + data_offset), len - data_offset,
+						&local_addr, &tun_addr)) {
+				ret = 0;
+				goto err;
+			}
+
+			/* Give the same buffer back */
+			rings[0].avail_ring[RING_MASK(rx_seen_used++)] = vio16(desc);
+		}
+		barrier();
+
+		uint64_t ev_val;
+		eventfd_read(call_fd, &ev_val);
+	}
+
+ err:
+	if (call_fd != -1)
+		close(call_fd);
+	if (kick_fd != -1)
+		close(kick_fd);
+	if (vhost_fd != -1)
+		close(vhost_fd);
+	if (tun_fd != -1)
+		close(tun_fd);
+
+	printf("TEST: (hdr %d, xdp %d, pi %d, features %llx) RESULT: %d\n",
+	       vnet_hdr_sz, xdp, pi, (unsigned long long)features, ret);
+	return ret;
+}
+
+/* For iterating over all permutations. */
+#define VHDR_LEN_BITS	3	/* Tun vhdr length selection */
+#define XDP_BIT		4	/* Don't TUNSETSNDBUF, so we use XDP */
+#define PI_BIT		8	/* Don't set IFF_NO_PI */
+#define VIRTIO_V1_BIT	16	/* Use VIRTIO_F_VERSION_1 feature */
+#define VHOST_HDR_BIT	32	/* Use VHOST_NET_F_VIRTIO_NET_HDR */
+
+unsigned int tun_vhdr_lens[] = { 0, 10, 12, 20 };
+
+int main(void)
+{
+	int result = KSFT_SKIP;
+	int i, ret;
+
+	for (i = 0; i < 64; i++) {
+		uint64_t features = 0;
+
+		if (i & VIRTIO_V1_BIT)
+			features |= (1ULL << VIRTIO_F_VERSION_1);
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+		else
+			continue; /* We'd need vio16 et al not to byteswap */
+#endif
+
+		if (i & VHOST_HDR_BIT) {
+			features |= (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);
+
+			/* Even though the test actually passes at the time of
+			 * writing, don't bother to try asking tun *and* vhost
+			 * both to handle a virtio_net_hdr at the same time.
+			 * That's just silly.  */
+			if (i & VHDR_LEN_BITS)
+				continue;
+		}
+
+		ret = test_vhost(tun_vhdr_lens[i & VHDR_LEN_BITS],
+				 !!(i & PI_BIT), !!(i & XDP_BIT), features);
+		if (ret < result)
+			result = ret;
+	}
+
+	return result;
+}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-23  3:39             ` Jason Wang
@ 2021-06-24 12:39               ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-24 12:39 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez

[-- Attachment #1: Type: text/plain, Size: 1207 bytes --]

On Wed, 2021-06-23 at 11:39 +0800, Jason Wang wrote:
> > 
> > In the 1:1 mode, the access_ok() is all that's needed since there's no
> > translation.
> > 
> > @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> >           u64 s = 0;
> >           int ret = 0;
> >    
> > +       if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) {
> 
> 
> Using vhost_has_feature() is kind of tricky since it's used for virtio 
> feature negotiation.
> 
> We probably need to use backend_features instead.
> 
> I think we should probably do more:
> 
> 1) forbid the feature to be set when mem table / IOTLB has at least one 
> mapping
> 2) forbid the mem table / IOTLB updating after the feature is set

Yes, that all makes sense. I confess I hadn't actually *implemented*
the feature at all; the only time I'd typed 'VHOST_F_IDENTITY_MAPPING'
was to show that snippet of "patch" as an example of what
translate_desc() would do. I just wanted *something* to put in the
'if()' statement as a placeholder, so I used that.

It could *even* have just been 'if (!umem)' but that might let it
happen by *accident* which probably isn't a good idea.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-24 10:42           ` David Woodhouse
@ 2021-06-25  2:55             ` Jason Wang
  2021-06-25  7:54               ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-25  2:55 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez


On 2021/6/24 6:42 PM, David Woodhouse wrote:
> On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
>> On 2021/6/24 12:12 AM, David Woodhouse wrote:
>>> We *should* eventually expand this test case to attach an AF_PACKET
>>> device to the vhost-net, instead of using a tun device as the back end.
>>> (Although I don't really see *why* vhost is limited to AF_PACKET. Why
>>> *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
>>
>> It's just because nobody wrote the code. And we're lacking the real use
>> case.
> Hm, what code?


The code to support AF_UNIX.


>   For AF_PACKET I haven't actually spotted that there *is*
> any.


Vhost_net has had this support for more than 10 years. It's hard to say 
there are no users of it.


>
> As I've been refactoring the interaction between vhost and tun/tap, and
> fixing it up for different vhdr lengths, PI, and (just now) frowning in
> horror at the concept that tun and vhost can have *different*
> endiannesses, I hadn't spotted that there was anything special on the
> packet socket.


Vnet header support.


>   For that case, sock_hlen is just zero and we
> send/receive plain packets... or so I thought? Did I miss something?


With vnet header, it can have GSO and csum offload.


>
> As far as I was aware, that ought to have worked with any datagram
> socket. I was pondering not just AF_UNIX but also UDP (since that's my
> main transport for VPN data, at least in the case where I care about
> performance).


My understanding is that vhost_net was designed for accelerating the 
virtio datapath, which is mainly used for VMs (L2 traffic). So all kinds 
of TAPs (tuntap, macvtap or packet socket) are the main users. If you 
check the git history, vhost could not be enabled without KVM until 
sometime last year. So I concede it could serve more general use cases, 
and we have already had some discussions. But it's hard to say it's 
worth doing, since it would become a re-invention of io_uring?

Another interesting thing is that the copy done by vhost 
(copy_from/to_user()) will be much slower than io_uring (GUP), because 
the userspace copy may suffer from the performance hit caused by SMAP.


>
> An interesting use case for a non-packet socket might be to establish a
> tunnel. A guest's virtio-net device is just connected to a UDP socket
> on the host, and that tunnels all their packets to/from a remote
> endpoint which is where that guest is logically connected to the
> network. It might be useful for live migration cases, perhaps?


The kernel already supports tunnels like L2 over L4, all done at the 
netdevice level (e.g. vxlan). If you want to build a customized tunnel 
which is not supported by the kernel, you need to redirect the traffic 
back to userspace. vhost-user is probably the best choice in that case.


>
> I don't have an overriding desire to *make* it work, mind you; I just
> wanted to make sure I understand *why* it doesn't, if indeed it
> doesn't. As far as I could tell, it *should* work if we just dropped
> the check?


I'm not sure. It requires careful thought.

For the case of L2/VM, we care more about the performance, which can be 
achieved via the vnet header.

For the case of L3 (TUN) or above, we can do that via io_uring.

So it doesn't look worth the bother to me.


>
>> Vhost_net is basically used for accepting packets from userspace into the
>> kernel networking stack.
>>
>> Using AF_UNIX makes it look more like a case of inter-process
>> communication (without a vnet header it won't be used efficiently by a VM).
>> In this case, using io_uring is much more suitable.
>>
>> Or thinking in another way, instead of depending on the vhost_net, we
>> can expose TUN/TAP socket to userspace then io_uring could be used for
>> the OpenConnect case as well?
> That would work, I suppose. Although as noted, I *can* use vhost_net
> with tun today from userspace as long as I disable XDP and PI, and use
> a virtio_net_hdr that I don't really want. (And pull a value for
> TASK_SIZE out of my posterior; qv.)
>
> So I *can* ship a version of OpenConnect that works on existing
> production kernels with those workarounds, and I'm fixing up the other
> permutations of vhost/tun stuff in the kernel just because I figured we
> *should*.
>
> If I'm going to *require* new kernel support for OpenConnect then I
> might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
> have the data packets never go to userspace in the first place.


Note that BPF have some limitations:

1) requires capabilities like CAP_BPF
2) may need userspace fallback


>
>
>>>
>>>>> +	/*
>>>>> +	 * I just want to map the *whole* of userspace address space. But
>>>>> +	 * from userspace I don't know what that is. On x86_64 it would be:
>>>>> +	 *
>>>>> +	 * vmem->regions[0].guest_phys_addr = 4096;
>>>>> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
>>>>> +	 * vmem->regions[0].userspace_addr = 4096;
>>>>> +	 *
>>>>> +	 * For now, just ensure we put everything inside a single BSS region.
>>>>> +	 */
>>>>> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
>>>>> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
>>>>> +	vmem->regions[0].memory_size = sizeof(rings);
>>>> Instead of doing tricks like this, we can do it in another way:
>>>>
>>>> 1) enable device IOTLB
>>>> 2) wait for the IOTLB miss request (iova, len) and update identity
>>>> mapping accordingly
>>>>
>>>> This should work for all the archs (with some performance hit).
>>> Ick. For my actual application (OpenConnect) I'm either going to suck
>>> it up and put in the arch-specific limits like in the comment above, or
>>> I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
>>> about elsewhere.
>>
>> The feature could be useful for the case of vhost-vDPA as well.
>>
>>
>>>    (Probably the former, since if I'm requiring kernel
>>> changes then I have grander plans around extending AF_TLS to do DTLS,
>>> then hooking that directly up to the tun socket via BPF and a sockmap
>>> without the data frames ever going to userspace at all.)
>>
>> Ok, I guess we need to make sockmap works for tun socket.
> Hm, I need to work out the ideal data flow here. I don't know if
> sendmsg() / recvmsg() on the tun socket are even what I want, for full
> zero-copy.


Zerocopy could be done via vhost_net, but due to the head-of-line (HOL) 
blocking issue we disabled it by default.


>
> In the case where the NIC supports encryption, we want true zero-copy
> from the moment the "encrypted" packet arrives over UDP on the public
> network, through the DTLS processing and seqno checking, to being
> processed as netif_receive_skb() on the tun device.
>
> Likewise skbs from tun_net_xmit() should have the appropriate DTLS and
> IP/UDP headers prepended to them and that *same* skb (or at least the
> same frags) should be handed to the NIC to encrypt and send.
>
> In the case where we have software crypto in the kernel, we can
> tolerate precisely *one* copy because the crypto doesn't have to be
> done in-place, so moving from the input to the output crypto buffers
> can be that one "copy", and we can use it to move data around (from the
> incoming skb to the outgoing skb) if we need to.


I'm not familiar with encryption, but it looks like what you want is 
TLS offload support in TUN/TAP.


>
> Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
> being *given* ownership of the buffers, rather than copying from them.
> Or just being given a skb and pushing/pulling their headers.


It looks more like you want to add sendpage() support for TUN? The first 
step, as discussed, would be the code to expose the TUN socket to userspace.


>
> I'm looking at skb_send_sock() and it *doesn't* seem to support "just
> steal the frags from the initial skb and give them to the new one", but
> there may be ways to make that work.


I don't know. Last time I checked, sockmap only supported TCP sockets. 
But I saw some work proposed by Wang Cong to make it work for UDP.

Thanks


>
>>> I think I can fix *all* those test cases by making tun_get_socket()
>>> take an extra 'int *' argument, and use that to return the *actual*
>>> value of sock_hlen. Here's the updated test case in the meantime:
>>
>> It would be better if you can post a new version of the whole series to
>> ease the reviewing.
> Yep. I was working on that... until I got even more distracted by
> looking at how we can do the true in-kernel zero-copy option ;)
>


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
                     ` (3 preceding siblings ...)
  2021-06-24 12:30   ` [PATCH v3 5/5] vhost_net: Add self test with tun device David Woodhouse
@ 2021-06-25  5:00   ` Jason Wang
  2021-06-25  8:23     ` David Woodhouse
  2021-06-25 18:13   ` Willem de Bruijn
  5 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-25  5:00 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/24 8:30 PM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The vhost-net driver was making wild assumptions about the header length
> of the underlying tun/tap socket.


It's by design: we depend on userspace to coordinate the vnet header 
setting with the underlying sockets.


>   Then it was discarding packets if
> the number of bytes it got from sock_recvmsg() didn't precisely match
> its guess.


Is anything broken by this? The failure is a hint to userspace 
that something went wrong during the coordination.


>
> Fix it to get the correct information along with the socket itself.


I'm not sure what is fixed by this. It looks to me like it tries to let 
packets through even if userspace set the wrong attributes on the tap or 
vhost device. That is even worse than explicitly failing the RX.


> As a side-effect, this means that tun_get_socket() won't work if the
> tun file isn't actually connected to a device, since there's no 'tun'
> yet in that case to get the information from.


This may break existing applications. Vhost-net is tied to the socket 
rather than to the device, to which the socket is only loosely coupled.


>
> On the receive side, where the tun device generates the virtio_net_hdr
> but VIRTIO_NET_F_MRG_RXBUF was negotiated and vhost-net needs to fill
> in the 'num_buffers' field on top of the existing virtio_net_hdr, fix
> that to use 'sock_hlen - 2' as the location, which means that it goes
> in the right place regardless of whether the tun device is using an
> additional tun_pi header or not. In this case, the user should have
> configured the tun device with a vnet hdr size of 12, to make room.
>
> Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   drivers/net/tap.c      |  5 ++++-
>   drivers/net/tun.c      | 16 +++++++++++++++-
>   drivers/vhost/net.c    | 31 +++++++++++++++----------------
>   include/linux/if_tap.h |  4 ++--
>   include/linux/if_tun.h |  4 ++--
>   5 files changed, 38 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 8e3a28ba6b28..2170a0d3d34c 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -1246,7 +1246,7 @@ static const struct proto_ops tap_socket_ops = {
>    * attached to a device.  The returned object works like a packet socket, it
>    * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
>    * holding a reference to the file for as long as the socket is in use. */
> -struct socket *tap_get_socket(struct file *file)
> +struct socket *tap_get_socket(struct file *file, size_t *hlen)
>   {
>   	struct tap_queue *q;
>   	if (file->f_op != &tap_fops)
> @@ -1254,6 +1254,9 @@ struct socket *tap_get_socket(struct file *file)
>   	q = file->private_data;
>   	if (!q)
>   		return ERR_PTR(-EBADFD);
> +	if (hlen)
> +		*hlen = (q->flags & IFF_VNET_HDR) ? q->vnet_hdr_sz : 0;
> +
>   	return &q->sock;
>   }
>   EXPORT_SYMBOL_GPL(tap_get_socket);
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 4cf38be26dc9..67b406fa0881 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -3649,7 +3649,7 @@ static void tun_cleanup(void)
>    * attached to a device.  The returned object works like a packet socket, it
>    * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
>    * holding a reference to the file for as long as the socket is in use. */
> -struct socket *tun_get_socket(struct file *file)
> +struct socket *tun_get_socket(struct file *file, size_t *hlen)
>   {
>   	struct tun_file *tfile;
>   	if (file->f_op != &tun_fops)
> @@ -3657,6 +3657,20 @@ struct socket *tun_get_socket(struct file *file)
>   	tfile = file->private_data;
>   	if (!tfile)
>   		return ERR_PTR(-EBADFD);
> +
> +	if (hlen) {
> +		struct tun_struct *tun = tun_get(tfile);
> +		size_t len = 0;
> +
> +		if (!tun)
> +			return ERR_PTR(-ENOTCONN);
> +		if (tun->flags & IFF_VNET_HDR)
> +			len += READ_ONCE(tun->vnet_hdr_sz);
> +		if (!(tun->flags & IFF_NO_PI))
> +			len += sizeof(struct tun_pi);
> +		tun_put(tun);
> +		*hlen = len;
> +	}
>   	return &tfile->socket;
>   }
>   EXPORT_SYMBOL_GPL(tun_get_socket);
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index df82b124170e..b92a7144ed90 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)
>   
>   	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
>   		vq->log : NULL;
> -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
> +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&
> +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));


So this changes the behavior. When mergeable buffers are enabled, userspace 
expects vhost to merge buffers. If the feature is silently disabled, 
it violates the virtio spec.

If anything is wrong in the setup, userspace just breaks itself.

E.g. if sock_hlen is less than sizeof(struct virtio_net_hdr_mrg_rxbuf), 
the packet header might be overwritten by the vnet header.


>   
>   	do {
>   		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
> @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)
>   			}
>   		} else {
>   			/* Header came from socket; we'll need to patch
> -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
> +			 * ->num_buffers over the last two bytes if
> +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.
>   			 */
> -			iov_iter_advance(&fixup, sizeof(hdr));
> +			iov_iter_advance(&fixup, sock_hlen - 2);


I'm not sure what the above code wants to fix. It doesn't change 
anything if the vnet header is set correctly in TUN. It only prevents 
the packet header from being overwritten.
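For reference, the "last two bytes" arithmetic in the hunk quoted above matches the virtio header layout: num_buffers is the final 16-bit field of virtio_net_hdr_mrg_rxbuf. A standalone sketch using local mirror structs (matching the basic linux/virtio_net.h layouts, without the optional fields added by later specs):

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the virtio-net header layouts */
struct vnet_hdr_mirror {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

struct vnet_hdr_mrg_mirror {
	struct vnet_hdr_mirror hdr;
	uint16_t num_buffers;	/* the field vhost patches over */
};
```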

Thanks


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path
  2021-06-24 12:30   ` [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
@ 2021-06-25  6:58     ` Jason Wang
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-25  6:58 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/24 8:30 PM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Sometimes it's just a data packet. The virtio_net_hdr processing should be
> conditional on IFF_VNET_HDR, just as it is in tun_get_user().
>
> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/tun.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 67b406fa0881..9acd448e6dfc 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2331,7 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   {
>   	unsigned int datasize = xdp->data_end - xdp->data;
>   	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
> -	struct virtio_net_hdr *gso = &hdr->gso;
> +	struct virtio_net_hdr *gso = NULL;
>   	struct bpf_prog *xdp_prog;
>   	struct sk_buff *skb = NULL;
>   	u32 rxhash = 0, act;
> @@ -2340,9 +2340,12 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	bool skb_xdp = false;
>   	struct page *page;
>   
> +	if (tun->flags & IFF_VNET_HDR)
> +		gso = &hdr->gso;
> +
>   	xdp_prog = rcu_dereference(tun->xdp_prog);
>   	if (xdp_prog) {
> -		if (gso->gso_type) {
> +		if (gso && gso->gso_type) {
>   			skb_xdp = true;
>   			goto build;
>   		}
> @@ -2388,7 +2391,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	skb_reserve(skb, xdp->data - xdp->data_hard_start);
>   	skb_put(skb, xdp->data_end - xdp->data);
>   
> -	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
> +	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
>   		atomic_long_inc(&tun->rx_frame_errors);
>   		kfree_skb(skb);
>   		err = -EINVAL;


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-24 12:30   ` [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves David Woodhouse
@ 2021-06-25  7:33     ` Jason Wang
  2021-06-25  8:37       ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-25  7:33 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/24 8:30 PM, David Woodhouse wrote:
> From: David Woodhouse<dwmw@amazon.co.uk>
>
> When the underlying socket isn't configured with a virtio_net_hdr, the
> existing code in vhost_net_build_xdp() would attempt to validate
> uninitialised data, by copying zero bytes (sock_hlen) into the local
> copy of the header and then trying to validate that.
>
> Fixing it is somewhat non-trivial because the tun device might put a
> struct tun_pi*before*  the virtio_net_hdr, which makes it hard to find.
> So just stop messing with someone else's data in vhost_net_build_xdp(),
> and let tap and tun validate it for themselves, as they do in the
> non-XDP case anyway.


Thinking about it another way: all the XDP stuff in vhost was designed 
for TAP. XDP is not expected to work for TUN.

So we can simply have vhost avoid the XDP path if the underlying 
socket is a TUN.

Thanks


>
> This means that the 'gso' member of struct tun_xdp_hdr can die, leaving
> only 'int buflen'.
>
> The socket header of sock_hlen is still copied separately from the
> data payload because there may be a gap between them to ensure suitable
> alignment of the latter.
>
> Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
> Signed-off-by: David Woodhouse<dwmw@amazon.co.uk>
> ---
>   drivers/net/tap.c      | 25 ++++++++++++++++++++++---
>   drivers/net/tun.c      | 21 ++++++++++++++++++---
>   drivers/vhost/net.c    | 30 +++++++++---------------------
>   include/linux/if_tun.h |  1 -
>   4 files changed, 49 insertions(+), 28 deletions(-)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-24 12:30   ` [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
@ 2021-06-25  7:41     ` Jason Wang
  2021-06-25  8:51       ` David Woodhouse
  2021-06-25 18:43     ` Willem de Bruijn
  1 sibling, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-25  7:41 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/24 8:30 PM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> In tun_get_user(), skb->protocol is either taken from the tun_pi header
> or inferred from the first byte of the packet in IFF_TUN mode, while
> eth_type_trans() is called only in the IFF_TAP mode where the payload
> is expected to be an Ethernet frame.
>
> The equivalent code path in tun_xdp_one() was unconditionally using
> eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and
> corrupts packets.
>
> Pull the logic out to a separate tun_skb_set_protocol() function, and
> call it from both tun_get_user() and tun_xdp_one().
>
> XX: It is not entirely clear to me why it's OK to call eth_type_trans()
> in some cases without first checking that enough of the Ethernet header
> is linearly present by calling pskb_may_pull().


Looks like a bug.


>   Such a check was never
> present in the tun_xdp_one() code path, and commit 96aa1b22bd6bb ("tun:
> correct header offsets in napi frags mode") deliberately added it *only*
> for the IFF_NAPI_FRAGS mode.


We had already checked this in tun_get_user() before:

         if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
                 align += NET_IP_ALIGN;
                 if (unlikely(len < ETH_HLEN ||
                              (gso.hdr_len && tun16_to_cpu(tun, 
gso.hdr_len) < ETH_HLEN)))
                         return -EINVAL;
         }


>
> I would like to see specific explanations of *why* it's ever valid and
> necessary (is it so much faster?) to skip the pskb_may_pull() by setting
> the 'no_pull_check' flag to tun_skb_set_protocol(), but for now I'll
> settle for faithfully preserving the existing behaviour and pretending
> it's someone else's problem.
>
> Cc: Willem de Bruijn <willemb@google.com>
> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


Thanks


> ---
>   drivers/net/tun.c | 95 +++++++++++++++++++++++++++++++----------------
>   1 file changed, 63 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 1b553f79adb0..9379fa86fae9 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1641,6 +1641,47 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>   	return NULL;
>   }
>   
> +static int tun_skb_set_protocol(struct tun_struct *tun, struct sk_buff *skb,
> +				__be16 pi_proto, bool no_pull_check)
> +{
> +	switch (tun->flags & TUN_TYPE_MASK) {
> +	case IFF_TUN:
> +		if (tun->flags & IFF_NO_PI) {
> +			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
> +
> +			switch (ip_version) {
> +			case 4:
> +				pi_proto = htons(ETH_P_IP);
> +				break;
> +			case 6:
> +				pi_proto = htons(ETH_P_IPV6);
> +				break;
> +			default:
> +				return -EINVAL;
> +			}
> +		}
> +
> +		skb_reset_mac_header(skb);
> +		skb->protocol = pi_proto;
> +		skb->dev = tun->dev;
> +		break;
> +	case IFF_TAP:
> +		/* As an optimisation, no_pull_check can be set in the cases
> +		 * where the caller *knows* that eth_type_trans() will be OK
> +		 * because the Ethernet header is linearised at skb->data.
> +		 *
> +		 * XX: Or so I was reliably assured when I moved this code
> +		 * and didn't make it unconditional. dwmw2.
> +		 */
> +		if (!no_pull_check && !pskb_may_pull(skb, ETH_HLEN))
> +			return -ENOMEM;
> +
> +		skb->protocol = eth_type_trans(skb, tun->dev);
> +		break;
> +	}
> +	return 0;
> +}
> +
>   /* Get packet from user space buffer */
>   static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>   			    void *msg_control, struct iov_iter *from,
> @@ -1784,37 +1825,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>   		return -EINVAL;
>   	}
>   
> -	switch (tun->flags & TUN_TYPE_MASK) {
> -	case IFF_TUN:
> -		if (tun->flags & IFF_NO_PI) {
> -			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
> -
> -			switch (ip_version) {
> -			case 4:
> -				pi.proto = htons(ETH_P_IP);
> -				break;
> -			case 6:
> -				pi.proto = htons(ETH_P_IPV6);
> -				break;
> -			default:
> -				atomic_long_inc(&tun->dev->rx_dropped);
> -				kfree_skb(skb);
> -				return -EINVAL;
> -			}
> -		}
> -
> -		skb_reset_mac_header(skb);
> -		skb->protocol = pi.proto;
> -		skb->dev = tun->dev;
> -		break;
> -	case IFF_TAP:
> -		if (frags && !pskb_may_pull(skb, ETH_HLEN)) {
> -			err = -ENOMEM;
> -			goto drop;
> -		}
> -		skb->protocol = eth_type_trans(skb, tun->dev);
> -		break;
> -	}
> +	err = tun_skb_set_protocol(tun, skb, pi.proto, !frags);
> +	if (err)
> +		goto drop;
>   
>   	/* copy skb_ubuf_info for callback when skb has no error */
>   	if (zerocopy) {
> @@ -2335,12 +2348,24 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	struct virtio_net_hdr *gso = NULL;
>   	struct bpf_prog *xdp_prog;
>   	struct sk_buff *skb = NULL;
> +	__be16 proto = 0;
>   	u32 rxhash = 0, act;
>   	int buflen = hdr->buflen;
>   	int err = 0;
>   	bool skb_xdp = false;
>   	struct page *page;
>   
> +	if (!(tun->flags & IFF_NO_PI)) {
> +		struct tun_pi *pi = tun_hdr;
> +		tun_hdr += sizeof(*pi);
> +
> +		if (tun_hdr > xdp->data) {
> +			atomic_long_inc(&tun->rx_frame_errors);
> +			return -EINVAL;
> +		}
> +		proto = pi->proto;
> +	}


As replied in patch 2, I think it's better to make the XDP path work 
only for TAP, not TUN.

Then the series would be much simpler.

Thanks


> +
>   	if (tun->flags & IFF_VNET_HDR) {
>   		gso = tun_hdr;
>   		tun_hdr += sizeof(*gso);
> @@ -2413,7 +2438,13 @@ static int tun_xdp_one(struct tun_struct *tun,
>   		goto out;
>   	}
>   
> -	skb->protocol = eth_type_trans(skb, tun->dev);
> +	err = tun_skb_set_protocol(tun, skb, proto, true);
> +	if (err) {
> +		atomic_long_inc(&tun->dev->rx_dropped);
> +		kfree_skb(skb);
> +		goto out;
> +	}
> +
>   	skb_reset_network_header(skb);
>   	skb_probe_transport_header(skb);
>   	skb_record_rx_queue(skb, tfile->queue_index);


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 4/4] vhost_net: Add self test with tun device
  2021-06-25  2:55             ` Jason Wang
@ 2021-06-25  7:54               ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-25  7:54 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez

[-- Attachment #1: Type: text/plain, Size: 3549 bytes --]

On Fri, 2021-06-25 at 10:55 +0800, Jason Wang wrote:
> On 2021/6/24 6:42 PM, David Woodhouse wrote:
> > On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> > > On 2021/6/24 12:12 AM, David Woodhouse wrote:
> > > > We *should* eventually expand this test case to attach an AF_PACKET
> > > > device to the vhost-net, instead of using a tun device as the back end.
> > > > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > > > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
> > > 
> > > It's just because nobody wrote the code. And we're lacking the real use
> > > case.
> > 
> > Hm, what code?
> 
> 
> The codes to support AF_UNIX.
> 
> 
> >   For AF_PACKET I haven't actually spotted that there *is* any.
> 
> 
Vhost_net has had this support for more than 10 years. It's hard to say 
there's no user for it.
> 

I wasn't saying I hadn't spotted the use case. I hadn't spotted the
*code* which is in af_packet to support vhost. But...

> > As I've been refactoring the interaction between vhost and tun/tap, and
> > fixing it up for different vhdr lengths, PI, and (just now) frowning in
> > horror at the concept that tun and vhost can have *different*
> > endiannesses, I hadn't spotted that there was anything special on the
> > packet socket.
> 
> Vnet header support.

... I have no idea how I failed to spot that. OK, so AF_PACKET sockets
can *optionally* support the case where *they* provide the
virtio_net_hdr — instead of vhost doing it, or there being none.

But any other sockets would work for the "vhost does it" or the "no
vhdr" case.

... and I need to fix my 'get sock_hlen from the underlying tun/tap
device' patch to *not* assume that sock_hlen is zero for a raw socket;
it needs to check the PACKET_VNET_HDR sockopt. And *that* was broken
for the VERSION_1|MRG_RXBUF case before I came along, wasn't it?
Because vhost would have assumed sock_hlen to be 12 bytes, while in
AF_PACKET it's always only 10?

> >   For that case, sock_hlen is just zero and we
> > send/receive plain packets... or so I thought? Did I miss something?
> 
> 
> With vnet header, it can have GSO and csum offload.
> 
> 
> > 
> > As far as I was aware, that ought to have worked with any datagram
> > socket. I was pondering not just AF_UNIX but also UDP (since that's my
> > main transport for VPN data, at least in the case where I care about
> > performance).
> 
> 
My understanding is that vhost_net was designed for accelerating the 
virtio datapath, which is mainly used for VMs (L2 traffic). So all kinds 
of TAPs (tuntap, macvtap or packet socket) are the main users. If you 
check the git history, vhost could not be enabled without KVM until 
sometime last year. So I confess it can serve a more general use case, 
and we have already had some discussions. But it's hard to say it's 
worth doing, since it would become a re-invention of io_uring?

Yeah, ultimately I'm not sure that's worth exploring. As I said, I was
looking for something that works on *current* kernels. Which means no
io_uring on the underlying tun socket, and no vhost on UDP. If I want
to go and implement *both* ring protocols in userspace and make use of
each of them on the socket that they do support, I can do that. Yay! :)

If I'm going to require new kernels, then I should just work on the
"ideal" data path which doesn't really involve userspace at all. But we
should probably take that discussion to a separate thread.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
  2021-06-25  5:00   ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() Jason Wang
@ 2021-06-25  8:23     ` David Woodhouse
  2021-06-28  4:22       ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-25  8:23 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn

[-- Attachment #1: Type: text/plain, Size: 5798 bytes --]

On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote:
> On 2021/6/24 8:30 PM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The vhost-net driver was making wild assumptions about the header length
> > of the underlying tun/tap socket.
> 
> 
> It's by design that we depend on the userspace to co-ordinate the vnet 
> header setting with the underlying sockets.
> 
> 
> >   Then it was discarding packets if
> > the number of bytes it got from sock_recvmsg() didn't precisely match
> > its guess.
> 
> 
> Anything that is broken by this? The failure is a hint for the userspace 
> that something is wrong during the coordination.

I am not a fan of this approach. I firmly believe that for a given
configuration, the kernel should either *work* or it should gracefully
refuse to set it up that way. And the requirements should be clearly
documented.

Having been on the receiving end of this "hint" of which you speak, I
found it distinctly suboptimal as a user interface. I was left
scrabbling around trying to find a set of options which *would* work,
and it was only through debugging the kernel that I managed to work out
that I:

  • MUST set IFF_NO_PI
  • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX
  • MUST use a virtio_net_hdr that I don't want

If my application failed to do any of those things, I got a silent
failure to transport any packets. The only thing I could do *without*
debugging the kernel was tcpdump on the 'tun0' interface and see if the
TX packets I put into the ring were even making it to the interface,
and what they looked like if they did. (Losing the first 14 bytes and
having the *next* 14 bytes scribbled on by an Ethernet header was a fun
one.)





> > 
> > Fix it to get the correct information along with the socket itself.
> 
> 
> I'm not sure what is fixed by this. It looks to me like it tries to let 
> packets go even if the userspace set the wrong attributes on tap or 
> vhost. This is even more sub-optimal than explicitly failing the RX.

I'm OK with explicit failure. But once I'd let it *get* the information
from the underlying socket in order to decide whether it should fail or
not, it turned out to be easy enough just to make those configs work
anyway.

The main case where that "easy enough" is stretched a little (IMO) was
when there's a tun_pi header. I have one more of your emails to reply
to after this, and I'll address that there.


> 
> > As a side-effect, this means that tun_get_socket() won't work if the
> > tun file isn't actually connected to a device, since there's no 'tun'
> > yet in that case to get the information from.
> 
> 
> This may break existing applications. Vhost-net is tied to the socket 
> instead of the device, to which the socket is only loosely coupled.

Hm. Perhaps the PI and vnet hdr should be considered an option of the
*socket* (which is tied to the tfile), not purely an option of the
underlying device?

Or maybe it's sufficient just to get the flags from *either* tfile->tun 
or tfile->detached, so that it works when the queue is detached. I'll
take a look.

I suppose we could even have a fallback that makes stuff up like we do
today. If the user attempts to attach a tun file descriptor to vhost
without ever calling TUNSETIFF on it first, *then* we make the same
assumptions we do today?

> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)
> >   
> >   	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
> >   		vq->log : NULL;
> > -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
> > +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&
> > +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));
> 
> 
> So this changes the behavior. When mergeable buffers are enabled, userspace 
> expects vhost to merge buffers. If the feature is disabled silently, 
> it violates the virtio spec.
> 
> If anything is wrong in the setup, userspace just breaks itself.
> 
> E.g. if sock_hlen is less than sizeof(struct virtio_net_hdr_mrg_rxbuf), the packet 
> header might be overwritten by the vnet header.

This wasn't intended to change the behaviour of any code path that is
already working today. If *either* vhost or the underlying device have
provided a vnet header, we still merge.

If *neither* provide a vnet hdr, there's nowhere to put num_buffers and
we can't merge.

That code path doesn't work at all today, but does after my patches.
But you're right, we should explicitly refuse to negotiate
VIRTIO_NET_F_MRG_RXBUF in that case.

> 
> >   
> >   	do {
> >   		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
> > @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)
> >   			}
> >   		} else {
> >   			/* Header came from socket; we'll need to patch
> > -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
> > +			 * ->num_buffers over the last two bytes if
> > +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.
> >   			 */
> > -			iov_iter_advance(&fixup, sizeof(hdr));
> > +			iov_iter_advance(&fixup, sock_hlen - 2);
> 
> 
> I'm not sure what the above code wants to fix. It doesn't change 
> anything if the vnet header is set correctly in TUN. It only prevents 
> the packet header from being overwritten.
> 

It fixes the case where the virtio_net_hdr isn't at the start of the
tun header, because the tun actually puts the tun_pi struct *first*,
and *then* the virtio_net_hdr. 

The num_buffers field needs to go at the *end* of sock_hlen. Not at a
fixed offset from the *start* of it.

At least, that's true unless we want to just declare that we *only*
support TUN with the IFF_NO_PI flag. (qv).


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-25  7:33     ` Jason Wang
@ 2021-06-25  8:37       ` David Woodhouse
  2021-06-28  4:23         ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-25  8:37 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn

[-- Attachment #1: Type: text/plain, Size: 1911 bytes --]

On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
> On 2021/6/24 8:30 PM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > When the underlying socket isn't configured with a virtio_net_hdr, the
> > existing code in vhost_net_build_xdp() would attempt to validate
> > uninitialised data, by copying zero bytes (sock_hlen) into the local
> > copy of the header and then trying to validate that.
> > 
> > Fixing it is somewhat non-trivial because the tun device might put a
> > struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
> > So just stop messing with someone else's data in vhost_net_build_xdp(),
> > and let tap and tun validate it for themselves, as they do in the
> > non-XDP case anyway.
> 
> 
> Thinking of it another way: all the XDP stuff for vhost was prepared for 
> TAP. XDP is not expected to work for TUN.
> 
> So we can simply have vhost not go down the XDP path if the underlying 
> socket is a TUN.

Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
the tun side by that first patch I posted, which I later expanded a
little to factor out tun_skb_set_protocol().

The next two patches in my original set were fixing up the fact that
XDP currently assumes that the *socket* will be doing the vhdr, not
vhost. Those two weren't tun-specific at all.

It's supporting the PI header (which tun puts *before* the virtio
header as I just said) which introduces a tiny bit more complexity.

So yes, avoiding the XDP path if PI is being used would make some
sense.

In fact I wouldn't be entirely averse to refusing PI mode completely,
as long as we fail gracefully at setup time by refusing the
SET_BACKEND. Not by just silently failing to receive packets.

But then again, it's not actually *that* hard to support, and it's
working fine in my selftests at the end of my patch series.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-25  7:41     ` Jason Wang
@ 2021-06-25  8:51       ` David Woodhouse
  2021-06-28  4:27         ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-25  8:51 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn

[-- Attachment #1: Type: text/plain, Size: 2707 bytes --]

On Fri, 2021-06-25 at 15:41 +0800, Jason Wang wrote:
> On 2021/6/24 8:30 PM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > In tun_get_user(), skb->protocol is either taken from the tun_pi header
> > or inferred from the first byte of the packet in IFF_TUN mode, while
> > eth_type_trans() is called only in the IFF_TAP mode where the payload
> > is expected to be an Ethernet frame.
> > 
> > The equivalent code path in tun_xdp_one() was unconditionally using
> > eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and
> > corrupts packets.
> > 
> > Pull the logic out to a separate tun_skb_set_protocol() function, and
> > call it from both tun_get_user() and tun_xdp_one().
> > 
> > XX: It is not entirely clear to me why it's OK to call eth_type_trans()
> > in some cases without first checking that enough of the Ethernet header
> > is linearly present by calling pskb_may_pull().
> 
> 
> Looks like a bug.
> 
> 
> >    Such a check was never
> > present in the tun_xdp_one() code path, and commit 96aa1b22bd6bb ("tun:
> > correct header offsets in napi frags mode") deliberately added it *only*
> > for the IFF_NAPI_FRAGS mode.
> 
> 
> We had already checked this in tun_get_user() before:
> 
>          if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
>                  align += NET_IP_ALIGN;
>                  if (unlikely(len < ETH_HLEN ||
>                               (gso.hdr_len && tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
>                          return -EINVAL;
>          }

We'd checked skb->len, but that doesn't mean we had a full Ethernet
header *linearly* at skb->data, does it?

For the basic tun_get_user() case I suppose we copy_from_user() into a
single linear skb anyway, even if userspace had fragmented it and used
writev(). So we *are* probably safe there?

I'm sure we *can* contrive a proof that it's safe for that case, if we
must. But I think we should *need* that proof, if we're going to bypass
the check. And I wasn't comfortable touching that code without it.

We should also have a fairly good reason... it isn't clear to me *why*
we're bothering to avoid the check. Is it so slow, even in the case
where there's nothing to be done?

For a linear skb, the inline pskb_may_pull() is going to immediately
return true because ETH_HLEN < skb_headlen(skb), isn't it? Why optimise
*that* away?

Willem, was there a reason you made that conditional in the first
place?

If we're going to continue to *not* check on the XDP path, we similarly
need a proof that it can't be fragmented. And also a reason to bother
with the "optimisation", of course.




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
  2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
                     ` (4 preceding siblings ...)
  2021-06-25  5:00   ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() Jason Wang
@ 2021-06-25 18:13   ` Willem de Bruijn
  2021-06-25 18:55     ` David Woodhouse
  5 siblings, 1 reply; 73+ messages in thread
From: Willem de Bruijn @ 2021-06-25 18:13 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, Jason Wang, Eugenio Pérez

On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The vhost-net driver was making wild assumptions about the header length

If respinning, please more concretely describe which configuration is
currently broken.

IFF_NO_PI + IFF_VNET_HDR, if I understand correctly. But I got that
from the discussion, not the commit message.

> of the underlying tun/tap socket. Then it was discarding packets if
> the number of bytes it got from sock_recvmsg() didn't precisely match
> its guess.
>
> Fix it to get the correct information along with the socket itself.
> As a side-effect, this means that tun_get_socket() won't work if the
> tun file isn't actually connected to a device, since there's no 'tun'
> yet in that case to get the information from.
>
> On the receive side, where the tun device generates the virtio_net_hdr
> but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill

Nit: VIRTIO_NET_F_MSG_RXBUF

> in the 'num_buffers' field on top of the existing virtio_net_hdr, fix
> that to use 'sock_hlen - 2' as the location, which means that it goes

Please use sizeof(hdr.num_buffers) instead of a raw constant 2, to
self document the code.

Should this be an independent one-line fix?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-24 12:30   ` [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
  2021-06-25  7:41     ` Jason Wang
@ 2021-06-25 18:43     ` Willem de Bruijn
  2021-06-25 19:00       ` David Woodhouse
  1 sibling, 1 reply; 73+ messages in thread
From: Willem de Bruijn @ 2021-06-25 18:43 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, Jason Wang, Eugenio Pérez

On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> In tun_get_user(), skb->protocol is either taken from the tun_pi header
> or inferred from the first byte of the packet in IFF_TUN mode, while
> eth_type_trans() is called only in the IFF_TAP mode where the payload
> is expected to be an Ethernet frame.
>
> The equivalent code path in tun_xdp_one() was unconditionally using
> eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and
> corrupts packets.
>
> Pull the logic out to a separate tun_skb_set_protocol() function, and
> call it from both tun_get_user() and tun_xdp_one().

I think this should be two patches. The support for parsing pi is an
independent fix.

> XX: It is not entirely clear to me why it's OK to call eth_type_trans()
> in some cases without first checking that enough of the Ethernet header
> is linearly present by calling pskb_may_pull(). Such a check was never
> present in the tun_xdp_one() code path, and commit 96aa1b22bd6bb ("tun:
> correct header offsets in napi frags mode") deliberately added it *only*
> for the IFF_NAPI_FRAGS mode.

IFF_NAPI_FRAGS exercises napi_gro_frags, which uses the frag0
optimization where all data is in frags. The other receive paths do
not.

For the other cases, linear is guaranteed to include the link layer
header. __tun_build_skb, for instance, just allocates one big
skb->data. It is admittedly not trivial to prove this point
exhaustively for all paths.

commit 96aa1b22bd6bb restricted the new test to the frags case, to
limit the potential blast radius of a bug fix to only the code path
affected by the bug.

> I would like to see specific explanations of *why* it's ever valid and
> necessary (is it so much faster?) to skip the pskb_may_pull()

It was just not needed and that did not complicate anything until this
patch. It's fine to unconditionally check if that simplifies this
change.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
  2021-06-25 18:13   ` Willem de Bruijn
@ 2021-06-25 18:55     ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-25 18:55 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Jason Wang, Eugenio Pérez

[-- Attachment #1: Type: text/plain, Size: 3563 bytes --]

On Fri, 2021-06-25 at 14:13 -0400, Willem de Bruijn wrote:
> On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org>
> wrote:
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The vhost-net driver was making wild assumptions about the header
> > length
> 
> If respinning, please more concretely describe which configuration is
> currently broken.

Fairly much all of them. Here's a test run on the 5.12.8 kernel:

$ sudo ./test_vhost_net 
TEST: (hdr 0, xdp 0, pi 0, features 0) RESULT: -1
TEST: (hdr 10, xdp 0, pi 0, features 0) RESULT: 0
TEST: (hdr 12, xdp 0, pi 0, features 0) RESULT: -1
TEST: (hdr 20, xdp 0, pi 0, features 0) RESULT: -1
TEST: (hdr 0, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 10, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 12, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 20, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 10, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 12, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 20, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 10, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 12, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 20, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 0, xdp 0, pi 0, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 0, pi 0, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 0, pi 0, features 100000000) RESULT: 0
TEST: (hdr 20, xdp 0, pi 0, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 20, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 20, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 20, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 0, features 8000000) RESULT: 0
TEST: (hdr 0, xdp 1, pi 0, features 8000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 8000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 8000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 0, features 108000000) RESULT: 0
TEST: (hdr 0, xdp 1, pi 0, features 108000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 108000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 108000000) RESULT: -1

> IFF_NO_PI + IFF_VNET_HDR, if I understand correctly. 

That's fairly much the only one that *did* work. As long as you use
TUNSETSNDBUF which has the undocumented side-effect of turning off the
XDP path.

> > On the receive side, where the tun device generates the virtio_net_hdr
> > but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill
> 
> Nit: VIRTIO_NET_F_MSG_RXBUF

Thanks.

> > in the 'num_buffers' field on top of the existing virtio_net_hdr, fix
> > that to use 'sock_hlen - 2' as the location, which means that it goes
> 
> Please use sizeof(hdr.num_buffers) instead of a raw constant 2, to
> self document the code.

Makes sense.

> Should this be an independent one-line fix?

I don't think so; it's very much intertwined with the way it makes
assumptions about someone else's data.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-25 18:43     ` Willem de Bruijn
@ 2021-06-25 19:00       ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-25 19:00 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Jason Wang, Eugenio Pérez

[-- Attachment #1: Type: text/plain, Size: 1054 bytes --]

On Fri, 2021-06-25 at 14:43 -0400, Willem de Bruijn wrote:
> On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > In tun_get_user(), skb->protocol is either taken from the tun_pi header
> > or inferred from the first byte of the packet in IFF_TUN mode, while
> > eth_type_trans() is called only in the IFF_TAP mode where the payload
> > is expected to be an Ethernet frame.
> > 
> > The equivalent code path in tun_xdp_one() was unconditionally using
> > eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and
> > corrupts packets.
> > 
> > Pull the logic out to a separate tun_skb_set_protocol() function, and
> > call it from both tun_get_user() and tun_xdp_one().
> 
> I think this should be two patches. The support for parsing pi is an
> independent fix.

As things stand, this patch *doesn't* change the pskb_may_pull()
behaviour. I think we should make that unconditional, and can indeed do
so in a separate subsequent patch.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
  2021-06-25  8:23     ` David Woodhouse
@ 2021-06-28  4:22       ` Jason Wang
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-28  4:22 UTC (permalink / raw)
  To: David Woodhouse, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S. Tsirkin


On 2021/6/25 4:23 PM, David Woodhouse wrote:
> On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote:
>> On 2021/6/24 8:30 PM, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> The vhost-net driver was making wild assumptions about the header length
>>> of the underlying tun/tap socket.
>>
>> It's by design that we depend on the userspace to co-ordinate the vnet
>> header setting with the underlying sockets.
>>
>>
>>>    Then it was discarding packets if
>>> the number of bytes it got from sock_recvmsg() didn't precisely match
>>> its guess.
>>
>> Anything that is broken by this? The failure is a hint for the userspace
>> that something is wrong during the coordination.
> I am not a fan of this approach. I firmly believe that for a given
> configuration, the kernel should either *work* or it should gracefully
> refuse to set it up that way. And the requirements should be clearly
> documented.


That works only if all the logic is implemented in the same module, which 
is not the case in e.g. the networking stack, where a packet needs to 
traverse several modules.

E.g. in this case, the vnet header size of the TAP can be changed at 
any time via TUNSETVNETHDRSZ, and tuntap is unaware of the existence of 
vhost_net. This makes it impossible to refuse at setup time 
(SET_BACKEND).


>
> Having been on the receiving end of this "hint" of which you speak, I
> found it distinctly suboptimal as a user interface. I was left
> scrabbling around trying to find a set of options which *would* work,
> and it was only through debugging the kernel that I managed to work out
> that I:
>
>    • MUST set IFF_NO_PI
>    • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX
>    • MUST use a virtio_net_hdr that I don't want
>
> If my application failed to do any of those things, I got a silent
> failure to transport any packets.
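The three "MUST" constraints above translate into a tun setup roughly like the following sketch (hedged: `open_vhost_tun()` is a hypothetical helper, error handling is minimal, and actually running it requires CAP_NET_ADMIN; the flag and ioctl names are from <linux/if_tun.h>):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* The combination that worked with vhost-net at the time:
 * raw L3 mode, no packet-info header, with a virtio_net_hdr. */
#define VHOST_TUN_FLAGS (IFF_TUN | IFF_NO_PI | IFF_VNET_HDR)

static int open_vhost_tun(const char *name)
{
	struct ifreq ifr;
	int fd, sndbuf = 0x100000;   /* anything below the INT_MAX default */

	fd = open("/dev/net/tun", O_RDWR);
	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = VHOST_TUN_FLAGS;
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

	/* Configure the device, then shrink sndbuf so vhost-net's
	 * zerocopy accounting path behaves. */
	if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
	    ioctl(fd, TUNSETSNDBUF, &sndbuf) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```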


Yes, this is because of the bug when using vhost_net + PI/TUN. And I guess 
the reason is that nobody has tried to use that combination in the past.

I'm not even sure it's a valid setup, since vhost-net is a virtio-net 
kernel server which is not expected to handle L3 packets or a PI header 
(which is Linux-specific and outside the scope of the virtio spec).


>   The only thing I could do *without*
> debugging the kernel was tcpdump on the 'tun0' interface and see if the
> TX packets I put into the ring were even making it to the interface,
> and what they looked like if they did. (Losing the first 14 bytes and
> having the *next* 14 bytes scribbled on by an Ethernet header was a fun
> one.)


The tricky part is that the networking stack thinks the packet was 
successfully received, but it was actually dropped by vhost-net.

And there's no obvious userspace API to report such drops via 
statistics counters or tracepoints. Maybe we can tweak vhost for 
better logging in this case.


>
>
>
>
>
>>> Fix it to get the correct information along with the socket itself.
>>
>> I'm not sure what is fixed by this. It looks to me like it tries to let
>> packets through even if userspace set the wrong attributes on tap or
>> vhost. This is even more sub-optimal than explicitly failing the RX.
> I'm OK with explicit failure. But once I'd let it *get* the information
> from the underlying socket in order to decide whether it should fail or
> not, it turned out to be easy enough just to make those configs work
> anyway.


The problem is that this change may make some wrong configurations 
"work" silently at the level of vhost or TAP. When using this for a VM, 
it would make debugging even harder.


>
> The main case where that "easy enough" is stretched a little (IMO) was
> when there's a tun_pi header. I have one more of your emails to reply
> to after this, and I'll address that there.
>
>
>>> As a side-effect, this means that tun_get_socket() won't work if the
>>> tun file isn't actually connected to a device, since there's no 'tun'
>>> yet in that case to get the information from.
>>
>> This may break existing applications. Vhost-net is tied to the socket
>> instead of the device, to which the socket is only loosely coupled.
> Hm. Perhaps the PI and vnet hdr should be considered an option of the
> *socket* (which is tied to the tfile), not purely an option of the
> underlying device?


Though this is how it is done in macvtap, it's probably too late to 
change tuntap.


>
> Or maybe it's sufficient just to get the flags from *either* tfile->tun
> or tfile->detached, so that it works when the queue is detached. I'll
> take a look.
>
> I suppose we could even have a fallback that makes stuff up like we do
> today. If the user attempts to attach a tun file descriptor to vhost
> without ever calling TUNSETIFF on it first, *then* we make the same
> assumptions we do today?


Then I would rather keep using the current assumptions:

1) the value we get from get_socket() might not be correct
2) the complexity and risk aren't worth such a small improvement in 
debug-ability (which is still questionable).


>
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)
>>>    
>>>    	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
>>>    		vq->log : NULL;
>>> -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>>> +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&
>>> +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));
>>
>> So this changes the behavior. When mergeable buffers are enabled, userspace
>> expects vhost to merge buffers. If the feature is silently disabled,
>> it violates the virtio spec.
>>
>> If anything is wrong in the setup, userspace just breaks itself.
>>
>> E.g. if sock_hlen is less than sizeof(struct virtio_net_hdr_mrg_rxbuf), the
>> packet header might be overwritten by the vnet header.
> This wasn't intended to change the behaviour of any code path that is
> already working today. If *either* vhost or the underlying device have
> provided a vnet header, we still merge.
>
> If *neither* provide a vnet hdr, there's nowhere to put num_buffers and
> we can't merge.
>
> That code path doesn't work at all today, but does after my patches.
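The condition from the diff above can be modelled as a small predicate (a sketch; `can_merge` is a made-up name, and in the kernel `num_buffers` is a __virtio16, hence the 2-byte check):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merging is only possible when *someone* provides a virtio_net_hdr
 * with room for num_buffers: either vhost itself (vhost_hlen != 0),
 * or the underlying socket (sock_hlen covers at least the 2-byte
 * num_buffers field).
 */
static bool can_merge(bool has_mrg_rxbuf, size_t vhost_hlen, size_t sock_hlen)
{
	return has_mrg_rxbuf &&
	       (vhost_hlen != 0 || sock_hlen >= sizeof(uint16_t));
}
```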


It looks to me like it's a bug that userspace can keep working in this case. 
After mrg rx buffer is negotiated, userspace should always assume that 
vhost-net provides num_buffers.

> But you're right, we should explicitly refuse to negotiate
> VIRTIO_NET_F_MRG_RXBUF in that case.


This would be very hard:

1) VHOST_SET_FEATURES and VHOST_NET_SET_BACKEND are two different ioctls
2) vhost_net is not tightly coupled with tuntap; the vnet header size can 
be changed by userspace at any time


>
>>>    
>>>    	do {
>>>    		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
>>> @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)
>>>    			}
>>>    		} else {
>>>    			/* Header came from socket; we'll need to patch
>>> -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
>>> +			 * ->num_buffers over the last two bytes if
>>> +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.
>>>    			 */
>>> -			iov_iter_advance(&fixup, sizeof(hdr));
>>> +			iov_iter_advance(&fixup, sock_hlen - 2);
>>
>> I'm not sure what the above code wants to fix. It doesn't change
>> anything if the vnet header is set correctly in TUN. It only prevents
>> the packet header from being overwritten.
>>
> It fixes the case where the virtio_net_hdr isn't at the start of the
> tun header, because the tun actually puts the tun_pi struct *first*,
> and *then* the virtio_net_hdr.


Right.


> The num_buffers field needs to go at the *end* of sock_hlen. Not at a
> fixed offset from the *start* of it.
>
> At least, that's true unless we want to just declare that we *only*
> support TUN with the IFF_NO_PI flag. (qv).
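The layout being discussed can be sketched with local, simplified copies of the UAPI structs (`num_buffers_offset()` is a made-up illustrative helper):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified copies of the kernel/UAPI structs, laid out the same way. */
struct tun_pi {                         /* present unless IFF_NO_PI */
	uint16_t flags;
	uint16_t proto;
};

struct virtio_net_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len, gso_size, csum_start, csum_offset;
};

struct virtio_net_hdr_mrg_rxbuf {
	struct virtio_net_hdr hdr;
	uint16_t num_buffers;           /* always the *last* two bytes */
};

/*
 * With PI enabled, tun puts tun_pi *first* and the vnet header after
 * it, so num_buffers is not at a fixed offset from the start of the
 * socket headers; it is always the last two bytes, at sock_hlen - 2.
 */
static size_t num_buffers_offset(size_t sock_hlen)
{
	return sock_hlen - sizeof(uint16_t);
}
```

E.g. with IFF_NO_PI and a 12-byte mergeable vnet header, sock_hlen is 12 and num_buffers sits at offset 10; with a tun_pi in front, sock_hlen is 16 and the offset is 14, which an advance by a fixed sizeof(hdr) would miss.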


Yes, that's a good question. This is probably a hint that "vhost-net was 
never designed to work with PI", and even if that's not true, I'm not sure 
whether it's too late to fix.

Thanks



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-25  8:37       ` David Woodhouse
@ 2021-06-28  4:23         ` Jason Wang
  2021-06-28 11:23           ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-28  4:23 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/25 4:37 PM, David Woodhouse wrote:
> On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
>> On 2021/6/24 8:30 PM, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> When the underlying socket isn't configured with a virtio_net_hdr, the
>>> existing code in vhost_net_build_xdp() would attempt to validate
>>> uninitialised data, by copying zero bytes (sock_hlen) into the local
>>> copy of the header and then trying to validate that.
>>>
>>> Fixing it is somewhat non-trivial because the tun device might put a
>>> struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
>>> So just stop messing with someone else's data in vhost_net_build_xdp(),
>>> and let tap and tun validate it for themselves, as they do in the
>>> non-XDP case anyway.
>>
>> Thinking of it another way: all the XDP stuff for vhost was prepared for
>> TAP; XDP is not expected to work for TUN.
>>
>> So we can simply have vhost avoid the XDP path if the underlying
>> socket is a TUN.
> Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
> the tun side by that first patch I posted, which I later expanded a
> little to factor out tun_skb_set_protocol().
>
> The next two patches in my original set were fixing up the fact that
> XDP currently assumes that the *socket* will be doing the vhdr, not
> vhost. Those two weren't tun-specific at all.
>
> It's supporting the PI header (which tun puts *before* the virtio
> header as I just said) which introduces a tiny bit more complexity.


This reminds me that we need to fix tun_put_user_xdp(), but as we've 
discussed, we first need to figure out whether PI is worth supporting for 
vhost-net.


>
> So yes, avoiding the XDP path if PI is being used would make some
> sense.
>
> In fact I wouldn't be entirely averse to refusing PI mode completely,
> as long as we fail gracefully at setup time by refusing the
> SET_BACKEND. Not by just silently failing to receive packets.


That's another way. Actually, macvtap mandates IFF_TAP | IFF_NO_PI.

Thanks


>
> But then again, it's not actually *that* hard to support, and it's
> working fine in my selftests at the end of my patch series.
>



* Re: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-25  8:51       ` David Woodhouse
@ 2021-06-28  4:27         ` Jason Wang
  2021-06-28 10:43           ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-28  4:27 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/25 4:51 PM, David Woodhouse wrote:
> On Fri, 2021-06-25 at 15:41 +0800, Jason Wang wrote:
>> On 2021/6/24 8:30 PM, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> In tun_get_user(), skb->protocol is either taken from the tun_pi header
>>> or inferred from the first byte of the packet in IFF_TUN mode, while
>>> eth_type_trans() is called only in the IFF_TAP mode where the payload
>>> is expected to be an Ethernet frame.
>>>
>>> The equivalent code path in tun_xdp_one() was unconditionally using
>>> eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and
>>> corrupts packets.
>>>
>>> Pull the logic out to a separate tun_skb_set_protocol() function, and
>>> call it from both tun_get_user() and tun_xdp_one().
>>>
>>> XX: It is not entirely clear to me why it's OK to call eth_type_trans()
>>> in some cases without first checking that enough of the Ethernet header
>>> is linearly present by calling pskb_may_pull().
>>
>> Looks like a bug.
>>
>>
>>>     Such a check was never
>>> present in the tun_xdp_one() code path, and commit 96aa1b22bd6bb ("tun:
>>> correct header offsets in napi frags mode") deliberately added it *only*
>>> for the IFF_NAPI_FRAGS mode.
>>
>> We had already checked this in tun_get_user() before:
>>
>>          if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
>>                  align += NET_IP_ALIGN;
>>                  if (unlikely(len < ETH_HLEN ||
>>                               (gso.hdr_len &&
>>                                tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
>>                          return -EINVAL;
>>          }
> We'd checked skb->len, but that doesn't mean we had a full Ethernet
> header *linearly* at skb->data, does it?


The linear room is guaranteed through either:

1) tun_build_skb()

or

2) tun_alloc_skb()


>
> For the basic tun_get_user() case I suppose we copy_from_user() into a
> single linear skb anyway, even if userspace had fragment it and used
> writev(). So we *are* probably safe there?
>
> I'm sure we *can* contrive a proof that it's safe for that case, if we
> must. But I think we should *need* that proof, if we're going to bypass
> the check. And I wasn't comfortable touching that code without it.
>
> We should also have a fairly good reason... it isn't clear to me *why*
> we're bothering to avoid the check. Is it so slow, even in the case
> where there's nothing to be done?
>
> For a linear skb, the inline pskb_may_pull() is going to immediately
> return true because ETH_HLEN < skb_headlen(skb), isn't it? Why optimise
> *that* away?
>
> Willem, was there a reason you made that conditional in the first
> place?
>
> If we're going to continue to *not* check on the XDP path, we similarly
> need a proof that it can't be fragmented. And also a reason to bother
> with the "optimisation", of course.


For the XDP path, we simply need to add a length check, since the packet 
is always in linear memory.

Thanks


>
>



* Re: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode
  2021-06-28  4:27         ` Jason Wang
@ 2021-06-28 10:43           ` David Woodhouse
  0 siblings, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-28 10:43 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On Mon, 2021-06-28 at 12:27 +0800, Jason Wang wrote:
> > If we're going to continue to *not* check on the XDP path, we similarly
> > need a proof that it can't be fragmented. And also a reason to bother
> > with the "optimisation", of course.
> 
> 
> For XDP path, we simply need to add a length check since the packet is 
> always a linear memory.

Sure, but in that case skb_headlen is going to be enough to cover
ETH_HLEN, and that's the very first thing that the standard inline
version of pskb_may_pull() checks, without ever even having to make an
out-of-line call.

So there's just no reason ever for us *not* to keep it really simple,
and use pskb_may_pull() in all cases.
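A userspace model of that cheap path (names here are made up; the real inline pskb_may_pull() lives in <linux/skbuff.h> and falls back to __pskb_pull_tail() to linearize fragmented data):

```c
#include <stdbool.h>

struct model_skb {
	unsigned int len;       /* total packet length              */
	unsigned int headlen;   /* bytes linearly present at data   */
};

/* Mirrors the structure of the inline pskb_may_pull() fast path. */
static bool model_may_pull(const struct model_skb *skb, unsigned int n)
{
	if (n <= skb->headlen)
		return true;    /* already linear: near-free check   */
	if (n > skb->len)
		return false;   /* packet too short, pull must fail  */
	/* The kernel would call __pskb_pull_tail() here to linearize;
	 * a fully linear skb (headlen == len) never reaches this. */
	return false;
}
```

For an skb built from a single copy (headlen == len >= ETH_HLEN), the first branch always wins, which is the point being made: the check costs almost nothing, so there is little reason to skip it.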



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-28  4:23         ` Jason Wang
@ 2021-06-28 11:23           ` David Woodhouse
  2021-06-28 23:29             ` David Woodhouse
  2021-06-29  3:21             ` Jason Wang
  0 siblings, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-28 11:23 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On Mon, 2021-06-28 at 12:23 +0800, Jason Wang wrote:
> On 2021/6/25 4:37 PM, David Woodhouse wrote:
> > On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
> > > On 2021/6/24 8:30 PM, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > When the underlying socket isn't configured with a virtio_net_hdr, the
> > > > existing code in vhost_net_build_xdp() would attempt to validate
> > > > uninitialised data, by copying zero bytes (sock_hlen) into the local
> > > > copy of the header and then trying to validate that.
> > > > 
> > > > Fixing it is somewhat non-trivial because the tun device might put a
> > > > struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
> > > > So just stop messing with someone else's data in vhost_net_build_xdp(),
> > > > and let tap and tun validate it for themselves, as they do in the
> > > > non-XDP case anyway.
> > > 
> > > Thinking of it another way: all the XDP stuff for vhost was prepared for
> > > TAP; XDP is not expected to work for TUN.
> > > 
> > > So we can simply have vhost avoid the XDP path if the underlying
> > > socket is a TUN.
> > 
> > Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
> > the tun side by that first patch I posted, which I later expanded a
> > little to factor out tun_skb_set_protocol().
> > 
> > The next two patches in my original set were fixing up the fact that
> > XDP currently assumes that the *socket* will be doing the vhdr, not
> > vhost. Those two weren't tun-specific at all.
> > 
> > It's supporting the PI header (which tun puts *before* the virtio
> > header as I just said) which introduces a tiny bit more complexity.
> 
> 
> This reminds me that we need to fix tun_put_user_xdp(),

Good point; thanks.

> but as we've discussed, we first need to figure out whether PI is worth
> supporting for vhost-net.

FWIW I certainly don't care about PI support. The only time anyone
would want PI support is if they need to support protocols *other* than
IPv6 and Legacy IP, over tun mode.

I'm fixing this stuff because when I tried to use vhost-tun + tun for
*sensible* use cases, I ended up having to flounder around trying to
find a combination of settings that actually worked. And that offended
me :)

So I wrote a test case to iterate over various possible combinations of
settings, and then kept typing until that all worked.

The only thing I do feel quite strongly about is that stuff should
either *work*, or *explicitly* fail if it's unsupported.

At this point, although I have no actual use for it myself, I'd
probably just about come down on the side of supporting PI. On the
basis that:

 • I've basically made it work already.

 • It allows those code paths like tun_skb_set_protocol() to be
   consolidated as both calling code paths need the same thing.

 • Even in the kernel, and even when modules are as incestuously
   intertwined as vhost-net and tun already are, I'm a strong
   believer in *not* making assumptions about someone else's data,
   so letting *tun* handle its own headers without making those
   assumptions seems like the right thing to do.



If we want to support PI, I need to go fix tun_put_user_xdp() as you
noted (and work out how to add that to the test case). And resolve the
fact that configuration might change after tun_get_socket() is called —
and indeed that there might not *be* a configuration at all when
tun_get_socket() is called.


If we *don't* want to support PI, well, the *interesting* part of the
above needs fixing anyway. Because I strongly believe we should
*prevent* it if we don't support it, and we *still* have the case you
point out of the tun vhdr_size being changed at runtime.

I'll take a look at whether we can pass the socklen back from tun to
vhost-net on *every* packet. Is there a MSG_XXX flag we can abuse and
somewhere in the msghdr that could return the header length used for
*this* packet? Or could we make vhost_net_rx_peek_head_len() call
explicitly into the tun device instead of making stuff up in
peek_head_len()? 


To be clear: from the point of view of my *application* I don't care
about any of this; my only motivation here is to clean up the kernel
behaviour and make life easier for potential future users. I have found
a setup that works in today's kernels (even though I have to disable
XDP, and have to use a virtio header that I don't want), and will stick
with that for now, if I actually commit it to my master branch at all:
https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6

I might yet abandon it because I haven't *yet* seen it go any faster
than the code which just does read()/write() on the tun device from
userspace. And without XDP or zerocopy it's not clear that it could
ever give me any benefit that I couldn't achieve purely in userspace by
having a separate thread to do tun device I/O. But we'll see...



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-28 11:23           ` David Woodhouse
@ 2021-06-28 23:29             ` David Woodhouse
  2021-06-29  3:43               ` Jason Wang
  2021-06-29  3:21             ` Jason Wang
  1 sibling, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-28 23:29 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
> 
> To be clear: from the point of view of my *application* I don't care
> about any of this; my only motivation here is to clean up the kernel
> behaviour and make life easier for potential future users. I have found
> a setup that works in today's kernels (even though I have to disable
> XDP, and have to use a virtio header that I don't want), and will stick
> with that for now, if I actually commit it to my master branch at all:
> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
> 
> I might yet abandon it because I haven't *yet* seen it go any faster
> than the code which just does read()/write() on the tun device from
> userspace. And without XDP or zerocopy it's not clear that it could
> ever give me any benefit that I couldn't achieve purely in userspace by
> having a separate thread to do tun device I/O. But we'll see...

I managed to do some proper testing between EC2 c5 (Skylake) virtual
instances.

The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
1.2Gb/s from iperf, as it seems to be doing it all from the iperf
thread.

Before I started messing with OpenConnect, it could transmit 1.6Gb/s.

When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.

Adding vhost support on top of that takes me to 2.46Gb/s, which is a
decent enough win. That's with OpenConnect taking 100% CPU, iperf3
taking 50% of another one, and the vhost kernel thread taking ~20%.





* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-28 11:23           ` David Woodhouse
  2021-06-28 23:29             ` David Woodhouse
@ 2021-06-29  3:21             ` Jason Wang
  1 sibling, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-29  3:21 UTC (permalink / raw)
  To: David Woodhouse, netdev, Michael S. Tsirkin
  Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/28 7:23 PM, David Woodhouse wrote:
> On Mon, 2021-06-28 at 12:23 +0800, Jason Wang wrote:
>> On 2021/6/25 4:37 PM, David Woodhouse wrote:
>>> On Fri, 2021-06-25 at 15:33 +0800, Jason Wang wrote:
>>>> On 2021/6/24 8:30 PM, David Woodhouse wrote:
>>>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>>>
>>>>> When the underlying socket isn't configured with a virtio_net_hdr, the
>>>>> existing code in vhost_net_build_xdp() would attempt to validate
>>>>> uninitialised data, by copying zero bytes (sock_hlen) into the local
>>>>> copy of the header and then trying to validate that.
>>>>>
>>>>> Fixing it is somewhat non-trivial because the tun device might put a
>>>>> struct tun_pi *before* the virtio_net_hdr, which makes it hard to find.
>>>>> So just stop messing with someone else's data in vhost_net_build_xdp(),
>>>>> and let tap and tun validate it for themselves, as they do in the
>>>>> non-XDP case anyway.
>>>> Thinking of it another way: all the XDP stuff for vhost was prepared for
>>>> TAP; XDP is not expected to work for TUN.
>>>>
>>>> So we can simply have vhost avoid the XDP path if the underlying
>>>> socket is a TUN.
>>> Actually, IFF_TUN mode per se isn't that complex. It's fixed purely on
>>> the tun side by that first patch I posted, which I later expanded a
>>> little to factor out tun_skb_set_protocol().
>>>
>>> The next two patches in my original set were fixing up the fact that
>>> XDP currently assumes that the *socket* will be doing the vhdr, not
>>> vhost. Those two weren't tun-specific at all.
>>>
>>> It's supporting the PI header (which tun puts *before* the virtio
>>> header as I just said) which introduces a tiny bit more complexity.
>>
>> This reminds me we need to fix tun_put_user_xdp(),
> Good point; thanks.
>
>> but as we've discussed, we need first figure out if PI is worth to
>> support for vhost-net.
> FWIW I certainly don't care about PI support. The only time anyone
> would want PI support is if they need to support protocols *other* than
> IPv6 and Legacy IP, over tun mode.
>
> I'm fixing this stuff because when I tried to use vhost-tun + tun for
> *sensible* use cases, I ended up having to flounder around trying to
> find a combination of settings that actually worked. And that offended
> me :)
>
> So I wrote a test case to iterate over various possible combinations of
> settings, and then kept typing until that all worked.
>
> The only thing I do feel quite strongly about is that stuff should
> either *work*, or *explicitly* fail if it's unsupported.


I fully agree, but I suspect this only works when we invent something 
new; otherwise I'm not sure it can be fixed without breaking existing 
applications.


>
> At this point, although I have no actual use for it myself, I'd
> probably just about come down on the side of supporting PI. On the
> basis that:
>
>   • I've basically made it work already.
>
>   • It allows those code paths like tun_skb_set_protocol() to be
>     consolidated as both calling code paths need the same thing.
>
>   • Even in the kernel, and even when modules are as incestuously
>     intertwined as vhost-net and tun already are, I'm a strong
>     believer in *not* making assumptions about someone else's data,
>     so letting *tun* handle its own headers without making those
>     assumptions seems like the right thing to do.
>
>
>
> If we want to support PI, I need to go fix tun_put_user_xdp() as you
> noted (and work out how to add that to the test case). And resolve the
> fact that configuration might change after tun_get_socket() is called —
> and indeed that there might not *be* a configuration at all when
> tun_get_socket() is called.


Yes, but I tend to leave the PI part of the code as-is, considering that 
no one is interested in that combination (vhost_net + PI).


>
>
> If we *don't* want to support PI, well, the *interesting* part of the
> above needs fixing anyway. Because I strongly believe we should
> *prevent* it if we don't support it, and we *still* have the case you
> point out of the tun vhdr_size being changed at runtime.


As discussed in another thread, it looks to me that it's sufficient to 
have some statistics counters/APIs in vhost_net, or to simply use 
msg_control to reuse the tx_errors counter of TUN/TAP or macvtap.


>
> I'll take a look at whether we can pass the socklen back from tun to
> vhost-net on *every* packet. Is there a MSG_XXX flag we can abuse and
> somewhere in the msghdr that could return the header length used for
> *this* packet?


msg_control is probably the best place to do this.


>   Or could we make vhost_net_rx_peek_head_len() call
> explicitly into the tun device instead of making stuff up in
> peek_head_len()?


They work at the skb/XDP level, which is unaware of the PI stuff.

But again, I think it would be much cheaper to just add error 
reporting in this case. And it should be sufficient.


>
>
> To be clear: from the point of view of my *application* I don't care
> about any of this; my only motivation here is to clean up the kernel
> behaviour and make life easier for potential future users.


Yes, thanks a lot for having a look at this.

Though I'm not quite sure vhost_net was designed to work in those setups, 
let's ask Michael (the author of vhost/net) for his opinion:

Michael, do you think it's worth supporting

1) vhost_net + TUN
2) vhost_net + PI

?


> I have found
> a setup that works in today's kernels (even though I have to disable
> XDP, and have to use a virtio header that I don't want), and will stick
> with that for now, if I actually commit it to my master branch at all:
> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6


Yes, but unfortunately it needs some tricks to avoid hitting bugs in 
the kernel.


>
> I might yet abandon it because I haven't *yet* seen it go any faster
> than the code which just does read()/write() on the tun device from
> userspace. And without XDP or zerocopy it's not clear that it could
> ever give me any benefit that I couldn't achieve purely in userspace by
> having a separate thread to do tun device I/O. But we'll see...


Ok.

Thanks.



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-28 23:29             ` David Woodhouse
@ 2021-06-29  3:43               ` Jason Wang
  2021-06-29  6:59                 ` David Woodhouse
  2021-06-29 10:49                 ` David Woodhouse
  0 siblings, 2 replies; 73+ messages in thread
From: Jason Wang @ 2021-06-29  3:43 UTC (permalink / raw)
  To: David Woodhouse, netdev; +Cc: Eugenio Pérez, Willem de Bruijn


On 2021/6/29 7:29 AM, David Woodhouse wrote:
> On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
>> To be clear: from the point of view of my *application* I don't care
>> about any of this; my only motivation here is to clean up the kernel
>> behaviour and make life easier for potential future users. I have found
>> a setup that works in today's kernels (even though I have to disable
>> XDP, and have to use a virtio header that I don't want), and will stick
>> with that for now, if I actually commit it to my master branch at all:
>> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
>>
>> I might yet abandon it because I haven't *yet* seen it go any faster
>> than the code which just does read()/write() on the tun device from
>> userspace. And without XDP or zerocopy it's not clear that it could
>> ever give me any benefit that I couldn't achieve purely in userspace by
>> having a separate thread to do tun device I/O. But we'll see...
> I managed to do some proper testing, between EC2 c5 (Skylake) virtual
> instances.
>
> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> thread.
>
> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
>
> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
>
> Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> decent enough win.


Interesting, I think the latency should be improved as well in this case.

Thanks


> That's with OpenConnect taking 100% CPU, iperf3
> taking 50% of another one, and the vhost kernel thread taking ~20%.
>
>



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-29  3:43               ` Jason Wang
@ 2021-06-29  6:59                 ` David Woodhouse
  2021-06-29 10:49                 ` David Woodhouse
  1 sibling, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-29  6:59 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: Eugenio Pérez, Willem de Bruijn



On 29 June 2021 04:43:15 BST, Jason Wang <jasowang@redhat.com> wrote:
>
>On 2021/6/29 7:29 AM, David Woodhouse wrote:
>> On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
>>> To be clear: from the point of view of my *application* I don't care
>>> about any of this; my only motivation here is to clean up the kernel
>>> behaviour and make life easier for potential future users. I have found
>>> a setup that works in today's kernels (even though I have to disable
>>> XDP, and have to use a virtio header that I don't want), and will stick
>>> with that for now, if I actually commit it to my master branch at all:
>>> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
>>>
>>> I might yet abandon it because I haven't *yet* seen it go any faster
>>> than the code which just does read()/write() on the tun device from
>>> userspace. And without XDP or zerocopy it's not clear that it could
>>> ever give me any benefit that I couldn't achieve purely in userspace by
>>> having a separate thread to do tun device I/O. But we'll see...
>> I managed to do some proper testing, between EC2 c5 (Skylake) virtual
>> instances.
>>
>> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
>> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
>> thread.
>>
>> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
>>
>> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
>> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
>>
>> Adding vhost support on top of that takes me to 2.46Gb/s, which is a
>> decent enough win.
>
>
>Interesting, I think the latency should be improved as well in this case.

I don't know about that. I figured it would be worse in the packet-by-packet case (especially VoIP traffic): instead of just *writing* a packet to the tun device, we stick it in the ring and then make the same write() syscall on an eventfd to wake up the vhost thread, which then has to do the *same* copy_from_user() that could have happened directly in our own process.

Maybe if I have a batch of only 1 or 2 packets I should just write them directly. I still could :)

>> That's with OpenConnect taking 100% CPU, iperf3
>> taking 50% of another one, and the vhost kernel thread taking ~20%.
>>
>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-29  3:43               ` Jason Wang
  2021-06-29  6:59                 ` David Woodhouse
@ 2021-06-29 10:49                 ` David Woodhouse
  2021-06-29 13:15                   ` David Woodhouse
  2021-06-30  4:39                   ` Jason Wang
  1 sibling, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-29 10:49 UTC (permalink / raw)
  To: Jason Wang, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin

On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
> > The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> > 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> > thread.
> > 
> > Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
> > 
> > When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> > doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
> > 
> > Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> > decent enough win.
> 
> 
> Interesting, I think the latency should be improved as well in this
> case.

I tried using 'ping -i 0.1' to get an idea of latency for the
interesting VoIP-like case of packets where we have to wake up each
time.

For the *inbound* case, RX on the tun device followed by TX of the
replies, I see results like this:

     --- 172.16.0.2 ping statistics ---
     141 packets transmitted, 141 received, 0% packet loss, time 14557ms
     rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms


The opposite direction (tun TX then RX) is similar:

     --- 172.16.0.1 ping statistics ---
     295 packets transmitted, 295 received, 0% packet loss, time 30573ms
     rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms


Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
looks like this. Inbound:

     --- 172.16.0.2 ping statistics ---
     139 packets transmitted, 139 received, 0% packet loss, time 14350ms
     rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms

Outbound:

     --- 172.16.0.1 ping statistics ---
     149 packets transmitted, 149 received, 0% packet loss, time 15391ms
     rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms


So as I expected, the throughput is better with vhost-net once I get to
the point of 100% CPU usage in my main thread, because it offloads the
kernel←→user copies. But latency is somewhat worse.

I'm still using select() instead of epoll() which would give me a
little back — but only a little, as I only poll on 3-4 fds, and more to
the point it'll give me just as much win in the non-vhost case too, so
it won't make much difference to the vhost vs. non-vhost comparison.

Perhaps I really should look into that trick of "if the vhost TX ring
is already stopped and would need a kick, and I only have a few packets
in the batch, just write them directly to /dev/net/tun".

I'm wondering how that optimisation would translate to actual guests,
which presumably have the same problem. Perhaps it would be an
operation on the vhost fd, which ends up processing the ring right
there in the context of *that* process instead of doing a wakeup?

FWIW if I pull in my kernel patches and stop working around those bugs,
enabling the TX XDP path and dropping the virtio-net header that I
don't need, I get some of that latency back:

     --- 172.16.0.2 ping statistics ---
     151 packets transmitted, 151 received, 0% packet loss, time 15599ms
     rtt min/avg/max/mdev = 0.372/0.550/0.661/0.061 ms

     --- 172.16.0.1 ping statistics ---
     214 packets transmitted, 214 received, 0% packet loss, time 22151ms
     rtt min/avg/max/mdev = 0.453/0.626/0.765/0.049 ms

My bandwidth tests go up from 2.46Gb/s with the workarounds, to
2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net
header. But there's no way for userspace to *detect* that those bugs
are fixed, which makes it hard to ship that version.



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-29 10:49                 ` David Woodhouse
@ 2021-06-29 13:15                   ` David Woodhouse
  2021-06-30  4:39                   ` Jason Wang
  1 sibling, 0 replies; 73+ messages in thread
From: David Woodhouse @ 2021-06-29 13:15 UTC (permalink / raw)
  To: Jason Wang, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin

On Tue, 2021-06-29 at 11:49 +0100, David Woodhouse wrote:
> On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
> > > The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> > > 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> > > thread.
> > > 
> > > Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
> > > 
> > > When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> > > doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
> > > 
> > > Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> > > decent enough win.
> > 
> > 
> > Interesting, I think the latency should be improved as well in this
> > case.
> 
> I tried using 'ping -i 0.1' to get an idea of latency for the
> interesting VoIP-like case of packets where we have to wake up each
> time.
> 
> For the *inbound* case, RX on the tun device followed by TX of the
> replies, I see results like this:
> 
>      --- 172.16.0.2 ping statistics ---
>      141 packets transmitted, 141 received, 0% packet loss, time 14557ms
>      rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms
> 
> 
> The opposite direction (tun TX then RX) is similar:
> 
>      --- 172.16.0.1 ping statistics ---
>      295 packets transmitted, 295 received, 0% packet loss, time 30573ms
>      rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms
> 
> 
> Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
> looks like this. Inbound:
> 
>      --- 172.16.0.2 ping statistics ---
>      139 packets transmitted, 139 received, 0% packet loss, time 14350ms
>      rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms
> 
> Outbound:
> 
>      --- 172.16.0.1 ping statistics ---
>      149 packets transmitted, 149 received, 0% packet loss, time 15391ms
>      rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms
> 
> 
> So as I expected, the throughput is better with vhost-net once I get to
> the point of 100% CPU usage in my main thread, because it offloads the
> kernel←→user copies. But latency is somewhat worse.
> 
> I'm still using select() instead of epoll() which would give me a
> little back — but only a little, as I only poll on 3-4 fds, and more to
> the point it'll give me just as much win in the non-vhost case too, so
> it won't make much difference to the vhost vs. non-vhost comparison.
> 
> Perhaps I really should look into that trick of "if the vhost TX ring
> is already stopped and would need a kick, and I only have a few packets
> in the batch, just write them directly to /dev/net/tun".
> 
> I'm wondering how that optimisation would translate to actual guests,
> which presumably have the same problem. Perhaps it would be an
> operation on the vhost fd, which ends up processing the ring right
> there in the context of *that* process instead of doing a wakeup?

That turns out to be fairly trivial: 
https://gitlab.com/openconnect/openconnect/-/commit/668ff1399541be927

It gives me back about half the latency I lost by moving to vhost-net:

     --- 172.16.0.2 ping statistics ---
     133 packets transmitted, 133 received, 0% packet loss, time 13725ms
     rtt min/avg/max/mdev = 0.437/0.510/0.621/0.035 ms

     --- 172.16.0.1 ping statistics ---
     133 packets transmitted, 133 received, 0% packet loss, time 13728ms
     rtt min/avg/max/mdev = 0.541/0.605/0.658/0.022 ms

I think it's definitely worth looking at whether we can/should do
something roughly equivalent for actual guests.



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-29 10:49                 ` David Woodhouse
  2021-06-29 13:15                   ` David Woodhouse
@ 2021-06-30  4:39                   ` Jason Wang
  2021-06-30 10:02                     ` David Woodhouse
  1 sibling, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-06-30  4:39 UTC (permalink / raw)
  To: David Woodhouse, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin


On 2021/6/29 18:49, David Woodhouse wrote:
> On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
>>> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
>>> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
>>> thread.
>>>
>>> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
>>>
>>> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
>>> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
>>>
>>> Adding vhost support on top of that takes me to 2.46Gb/s, which is a
>>> decent enough win.
>>
>> Interesting, I think the latency should be improved as well in this
>> case.
> I tried using 'ping -i 0.1' to get an idea of latency for the
> interesting VoIP-like case of packets where we have to wake up each
> time.
>
> For the *inbound* case, RX on the tun device followed by TX of the
> replies, I see results like this:
>
>       --- 172.16.0.2 ping statistics ---
>       141 packets transmitted, 141 received, 0% packet loss, time 14557ms
>       rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms
>
>
> The opposite direction (tun TX then RX) is similar:
>
>       --- 172.16.0.1 ping statistics ---
>       295 packets transmitted, 295 received, 0% packet loss, time 30573ms
>       rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms
>
>
> Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
> looks like this. Inbound:
>
>       --- 172.16.0.2 ping statistics ---
>       139 packets transmitted, 139 received, 0% packet loss, time 14350ms
>       rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms
>
> Outbound:
>
>       --- 172.16.0.1 ping statistics ---
>       149 packets transmitted, 149 received, 0% packet loss, time 15391ms
>       rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms
>
>
> So as I expected, the throughput is better with vhost-net once I get to
> the point of 100% CPU usage in my main thread, because it offloads the
> kernel←→user copies. But latency is somewhat worse.
>
> I'm still using select() instead of epoll() which would give me a
> little back — but only a little, as I only poll on 3-4 fds, and more to
> the point it'll give me just as much win in the non-vhost case too, so
> it won't make much difference to the vhost vs. non-vhost comparison.
>
> Perhaps I really should look into that trick of "if the vhost TX ring
> is already stopped and would need a kick, and I only have a few packets
> in the batch, just write them directly to /dev/net/tun".


That should work on low throughput.


>
> I'm wondering how that optimisation would translate to actual guests,
> which presumably have the same problem. Perhaps it would be an
> operation on the vhost fd, which ends up processing the ring right
> there in the context of *that* process instead of doing a wakeup?


It might improve the latency in an ideal case but several possible issues:

1) this will block the vCPU from running until the send is done
2) copy_from_user() may sleep, which will block the vcpu thread further


>
> FWIW if I pull in my kernel patches and stop working around those bugs,
> enabling the TX XDP path and dropping the virtio-net header that I
> don't need, I get some of that latency back:
>
>       --- 172.16.0.2 ping statistics ---
>       151 packets transmitted, 151 received, 0% packet loss, time 15599ms
>       rtt min/avg/max/mdev = 0.372/0.550/0.661/0.061 ms
>
>       --- 172.16.0.1 ping statistics ---
>       214 packets transmitted, 214 received, 0% packet loss, time 22151ms
>       rtt min/avg/max/mdev = 0.453/0.626/0.765/0.049 ms
>
> My bandwidth tests go up from 2.46Gb/s with the workarounds, to
> 2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net
> header. But there's no way for userspace to *detect* that those bugs
> are fixed, which makes it hard to ship that version.


Yes, that's sad. One possible way is to advertise a VHOST_NET_TUN flag via
VHOST_GET_BACKEND_FEATURES?

Thanks


>



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-30  4:39                   ` Jason Wang
@ 2021-06-30 10:02                     ` David Woodhouse
  2021-07-01  4:13                       ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-06-30 10:02 UTC (permalink / raw)
  To: Jason Wang, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin

On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:
> On 2021/6/29 18:49, David Woodhouse wrote:
> > On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
> > > > The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
> > > > 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
> > > > thread.
> > > > 
> > > > Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
> > > > 
> > > > When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
> > > > doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
> > > > 
> > > > Adding vhost support on top of that takes me to 2.46Gb/s, which is a
> > > > decent enough win.
> > > 
> > > Interesting, I think the latency should be improved as well in this
> > > case.
> > 
> > I tried using 'ping -i 0.1' to get an idea of latency for the
> > interesting VoIP-like case of packets where we have to wake up each
> > time.
> > 
> > For the *inbound* case, RX on the tun device followed by TX of the
> > replies, I see results like this:
> > 
> >       --- 172.16.0.2 ping statistics ---
> >       141 packets transmitted, 141 received, 0% packet loss, time 14557ms
> >       rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms
> > 
> > 
> > The opposite direction (tun TX then RX) is similar:
> > 
> >       --- 172.16.0.1 ping statistics ---
> >       295 packets transmitted, 295 received, 0% packet loss, time 30573ms
> >       rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms
> > 
> > 
> > Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
> > looks like this. Inbound:
> > 
> >       --- 172.16.0.2 ping statistics ---
> >       139 packets transmitted, 139 received, 0% packet loss, time 14350ms
> >       rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms
> > 
> > Outbound:
> > 
> >       --- 172.16.0.1 ping statistics ---
> >       149 packets transmitted, 149 received, 0% packet loss, time 15391ms
> >       rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms
> > 
> > 
> > So as I expected, the throughput is better with vhost-net once I get to
> > the point of 100% CPU usage in my main thread, because it offloads the
> > kernel←→user copies. But latency is somewhat worse.
> > 
> > I'm still using select() instead of epoll() which would give me a
> > little back — but only a little, as I only poll on 3-4 fds, and more to
> > the point it'll give me just as much win in the non-vhost case too, so
> > it won't make much difference to the vhost vs. non-vhost comparison.
> > 
> > Perhaps I really should look into that trick of "if the vhost TX ring
> > is already stopped and would need a kick, and I only have a few packets
> > in the batch, just write them directly to /dev/net/tun".
> 
> 
> That should work on low throughput.

Indeed it works remarkably well, as I noted in my follow-up. I also
fixed a minor stupidity where I was reading from the 'call' eventfd
*before* doing the real work of moving packets around. And that gives
me a few tens of microseconds back too.

> > I'm wondering how that optimisation would translate to actual guests,
> > which presumably have the same problem. Perhaps it would be an
> > operation on the vhost fd, which ends up processing the ring right
> > there in the context of *that* process instead of doing a wakeup?
> 
> 
> It might improve the latency in an ideal case but several possible issues:
> 
> 1) this will block the vCPU from running until the send is done
> 2) copy_from_user() may sleep, which will block the vcpu thread further

Yes, it would block the vCPU for a short period of time, but we could
limit that. The real win is to improve latency of single, short packets
like a first SYN, or SYNACK. It should work fine even if it's limited
to *one* *short* packet which *is* resident in memory.

Although actually I'm not *overly* worried about the 'resident' part.
For a transmit packet, especially a short one not a sendpage(), it's
fairly likely that the guest has touched the buffer right before sending
it, and taken the hit of faulting it in then, if necessary. If the host
is paging out memory which is in *active* use by a guest, that guest is
screwed anyway :)

I'm thinking of something like an ioctl on the vhost-net fd which *if*
the thread is currently sleeping and there's a single short packet,
processes it immediately. Otherwise it wakes the thread just like the
eventfd would have done. (Perhaps just by signalling the kick eventfd,
but perhaps there's a more efficient way anyway).

> > My bandwidth tests go up from 2.46Gb/s with the workarounds, to
> > 2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net
> > header. But there's no way for userspace to *detect* that those bugs
> > are fixed, which makes it hard to ship that version.

I'm up to 2.75Gb/s now with epoll and other fixes (including using
sendmmsg() on the other side). Since the kernel can only do *half*
that, I'm now wondering if I really want my data plane in the kernel at
all, which was my long-term plan :)

> Yes, that's sad. One possible way to advertise a VHOST_NET_TUN flag via 
> VHOST_GET_BACKEND_FEATURES?

Strictly it isn't VHOST_NET_TUN, as that *does* work today if you pick
the right (very non-intuitive) combination of features. The feature is
really "VHOST_NET_TUN_WITHOUT_TUNSNDBUF_OR_UNWANTED_VNET_HEADER" :)

But we don't need a feature specifically for that; I only need to check
for *any* feature that goes in after the fixes.

Maybe if we do add a new low-latency kick then I could key on *that*
feature to assume the bugs are fixed.

Alternatively, there's still the memory map thing I need to fix before
I can commit this in my application:

#ifdef __x86_64__
	vmem->regions[0].guest_phys_addr = 4096;
	vmem->regions[0].memory_size = 0x7fffffffe000;
	vmem->regions[0].userspace_addr = 4096;
#else
#error FIXME
#endif
	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

Perhaps if we end up with a user-visible feature to deal with that,
then I could use the presence of *that* feature to infer that the tun
bugs are fixed.

Another random thought as I stare at this... can't we handle checksums
in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM
on the tun device, and just do it *while* we're copying the packet to
userspace, if userspace doesn't support it. That would be better than
having the kernel complete the checksum in a separate pass *before*
handing the skb to tun_net_xmit().

We could similarly do a partial checksum in tun_get_user() and hand it
off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.




* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-06-30 10:02                     ` David Woodhouse
@ 2021-07-01  4:13                       ` Jason Wang
  2021-07-01 17:39                         ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-07-01  4:13 UTC (permalink / raw)
  To: David Woodhouse, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin


On 2021/6/30 18:02, David Woodhouse wrote:
> On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:
>> On 2021/6/29 18:49, David Woodhouse wrote:
>>> On Tue, 2021-06-29 at 11:43 +0800, Jason Wang wrote:
>>>>> The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
>>>>> 1.2Gb/s from iperf, as it seems to be doing it all from the iperf
>>>>> thread.
>>>>>
>>>>> Before I started messing with OpenConnect, it could transmit 1.6Gb/s.
>>>>>
>>>>> When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
>>>>> doing the encryption and the HMAC in separate passes, I get to 2.1Gb/s.
>>>>>
>>>>> Adding vhost support on top of that takes me to 2.46Gb/s, which is a
>>>>> decent enough win.
>>>> Interesting, I think the latency should be improved as well in this
>>>> case.
>>> I tried using 'ping -i 0.1' to get an idea of latency for the
>>> interesting VoIP-like case of packets where we have to wake up each
>>> time.
>>>
>>> For the *inbound* case, RX on the tun device followed by TX of the
>>> replies, I see results like this:
>>>
>>>        --- 172.16.0.2 ping statistics ---
>>>        141 packets transmitted, 141 received, 0% packet loss, time 14557ms
>>>        rtt min/avg/max/mdev = 0.380/0.419/0.461/0.024 ms
>>>
>>>
>>> The opposite direction (tun TX then RX) is similar:
>>>
>>>        --- 172.16.0.1 ping statistics ---
>>>        295 packets transmitted, 295 received, 0% packet loss, time 30573ms
>>>        rtt min/avg/max/mdev = 0.454/0.545/0.718/0.024 ms
>>>
>>>
>>> Using vhost-net (and TUNSNDBUF of INT_MAX-1 just to avoid XDP), it
>>> looks like this. Inbound:
>>>
>>>        --- 172.16.0.2 ping statistics ---
>>>        139 packets transmitted, 139 received, 0% packet loss, time 14350ms
>>>        rtt min/avg/max/mdev = 0.432/0.578/0.658/0.058 ms
>>>
>>> Outbound:
>>>
>>>        --- 172.16.0.1 ping statistics ---
>>>        149 packets transmitted, 149 received, 0% packet loss, time 15391ms
>>>        rtt min/avg/max/mdev = 0.496/0.682/0.935/0.036 ms
>>>
>>>
>>> So as I expected, the throughput is better with vhost-net once I get to
>>> the point of 100% CPU usage in my main thread, because it offloads the
>>> kernel←→user copies. But latency is somewhat worse.
>>>
>>> I'm still using select() instead of epoll() which would give me a
>>> little back — but only a little, as I only poll on 3-4 fds, and more to
>>> the point it'll give me just as much win in the non-vhost case too, so
>>> it won't make much difference to the vhost vs. non-vhost comparison.
>>>
>>> Perhaps I really should look into that trick of "if the vhost TX ring
>>> is already stopped and would need a kick, and I only have a few packets
>>> in the batch, just write them directly to /dev/net/tun".
>>
>> That should work on low throughput.
> Indeed it works remarkably well, as I noted in my follow-up. I also
> fixed a minor stupidity where I was reading from the 'call' eventfd
> *before* doing the real work of moving packets around. And that gives
> me a few tens of microseconds back too.
>
>>> I'm wondering how that optimisation would translate to actual guests,
>>> which presumably have the same problem. Perhaps it would be an
>>> operation on the vhost fd, which ends up processing the ring right
>>> there in the context of *that* process instead of doing a wakeup?
>>
>> It might improve the latency in an ideal case but several possible issues:
>>
>> 1) this will block the vCPU from running until the send is done
>> 2) copy_from_user() may sleep, which will block the vcpu thread further
> Yes, it would block the vCPU for a short period of time, but we could
> limit that. The real win is to improve latency of single, short packets
> like a first SYN, or SYNACK. It should work fine even if it's limited
> to *one* *short* packet which *is* resident in memory.


This looks tricky, since we need to poke at both the virtqueue metadata
and the payload.

And we need to let the packet traverse the network stack, which might
add extra latency (qdiscs, eBPF, switch/OVS).

So it looks to me that it's better to use vhost_net busy polling instead
(VHOST_SET_VRING_BUSYLOOP_TIMEOUT).

Userspace can detect this feature by validating the existence of the ioctl.


>
> Although actually I'm not *overly* worried about the 'resident' part.
> For a transmit packet, especially a short one not a sendpage(), it's
> fairly likely that the guest has touched the buffer right before sending
> it, and taken the hit of faulting it in then, if necessary. If the host
> is paging out memory which is in *active* use by a guest, that guest is
> screwed anyway :)


Right, but there could be workloads that are unrelated to networking.
Blocking the vCPU thread in that case seems sub-optimal.


>
> I'm thinking of something like an ioctl on the vhost-net fd which *if*
> the thread is currently sleeping and there's a single short packet,
> processes it immediately. {Else,then} it wakes the thread just like the
> eventfd would have done. (Perhaps just by signalling the kick eventfd,
> but perhaps there's a more efficient way anyway).
>
>>> My bandwidth tests go up from 2.46Gb/s with the workarounds, to
>>> 2.50Gb/s once I enable XDP, and 2.52Gb/s when I drop the virtio-net
>>> header. But there's no way for userspace to *detect* that those bugs
>>> are fixed, which makes it hard to ship that version.
> I'm up to 2.75Gb/s now with epoll and other fixes (including using
> sendmmsg() on the other side). Since the kernel can only do *half*
> that, I'm now wondering if I really want my data plane in the kernel at
> all, which was my long-term plan :)


Good to know that.


>
>> Yes, that's sad. One possible way to advertise a VHOST_NET_TUN flag via
>> VHOST_GET_BACKEND_FEATURES?
> Strictly it isn't VHOST_NET_TUN, as that *does* work today if you pick
> the right (very non-intuitive) combination of features. The feature is
> really "VHOST_NET_TUN_WITHOUT_TUNSNDBUF_OR_UNWANTED_VNET_HEADER" :)


Yes, but it's a hint for userspace that TUN could work without any
workarounds.


>
> But we don't need a feature specifically for that; I only need to check
> for *any* feature that goes in after the fixes.
>
> Maybe if we do add a new low-latency kick then I could key on *that*
> feature to assume the bugs are fixed.
>
> Alternatively, there's still the memory map thing I need to fix before
> I can commit this in my application:
>
> #ifdef __x86_64__
> 	vmem->regions[0].guest_phys_addr = 4096;
> 	vmem->regions[0].memory_size = 0x7fffffffe000;
> 	vmem->regions[0].userspace_addr = 4096;
> #else
> #error FIXME
> #endif
> 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
>
> Perhaps if we end up with a user-visible feature to deal with that,
> then I could use the presence of *that* feature to infer that the tun
> bugs are fixed.


As we discussed before, it could be a new backend feature: VHOST_NET_SVA
(shared virtual address)?


>
> Another random thought as I stare at this... can't we handle checksums
> in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM
> on the tun device, and just do it *while* we're copying the packet to
> userspace, if userspace doesn't support it. That would be better than
> having the kernel complete the checksum in a separate pass *before*
> handing the skb to tun_net_xmit().


I'm not sure I get this, but for performance reasons we don't do any
csum in this case?


>
> We could similarly do a partial checksum in tun_get_user() and hand it
> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.


I think that's how it is expected to work (via the vnet header); see
virtio_net_hdr_to_skb().

Thanks


>
>



* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-07-01  4:13                       ` Jason Wang
@ 2021-07-01 17:39                         ` David Woodhouse
  2021-07-02  3:13                           ` Jason Wang
  0 siblings, 1 reply; 73+ messages in thread
From: David Woodhouse @ 2021-07-01 17:39 UTC (permalink / raw)
  To: Jason Wang, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin

On Thu, 2021-07-01 at 12:13 +0800, Jason Wang wrote:
> On 2021/6/30 18:02, David Woodhouse wrote:
> > On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:
> > > On 2021/6/29 18:49, David Woodhouse wrote:
> > > > So as I expected, the throughput is better with vhost-net once I get to
> > > > the point of 100% CPU usage in my main thread, because it offloads the
> > > > kernel←→user copies. But latency is somewhat worse.
> > > > 
> > > > I'm still using select() instead of epoll() which would give me a
> > > > little back — but only a little, as I only poll on 3-4 fds, and more to
> > > > the point it'll give me just as much win in the non-vhost case too, so
> > > > it won't make much difference to the vhost vs. non-vhost comparison.
> > > > 
> > > > Perhaps I really should look into that trick of "if the vhost TX ring
> > > > is already stopped and would need a kick, and I only have a few packets
> > > > in the batch, just write them directly to /dev/net/tun".
> > > 
> > > That should work on low throughput.
> > 
> > Indeed it works remarkably well, as I noted in my follow-up. I also
> > fixed a minor stupidity where I was reading from the 'call' eventfd
> > *before* doing the real work of moving packets around. And that gives
> > me a few tens of microseconds back too.
> > 
> > > > I'm wondering how that optimisation would translate to actual guests,
> > > > which presumably have the same problem. Perhaps it would be an
> > > > operation on the vhost fd, which ends up processing the ring right
> > > > there in the context of *that* process instead of doing a wakeup?
> > > 
> > > It might improve the latency in an ideal case but several possible issues:
> > > 
> > > 1) this will block the vCPU from running until the send is done
> > > 2) copy_from_user() may sleep, which will block the vcpu thread further
> > 
> > Yes, it would block the vCPU for a short period of time, but we could
> > limit that. The real win is to improve latency of single, short packets
> > like a first SYN, or SYNACK. It should work fine even if it's limited
> > to *one* *short* packet which *is* resident in memory.
> 
> 
> This looks tricky since we need to poke both the virtqueue metadata 
> and the payload.

That's OK as we'd *only* do it if the thread is quiescent anyway.

> And we need to let the packet traverse the network stack, which might add 
> extra latency (qdiscs, eBPF, switch/OVS).
> 
> So it looks to me it's better to use vhost_net busy polling instead 
> (VHOST_SET_VRING_BUSYLOOP_TIMEOUT).

Or something very similar, with the *trylock* and bailing out.
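
For the record, the busy-polling knob is per-vring; below is a minimal userspace sketch. The struct and ioctl number are reproduced from linux/vhost.h so it compiles against older headers, and the vhost fd setup is assumed to exist elsewhere:

```c
/* Sketch: enable vhost busy polling on one virtqueue via
 * VHOST_SET_VRING_BUSYLOOP_TIMEOUT.  The struct and ioctl number are
 * copied from linux/vhost.h so this builds against old headers; the
 * vhost fd must come from the caller's existing setup. */
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct vhost_vring_state_compat {
	unsigned int index;	/* virtqueue index */
	unsigned int num;	/* here: busy-poll timeout in microseconds */
};

#ifndef VHOST_SET_VRING_BUSYLOOP_TIMEOUT
#define VHOST_SET_VRING_BUSYLOOP_TIMEOUT \
	_IOW(0xAF, 0x23, struct vhost_vring_state_compat)
#endif

/* Returns 0 on success; -1 with errno set on failure. */
static int set_busyloop_timeout(int vhost_fd, unsigned int vq,
				unsigned int usecs)
{
	struct vhost_vring_state_compat state = {
		.index = vq,
		.num   = usecs,
	};

	return ioctl(vhost_fd, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, &state);
}
```

A kernel without the ioctl reports it to userspace as ENOTTY, which is one way an application can detect whether the feature exists.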

> Userspace can detect this feature by validating the existence of the ioctl.

Yep. Or if we want to get fancy, we could even offer it to the guest.
As a *different* doorbell register to poke if they want to relinquish
the physical CPU to process the packet quicker. We wouldn't even *need*
to go through userspace at all, if we put that into a separate page...
but that probably *is* overengineering it :)

> > Although actually I'm not *overly* worried about the 'resident' part.
> > For a transmit packet, especially a short one not a sendpage(), it's
> > fairly likely that the guest has touched the buffer right before sending
> > it. And taken the hit of faulting it in then, if necessary. If the host
> > is paging out memory which is in *active* use by a guest, that guest is
> > screwed anyway :)
> 
> 
> Right, but there could be workloads that are unrelated to networking. 
> Blocking the vCPU thread in this case seems sub-optimal.
> 

Right, but the VMM (or the guest, if we're letting the guest choose)
wouldn't have to use it for those cases.

> > Alternatively, there's still the memory map thing I need to fix before
> > I can commit this in my application:
> > 
> > #ifdef __x86_64__
> > 	vmem->regions[0].guest_phys_addr = 4096;
> > 	vmem->regions[0].memory_size = 0x7fffffffe000;
> > 	vmem->regions[0].userspace_addr = 4096;
> > #else
> > #error FIXME
> > #endif
> > 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
> > 
> > Perhaps if we end up with a user-visible feature to deal with that,
> > then I could use the presence of *that* feature to infer that the tun
> > bugs are fixed.
> 
> 
> As we discussed before it could be a new backend feature. VHOST_NET_SVA 
> (shared virtual address)?

Yeah, I'll take a look.
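
As a sketch of what that probing could look like: VHOST_GET_BACKEND_FEATURES is a real ioctl, but the VHOST_NET_F_SVA bit below is purely hypothetical, made up here for illustration since no such feature has been allocated:

```c
/* Probe a vhost fd for a hypothetical backend feature bit.
 * VHOST_GET_BACKEND_FEATURES exists in linux/vhost.h; the SVA bit
 * number is invented for this sketch. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#ifndef VHOST_GET_BACKEND_FEATURES
#define VHOST_GET_BACKEND_FEATURES _IOR(0xAF, 0x26, uint64_t)
#endif

#define VHOST_NET_F_SVA 63	/* hypothetical bit, not allocated upstream */

static int backend_has_feature(int vhost_fd, unsigned int bit)
{
	uint64_t features = 0;

	/* Old kernels lack the ioctl entirely; treat that as "no". */
	if (ioctl(vhost_fd, VHOST_GET_BACKEND_FEATURES, &features) < 0)
		return 0;
	return !!(features & (1ULL << bit));
}
```

An application could then use a positive probe as the signal that the tun/vhost fixes discussed in this thread are present in the running kernel.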

> > Another random thought as I stare at this... can't we handle checksums
> > in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM
> > on the tun device, and just do it *while* we're copying the packet to
> > userspace, if userspace doesn't support it. That would be better than
> > having the kernel complete the checksum in a separate pass *before*
> > handing the skb to tun_net_xmit().
> 
> 
> I'm not sure I get this, but for performance reasons we don't do any csum 
> in this case?

I think we have to; the packets can't leave the box without a valid
checksum. If the skb isn't CHECKSUM_COMPLETE at the time it's handed
off to the ->hard_start_xmit of a netdev which doesn't advertise
hardware checksum support, the network stack will do it manually in an
extra pass.

Which is kind of silly if the tun device is going to do a pass over all
the data *anyway* as it copies it up to userspace. Even in the normal
case without vhost-net.
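
To make the one-pass idea concrete, here is a userspace sketch of a copy-and-checksum primitive, the analogue of the kernel's csum-and-copy helpers. The names are illustrative, not kernel code:

```c
/* One-pass copy-and-checksum: accumulate the 16-bit ones'-complement
 * Internet checksum while the bytes are being copied, instead of a
 * second pass over the data. */
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes and return the unfolded 32-bit sum of 16-bit words. */
static uint32_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		sum += ((uint32_t)src[i] << 8) | src[i + 1];
	}
	if (i < len) {			/* odd trailing byte, zero-padded */
		dst[i] = src[i];
		sum += (uint32_t)src[i] << 8;
	}
	return sum;
}

/* Fold the 32-bit accumulator into the final 16-bit checksum. */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The point is simply that the addition rides along on data the copy loop is already touching, so the separate checksum pass (and its cache misses) disappears.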

> > We could similarly do a partial checksum in tun_get_user() and hand it
> > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> 
> 
> I think that's how it is expected to work (via vnet header), see 
> virtio_net_hdr_to_skb().

But only if the "guest" supports it; it doesn't get handled by the tun
device. It *could*, since we already have the helpers to checksum *as*
we copy to/from userspace.

It doesn't help for me to advertise that I support TX checksums in
userspace because I'd have to do an extra pass for that. I only do one
pass over the data as I encrypt it, and in many block cipher modes the
encryption of the early blocks affects the IV for the subsequent
blocks... so I can't just go back and "fix" the checksum at the start
of the packet, once I'm finished.

So doing the checksum as the packet is copied up to userspace would be
very useful.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-07-01 17:39                         ` David Woodhouse
@ 2021-07-02  3:13                           ` Jason Wang
  2021-07-02  8:08                             ` David Woodhouse
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Wang @ 2021-07-02  3:13 UTC (permalink / raw)
  To: David Woodhouse, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin


On 2021/7/2 1:39 AM, David Woodhouse wrote:
> On Thu, 2021-07-01 at 12:13 +0800, Jason Wang wrote:
>> On 2021/6/30 6:02 PM, David Woodhouse wrote:
>>> On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:
>>>> On 2021/6/29 6:49 PM, David Woodhouse wrote:
>>>>> So as I expected, the throughput is better with vhost-net once I get to
>>>>> the point of 100% CPU usage in my main thread, because it offloads the
>>>>> kernel←→user copies. But latency is somewhat worse.
>>>>>
>>>>> I'm still using select() instead of epoll() which would give me a
>>>>> little back — but only a little, as I only poll on 3-4 fds, and more to
>>>>> the point it'll give me just as much win in the non-vhost case too, so
>>>>> it won't make much difference to the vhost vs. non-vhost comparison.
>>>>>
>>>>> Perhaps I really should look into that trick of "if the vhost TX ring
>>>>> is already stopped and would need a kick, and I only have a few packets
>>>>> in the batch, just write them directly to /dev/net/tun".
>>>> That should work at low throughput.
>>> Indeed it works remarkably well, as I noted in my follow-up. I also
>>> fixed a minor stupidity where I was reading from the 'call' eventfd
>>> *before* doing the real work of moving packets around. And that gives
>>> me a few tens of microseconds back too.
>>>
>>>>> I'm wondering how that optimisation would translate to actual guests,
>>>>> which presumably have the same problem. Perhaps it would be an
>>>>> operation on the vhost fd, which ends up processing the ring right
>>>>> there in the context of *that* process instead of doing a wakeup?
>>>> It might improve the latency in an ideal case but several possible issues:
>>>>
>>>> 1) this will block the vCPU from running until the send is done
>>>> 2) copy_from_user() may sleep, which will block the vCPU thread further
>>> Yes, it would block the vCPU for a short period of time, but we could
>>> limit that. The real win is to improve latency of single, short packets
>>> like a first SYN, or SYNACK. It should work fine even if it's limited
>>> to *one* *short* packet which *is* resident in memory.
>>
>> This looks tricky since we need to poke both the virtqueue metadata
>> and the payload.
> That's OK as we'd *only* do it if the thread is quiescent anyway.
>
>> And we need to let the packet traverse the network stack, which might add
>> extra latency (qdiscs, eBPF, switch/OVS).
>>
>> So it looks to me it's better to use vhost_net busy polling instead
>> (VHOST_SET_VRING_BUSYLOOP_TIMEOUT).
> Or something very similar, with the *trylock* and bailing out.
>
>> Userspace can detect this feature by validating the existence of the ioctl.
> Yep. Or if we want to get fancy, we could even offer it to the guest.
> As a *different* doorbell register to poke if they want to relinquish
> the physical CPU to process the packet quicker. We wouldn't even *need*
> to go through userspace at all, if we put that into a separate page...
> but that probably *is* overengineering it :)


Yes. Actually, it would make virtio a PV driver, which requires an 
architectural way to detect the existence of those kinds of doorbells.

This seems to contradict the direction virtio wants to take: a general 
device/driver interface which is not limited to the world of virtualization.


>
>>> Although actually I'm not *overly* worried about the 'resident' part.
>>> For a transmit packet, especially a short one not a sendpage(), it's
>>> fairly likely that the guest has touched the buffer right before sending
>>> it. And taken the hit of faulting it in then, if necessary. If the host
>>> is paging out memory which is in *active* use by a guest, that guest is
>>> screwed anyway :)
>>
>> Right, but there could be workloads that are unrelated to networking.
>> Blocking the vCPU thread in this case seems sub-optimal.
>>
> Right, but the VMM (or the guest, if we're letting the guest choose)
> wouldn't have to use it for those cases.


I'm not sure I get it here. If so, simply writing to the TUN directly would work.


>
>>> Alternatively, there's still the memory map thing I need to fix before
>>> I can commit this in my application:
>>>
>>> #ifdef __x86_64__
>>> 	vmem->regions[0].guest_phys_addr = 4096;
>>> 	vmem->regions[0].memory_size = 0x7fffffffe000;
>>> 	vmem->regions[0].userspace_addr = 4096;
>>> #else
>>> #error FIXME
>>> #endif
>>> 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
>>>
>>> Perhaps if we end up with a user-visible feature to deal with that,
>>> then I could use the presence of *that* feature to infer that the tun
>>> bugs are fixed.
>>
>> As we discussed before it could be a new backend feature. VHOST_NET_SVA
>> (shared virtual address)?
> Yeah, I'll take a look.
>
>>> Another random thought as I stare at this... can't we handle checksums
>>> in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM
>>> on the tun device, and just do it *while* we're copying the packet to
>>> userspace, if userspace doesn't support it. That would be better than
>>> having the kernel complete the checksum in a separate pass *before*
>>> handing the skb to tun_net_xmit().
>>
>> I'm not sure I get this, but for performance reasons we don't do any csum
>> in this case?
> I think we have to; the packets can't leave the box without a valid
> checksum. If the skb isn't CHECKSUM_COMPLETE at the time it's handed
> off to the ->hard_start_xmit of a netdev which doesn't advertise
> hardware checksum support, the network stack will do it manually in an
> extra pass.


Yes.


>
> Which is kind of silly if the tun device is going to do a pass over all
> the data *anyway* as it copies it up to userspace. Even in the normal
> case without vhost-net.


I think the design is to delay the tx checksum as much as possible:

1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX

E.g. if checksum offload is supported along all of those paths, we don't 
need any software checksum at all. And if any part is not capable of 
doing the checksum, the checksum will be done by the networking core 
before calling the hard_start_xmit of that device.


>
>>> We could similarly do a partial checksum in tun_get_user() and hand it
>>> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
>>
>> I think that's how it is expected to work (via vnet header), see
>> virtio_net_hdr_to_skb().
> But only if the "guest" supports it; it doesn't get handled by the tun
> device. It *could*, since we already have the helpers to checksum *as*
> we copy to/from userspace.
>
> It doesn't help for me to advertise that I support TX checksums in
> userspace because I'd have to do an extra pass for that. I only do one
> pass over the data as I encrypt it, and in many block cipher modes the
> encryption of the early blocks affects the IV for the subsequent
> blocks... so I can't just go back and "fix" the checksum at the start
> of the packet, once I'm finished.
>
> So doing the checksum as the packet is copied up to userspace would be
> very useful.


I think I get this, but it requires a new TUN feature (and maybe making 
it userspace-controllable via tun_set_csum()).

Thanks



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-07-02  3:13                           ` Jason Wang
@ 2021-07-02  8:08                             ` David Woodhouse
  2021-07-02  8:50                               ` Jason Wang
  2021-07-09 15:04                               ` Eugenio Perez Martin
  0 siblings, 2 replies; 73+ messages in thread
From: David Woodhouse @ 2021-07-02  8:08 UTC (permalink / raw)
  To: Jason Wang, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin

[-- Attachment #1: Type: text/plain, Size: 4895 bytes --]

On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
> On 2021/7/2 1:39 AM, David Woodhouse wrote:
> > 
> > Right, but the VMM (or the guest, if we're letting the guest choose)
> > wouldn't have to use it for those cases.
> 
> 
> I'm not sure I get it here. If so, simply writing to the TUN directly would work.

As noted, that works nicely for me in OpenConnect; I just write it to
the tun device *instead* of putting it in the vring. My TX latency is
now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
vhost thread, wakes guest).
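
The bypass heuristic reads roughly like this as application-side C. The struct fields and the SHORT_PKT cutoff are invented for illustration; this is not OpenConnect's actual code:

```c
/* Sketch of the latency trick: if the vhost worker is idle (the TX ring
 * would need a kick anyway) and we only have one short packet, write()
 * it straight to the tun fd instead of queuing it in the vring and
 * paying for a scheduler wakeup. */
#include <stddef.h>
#include <unistd.h>

#define SHORT_PKT 256		/* arbitrary "worth bypassing" cutoff */

struct tx_path {
	int tun_fd;		/* /dev/net/tun descriptor */
	int ring_needs_kick;	/* vhost worker is asleep and must be woken */
};

/* Returns bytes written on the fast path, or -1 meaning "use the vring". */
static ssize_t send_packet(struct tx_path *tx, const void *pkt, size_t len,
			   unsigned int batch)
{
	if (tx->ring_needs_kick && batch <= 1 && len <= SHORT_PKT)
		return write(tx->tun_fd, pkt, len);

	/* Slow path: enqueue into the vring and kick as usual (not shown). */
	return -1;
}
```

The trade-off is exactly the one discussed above: the copy happens synchronously in the caller, but for a single SYN-sized packet that is cheaper than waking the vhost thread.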

But it's not clear to me that a VMM could use it. Because the guest has
already put that packet *into* the vring. Now if the VMM is in the path
of all wakeups for that vring, I suppose we *might* be able to contrive
some hackish way to be 'sure' that the kernel isn't servicing it, so we
could try to 'steal' that packet from the ring in order to send it
directly... but no. That's awful :)

I do think it'd be interesting to look at a way to reduce the latency
of the vring wakeup especially for that case of a virtio-net guest with
a single small packet to send. But realistically speaking, I'm unlikely
to get to it any time soon except for showing the numbers with the
userspace equivalent and observing that there's probably a similar win
to be had for guests too.

In the short term, I should focus on what we want to do to finish off
my existing fixes. Did we have a consensus on whether to bother
supporting PI? As I said, I'm mildly inclined to do so just because it
mostly comes out in the wash as we fix everything else, and making it
gracefully *refuse* that setup reliably is just as hard.

I think I'll try to make the vhost-net code much more resilient to
finding that tun_recvmsg() returns a header other than the sock_hlen it
expects, and see how much still actually needs "fixing" if we can do
that.


> I think the design is to delay the tx checksum as much as possible:
> 
> 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
> 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX
> 
> E.g. if checksum offload is supported along all of those paths, we don't 
> need any software checksum at all. And if any part is not capable of 
> doing the checksum, the checksum will be done by the networking core 
> before calling the hard_start_xmit of that device.

Right, but in *any* case where the 'device' is going to memcpy the data
around (like tun_put_user() does), it's a waste of time having the
networking core do a *separate* pass over the data just to checksum it.

> > > > We could similarly do a partial checksum in tun_get_user() and hand it
> > > > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> > > 
> > > I think that's how it is expected to work (via vnet header), see
> > > virtio_net_hdr_to_skb().
> > 
> > But only if the "guest" supports it; it doesn't get handled by the tun
> > device. It *could*, since we already have the helpers to checksum *as*
> > we copy to/from userspace.
> > 
> > It doesn't help for me to advertise that I support TX checksums in
> > userspace because I'd have to do an extra pass for that. I only do one
> > pass over the data as I encrypt it, and in many block cipher modes the
> > encryption of the early blocks affects the IV for the subsequent
> > blocks... so I can't just go back and "fix" the checksum at the start
> > of the packet, once I'm finished.
> > 
> > So doing the checksum as the packet is copied up to userspace would be
> > very useful.
> 
> 
> I think I get this, but it requires a new TUN feature (and maybe making 
> it userspace-controllable via tun_set_csum()).

I don't think it's visible to userspace at all; it's purely between the
tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
regardless of what the user can cope with. And if the user *didn't*
support checksum offload then tun will transparently do the checksum
*during* the copy_to_iter() (in either tun_put_user_xdp() or
tun_put_user()).

Userspace sees precisely what it did before. If it doesn't support
checksum offload then it gets a pre-checksummed packet just as before.
It's just that the kernel will do that checksum *while* it's already
touching the data as it copies it to userspace, instead of in a
separate pass.

Although actually, for my *benchmark* case with iperf3 sending UDP, I
spotted in the perf traces that we actually do the checksum as we're
copying from userspace in the udp_sendmsg() call. There's a check in
__ip_append_data() which looks to see if the destination device has
HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
are definitely use cases which *don't* have that kind of optimisation
though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
So I think it's worth looking at.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-07-02  8:08                             ` David Woodhouse
@ 2021-07-02  8:50                               ` Jason Wang
  2021-07-09 15:04                               ` Eugenio Perez Martin
  1 sibling, 0 replies; 73+ messages in thread
From: Jason Wang @ 2021-07-02  8:50 UTC (permalink / raw)
  To: David Woodhouse, netdev
  Cc: Eugenio Pérez, Willem de Bruijn, Michael S.Tsirkin


On 2021/7/2 4:08 PM, David Woodhouse wrote:
> On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
>> On 2021/7/2 1:39 AM, David Woodhouse wrote:
>>> Right, but the VMM (or the guest, if we're letting the guest choose)
>>> wouldn't have to use it for those cases.
>>
>> I'm not sure I get it here. If so, simply writing to the TUN directly would work.
> As noted, that works nicely for me in OpenConnect; I just write it to
> the tun device *instead* of putting it in the vring. My TX latency is
> now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
> vhost thread, wakes guest).


Note that busy polling is used by KVM to improve latency as well. 
It is enabled by default, if I'm not wrong.


>
> But it's not clear to me that a VMM could use it. Because the guest has
> already put that packet *into* the vring. Now if the VMM is in the path
> of all wakeups for that vring, I suppose we *might* be able to contrive
> some hackish way to be 'sure' that the kernel isn't servicing it, so we
> could try to 'steal' that packet from the ring in order to send it
> directly... but no. That's awful :)


Yes.


>
> I do think it'd be interesting to look at a way to reduce the latency
> of the vring wakeup especially for that case of a virtio-net guest with
> a single small packet to send. But realistically speaking, I'm unlikely
> to get to it any time soon except for showing the numbers with the
> userspace equivalent and observing that there's probably a similar win
> to be had for guests too.
>
> In the short term, I should focus on what we want to do to finish off
> my existing fixes.


I think so.


> Did we have a consensus on whether to bother
> supporting PI?


Michael, any thought on this?


>   As I said, I'm mildly inclined to do so just because it
> mostly comes out in the wash as we fix everything else, and making it
> gracefully *refuse* that setup reliably is just as hard.
>
> I think I'll try to make the vhost-net code much more resilient to
> finding that tun_recvmsg() returns a header other than the sock_hlen it
> expects, and see how much still actually needs "fixing" if we can do
> that.


Let's see how well it goes.


>
>
>> I think the design is to delay the tx checksum as much as possible:
>>
>> 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
>> 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX
>>
>> E.g. if checksum offload is supported along all of those paths, we don't
>> need any software checksum at all. And if any part is not capable of
>> doing the checksum, the checksum will be done by the networking core
>> before calling the hard_start_xmit of that device.
> Right, but in *any* case where the 'device' is going to memcpy the data
> around (like tun_put_user() does), it's a waste of time having the
> networking core do a *separate* pass over the data just to checksum it.


See below.


>
>>>>> We could similarly do a partial checksum in tun_get_user() and hand it
>>>>> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
>>>> I think that's how it is expected to work (via vnet header), see
>>>> virtio_net_hdr_to_skb().
>>> But only if the "guest" supports it; it doesn't get handled by the tun
>>> device. It *could*, since we already have the helpers to checksum *as*
>>> we copy to/from userspace.
>>>
>>> It doesn't help for me to advertise that I support TX checksums in
>>> userspace because I'd have to do an extra pass for that. I only do one
>>> pass over the data as I encrypt it, and in many block cipher modes the
>>> encryption of the early blocks affects the IV for the subsequent
>>> blocks... so I can't just go back and "fix" the checksum at the start
>>> of the packet, once I'm finished.
>>>
>>> So doing the checksum as the packet is copied up to userspace would be
>>> very useful.
>>
>> I think I get this, but it requires a new TUN feature (and maybe making
>> it userspace-controllable via tun_set_csum()).
> I don't think it's visible to userspace at all; it's purely between the
> tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
> regardless of what the user can cope with. And if the user *didn't*
> support checksum offload then tun will transparently do the checksum
> *during* the copy_to_iter() (in either tun_put_user_xdp() or
> tun_put_user()).
>
> Userspace sees precisely what it did before. If it doesn't support
> checksum offload then it gets a pre-checksummed packet just as before.
> It's just that the kernel will do that checksum *while* it's already
> touching the data as it copies it to userspace, instead of in a
> separate pass.


So I think I get what you meant:

1) Don't disable NETIF_F_HW_CSUM in tun_set_csum() even if userspace 
clears TUN_F_CSUM.
2) Use the csum iov iterator helper in tun_put_user() and tun_put_user_xdp()

It may help performance, since we get better cache locality when 
userspace doesn't support checksum offload.

But in this case we need to know whether userspace can do the checksum 
offload, which we didn't need to care about previously (via NETIF_F_HW_CSUM).

And we probably need to sync with tun_set_offload().


>
> Although actually, for my *benchmark* case with iperf3 sending UDP, I
> spotted in the perf traces that we actually do the checksum as we're
> copying from userspace in the udp_sendmsg() call. There's a check in
> __ip_append_data() which looks to see if the destination device has
> HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
> are definitely use cases which *don't* have that kind of optimisation
> though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
> So I think it's worth looking at.


Yes.

Thanks


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves
  2021-07-02  8:08                             ` David Woodhouse
  2021-07-02  8:50                               ` Jason Wang
@ 2021-07-09 15:04                               ` Eugenio Perez Martin
  1 sibling, 0 replies; 73+ messages in thread
From: Eugenio Perez Martin @ 2021-07-09 15:04 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jason Wang, netdev, Willem de Bruijn, Michael S.Tsirkin

On Fri, Jul 2, 2021 at 10:08 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
> > On 2021/7/2 1:39 AM, David Woodhouse wrote:
> > >
> > > Right, but the VMM (or the guest, if we're letting the guest choose)
> > > wouldn't have to use it for those cases.
> >
> >
> > I'm not sure I get it here. If so, simply writing to the TUN directly would work.
>
> As noted, that works nicely for me in OpenConnect; I just write it to
> the tun device *instead* of putting it in the vring. My TX latency is
> now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
> vhost thread, wakes guest).
>

Maybe we can do a small test to see the effect of warming up the userland?
* Make vhost write the irqfd BEFORE adding the packet to the ring, not after.
* Make userland (I think your selftest would be fine for this) spin
reading the used idx until it sees at least one buffer.

I think this introduces races in the general virtio ring management
but should work well for testing. Any thoughts?

We could also check what happens if we burn the userland CPU checking
for used_idx with notifications disabled, and see if it is worth
continuing to shave latency in that direction :).
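
A sketch of the userspace half of that experiment, assuming a mapped virtio 1.x split ring; the shared-memory setup and the bounded-spin fallback to the call eventfd are omitted:

```c
/* Spin-read the used ring's idx field instead of sleeping on the call
 * eventfd: burn CPU until vhost publishes at least one new used entry.
 * Layout is the virtio 1.x split used ring; mapping it is the caller's
 * job. */
#include <stdint.h>
#include <stdatomic.h>

struct vring_used_elem {
	uint32_t id;
	uint32_t len;
};

struct vring_used {
	uint16_t flags;
	uint16_t idx;			/* written by the device (vhost) */
	struct vring_used_elem ring[];
};

/* Returns the new idx once it moves past last_seen. */
static uint16_t spin_for_used(volatile struct vring_used *used,
			      uint16_t last_seen)
{
	uint16_t idx;

	while ((idx = used->idx) == last_seen)
		;	/* a real version would bound the spin and fall back */

	/* Pair with the device's write barrier before it bumps idx. */
	atomic_thread_fence(memory_order_acquire);
	return idx;
}
```

This is only a measurement aid: it shows the wakeup cost in isolation, at the price of a pinned, fully-busy CPU.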

> But it's not clear to me that a VMM could use it. Because the guest has
> already put that packet *into* the vring. Now if the VMM is in the path
> of all wakeups for that vring, I suppose we *might* be able to contrive
> some hackish way to be 'sure' that the kernel isn't servicing it, so we
> could try to 'steal' that packet from the ring in order to send it
> directly... but no. That's awful :)
>
> I do think it'd be interesting to look at a way to reduce the latency
> of the vring wakeup especially for that case of a virtio-net guest with
> a single small packet to send. But realistically speaking, I'm unlikely
> to get to it any time soon except for showing the numbers with the
> userspace equivalent and observing that there's probably a similar win
> to be had for guests too.
>
> In the short term, I should focus on what we want to do to finish off
> my existing fixes. Did we have a consensus on whether to bother
> supporting PI? As I said, I'm mildly inclined to do so just because it
> mostly comes out in the wash as we fix everything else, and making it
> gracefully *refuse* that setup reliably is just as hard.
>
> I think I'll try to make the vhost-net code much more resilient to
> finding that tun_recvmsg() returns a header other than the sock_hlen it
> expects, and see how much still actually needs "fixing" if we can do
> that.
>
>
> > I think the design is to delay the tx checksum as much as possible:
> >
> > 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
> > 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX
> >
> > E.g. if checksum offload is supported along all of those paths, we don't
> > need any software checksum at all. And if any part is not capable of
> > doing the checksum, the checksum will be done by the networking core
> > before calling the hard_start_xmit of that device.
>
> Right, but in *any* case where the 'device' is going to memcpy the data
> around (like tun_put_user() does), it's a waste of time having the
> networking core do a *separate* pass over the data just to checksum it.
>
> > > > > We could similarly do a partial checksum in tun_get_user() and hand it
> > > > > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> > > >
> > > > I think that's how it is expected to work (via vnet header), see
> > > > virtio_net_hdr_to_skb().
> > >
> > > But only if the "guest" supports it; it doesn't get handled by the tun
> > > device. It *could*, since we already have the helpers to checksum *as*
> > > we copy to/from userspace.
> > >
> > > It doesn't help for me to advertise that I support TX checksums in
> > > userspace because I'd have to do an extra pass for that. I only do one
> > > pass over the data as I encrypt it, and in many block cipher modes the
> > > encryption of the early blocks affects the IV for the subsequent
> > > blocks... so I can't just go back and "fix" the checksum at the start
> > > of the packet, once I'm finished.
> > >
> > > So doing the checksum as the packet is copied up to userspace would be
> > > very useful.
> >
> >
> > I think I get this, but it requires a new TUN feature (and maybe making
> > it userspace-controllable via tun_set_csum()).
>
> I don't think it's visible to userspace at all; it's purely between the
> tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
> regardless of what the user can cope with. And if the user *didn't*
> support checksum offload then tun will transparently do the checksum
> *during* the copy_to_iter() (in either tun_put_user_xdp() or
> tun_put_user()).
>
> Userspace sees precisely what it did before. If it doesn't support
> checksum offload then it gets a pre-checksummed packet just as before.
> It's just that the kernel will do that checksum *while* it's already
> touching the data as it copies it to userspace, instead of in a
> separate pass.
>
> Although actually, for my *benchmark* case with iperf3 sending UDP, I
> spotted in the perf traces that we actually do the checksum as we're
> copying from userspace in the udp_sendmsg() call. There's a check in
> __ip_append_data() which looks to see if the destination device has
> HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
> are definitely use cases which *don't* have that kind of optimisation
> though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
> So I think it's worth looking at.
>


^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2021-07-09 15:05 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-19 13:33 [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
2021-06-21  7:00 ` Jason Wang
2021-06-21 10:52   ` David Woodhouse
2021-06-21 14:50     ` David Woodhouse
2021-06-21 20:43       ` David Woodhouse
2021-06-22  4:52         ` Jason Wang
2021-06-22  7:24           ` David Woodhouse
2021-06-22  7:51             ` Jason Wang
2021-06-22  8:10               ` David Woodhouse
2021-06-22 11:36               ` David Woodhouse
2021-06-22  4:34       ` Jason Wang
2021-06-22  4:34     ` Jason Wang
2021-06-22  7:28       ` David Woodhouse
2021-06-22  8:00         ` Jason Wang
2021-06-22  8:29           ` David Woodhouse
2021-06-23  3:39             ` Jason Wang
2021-06-24 12:39               ` David Woodhouse
2021-06-22 16:15 ` [PATCH v2 1/4] " David Woodhouse
2021-06-22 16:15   ` [PATCH v2 2/4] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
2021-06-23  3:46     ` Jason Wang
2021-06-22 16:15   ` [PATCH v2 3/4] vhost_net: validate virtio_net_hdr only if it exists David Woodhouse
2021-06-23  3:48     ` Jason Wang
2021-06-22 16:15   ` [PATCH v2 4/4] vhost_net: Add self test with tun device David Woodhouse
2021-06-23  4:02     ` Jason Wang
2021-06-23 16:12       ` David Woodhouse
2021-06-24  6:12         ` Jason Wang
2021-06-24 10:42           ` David Woodhouse
2021-06-25  2:55             ` Jason Wang
2021-06-25  7:54               ` David Woodhouse
2021-06-23  3:45   ` [PATCH v2 1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode Jason Wang
2021-06-23  8:30     ` David Woodhouse
2021-06-23 13:52     ` David Woodhouse
2021-06-23 17:31       ` David Woodhouse
2021-06-23 22:52         ` David Woodhouse
2021-06-24  6:37           ` Jason Wang
2021-06-24  7:23             ` David Woodhouse
2021-06-24  6:18       ` Jason Wang
2021-06-24  7:05         ` David Woodhouse
2021-06-24 12:30 ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() David Woodhouse
2021-06-24 12:30   ` [PATCH v3 2/5] net: tun: don't assume IFF_VNET_HDR in tun_xdp_one() tx path David Woodhouse
2021-06-25  6:58     ` Jason Wang
2021-06-24 12:30   ` [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves David Woodhouse
2021-06-25  7:33     ` Jason Wang
2021-06-25  8:37       ` David Woodhouse
2021-06-28  4:23         ` Jason Wang
2021-06-28 11:23           ` David Woodhouse
2021-06-28 23:29             ` David Woodhouse
2021-06-29  3:43               ` Jason Wang
2021-06-29  6:59                 ` David Woodhouse
2021-06-29 10:49                 ` David Woodhouse
2021-06-29 13:15                   ` David Woodhouse
2021-06-30  4:39                   ` Jason Wang
2021-06-30 10:02                     ` David Woodhouse
2021-07-01  4:13                       ` Jason Wang
2021-07-01 17:39                         ` David Woodhouse
2021-07-02  3:13                           ` Jason Wang
2021-07-02  8:08                             ` David Woodhouse
2021-07-02  8:50                               ` Jason Wang
2021-07-09 15:04                               ` Eugenio Perez Martin
2021-06-29  3:21             ` Jason Wang
2021-06-24 12:30   ` [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode David Woodhouse
2021-06-25  7:41     ` Jason Wang
2021-06-25  8:51       ` David Woodhouse
2021-06-28  4:27         ` Jason Wang
2021-06-28 10:43           ` David Woodhouse
2021-06-25 18:43     ` Willem de Bruijn
2021-06-25 19:00       ` David Woodhouse
2021-06-24 12:30   ` [PATCH v3 5/5] vhost_net: Add self test with tun device David Woodhouse
2021-06-25  5:00   ` [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket() Jason Wang
2021-06-25  8:23     ` David Woodhouse
2021-06-28  4:22       ` Jason Wang
2021-06-25 18:13   ` Willem de Bruijn
2021-06-25 18:55     ` David Woodhouse
