All of lore.kernel.org
* [PATCH 0/2] net/packet: support of specifying virtio net header size
@ 2023-02-09 12:43 沈安琪(凛玥)
  2023-02-09 12:43 ` [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz 沈安琪(凛玥)
  2023-02-09 12:43 ` [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz 沈安琪(凛玥)
  0 siblings, 2 replies; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-09 12:43 UTC (permalink / raw)
  To: netdev
  Cc: willemdebruijn.kernel, mst, davem, jasowang,
	沈安琪(凛玥)

A raw socket, like a tap device, can be used as the backend for kernel
vhost. The raw socket code currently hardcodes the virtio net header
size to sizeof(struct virtio_net_hdr), which is 10 bytes; however, this
is not always correct: some virtio features, such as mrg_rxbuf and
VERSION_1, require a 12-byte virtio net header.

These virtio features are worth supporting: for example, without the
mrg_rxbuf feature, packets larger than one mbuf are dropped in the
vhost worker's handle_rx; large packets cannot be avoided, and
increasing the mbuf size is not economical.

With these virtio features enabled, a raw socket with a hardcoded
10-byte virtio net header parses the mac header incorrectly in
packet_snd, treating the last two bytes of the virtio net header as
part of the mac header as well. The mis-parsed mac header then causes
the packet to be dropped by the Ethernet header validation when the
lower-layer device receives it.

By adding a new field vnet_hdr_sz to packet_sock to record the virtio
net header size in use, and a new sockopt PACKET_VNET_HDR_SZ to set it,
the packet_sock knows the exact virtio net header length that the
virtio user supplies. packet_snd, tpacket_snd and packet_recvmsg then
read the exact vnet_hdr_sz from the corresponding packet_sock instead
of using a hardcoded size, and parse the mac header at the correct
offset based on this information.

Anqi Shen (2):
  net/packet: add socketopt to set/get vnet_hdr_sz
  net/packet: send and receive pkt with given vnet_hdr_sz

 include/uapi/linux/if_packet.h |  1 +
 net/packet/af_packet.c         | 82 ++++++++++++++++++++++++++++++++++--------
 net/packet/internal.h          |  3 +-
 3 files changed, 70 insertions(+), 16 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz
  2023-02-09 12:43 [PATCH 0/2] net/packet: support of specifying virtio net header size 沈安琪(凛玥)
@ 2023-02-09 12:43 ` 沈安琪(凛玥)
  2023-02-09 14:45   ` Willem de Bruijn
  2023-02-09 12:43 ` [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz 沈安琪(凛玥)
  1 sibling, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-09 12:43 UTC (permalink / raw)
  To: netdev
  Cc: willemdebruijn.kernel, mst, davem, jasowang,
	谈鉴锋,
	沈安琪(凛玥)

From: "Jianfeng Tan" <henry.tjf@antgroup.com>

Raw socket can be used as the backend for kernel vhost, like tap.
However, the current raw socket implementation uses a hardcoded virtio
net header length, which causes incorrect mac header parsing when
virtio features that require a virtio net header other than 10 bytes
are used.

Add a new field vnet_hdr_sz to packet_sock to record the virtio net
header size the socket should use, and a new sockopt
PACKET_VNET_HDR_SZ that lets userspace set that size, so that the raw
socket knows the exact virtio net header size to use instead of a
hardcoded one, avoiding incorrect header parsing.

Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
---
 include/uapi/linux/if_packet.h |  1 +
 net/packet/af_packet.c         | 34 ++++++++++++++++++++++++++++++++++
 net/packet/internal.h          |  3 ++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 78c981d..9efc423 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -59,6 +59,7 @@ struct sockaddr_ll {
 #define PACKET_ROLLOVER_STATS		21
 #define PACKET_FANOUT_DATA		22
 #define PACKET_IGNORE_OUTGOING		23
+#define PACKET_VNET_HDR_SZ		24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8ffb19c..8389f18 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3936,11 +3936,42 @@ static void packet_flush_mclist(struct sock *sk)
 			ret = -EBUSY;
 		} else {
 			po->has_vnet_hdr = !!val;
+			/* set vnet_hdr_sz to default value */
+			if (po->has_vnet_hdr)
+				po->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
+			else
+				po->vnet_hdr_sz = 0;
 			ret = 0;
 		}
 		release_sock(sk);
 		return ret;
 	}
+	case PACKET_VNET_HDR_SZ:
+	{
+		int val;
+
+		if (sock->type != SOCK_RAW)
+			return -EINVAL;
+		if (optlen < sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		lock_sock(sk);
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) {
+			ret = -EBUSY;
+		} else {
+			if (val == sizeof(struct virtio_net_hdr) ||
+			    val == sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
+				po->vnet_hdr_sz = val;
+				ret = 0;
+			} else {
+				ret = -EINVAL;
+			}
+		}
+		release_sock(sk);
+		return ret;
+	}
 	case PACKET_TIMESTAMP:
 	{
 		int val;
@@ -4070,6 +4101,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_VNET_HDR:
 		val = po->has_vnet_hdr;
 		break;
+	case PACKET_VNET_HDR_SZ:
+		val = po->vnet_hdr_sz;
+		break;
 	case PACKET_VERSION:
 		val = po->tp_version;
 		break;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 48af35b..e27b47d 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -121,7 +121,8 @@ struct packet_sock {
 				origdev:1,
 				has_vnet_hdr:1,
 				tp_loss:1,
-				tp_tx_has_off:1;
+				tp_tx_has_off:1,
+				vnet_hdr_sz:8;	/* vnet header size to use */
 	int			pressure;
 	int			ifindex;	/* bound device		*/
 	__be16			num;
-- 
1.8.3.1



* [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-09 12:43 [PATCH 0/2] net/packet: support of specifying virtio net header size 沈安琪(凛玥)
  2023-02-09 12:43 ` [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz 沈安琪(凛玥)
@ 2023-02-09 12:43 ` 沈安琪(凛玥)
  2023-02-09 13:07   ` Michael S. Tsirkin
  1 sibling, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-09 12:43 UTC (permalink / raw)
  To: netdev
  Cc: willemdebruijn.kernel, mst, davem, jasowang,
	谈鉴锋,
	沈安琪(凛玥)

From: "Jianfeng Tan" <henry.tjf@antgroup.com>

When a raw socket is used as the backend for kernel vhost, it
currently assumes a 10-byte virtio net header, which is not always
correct: some virtio features, such as mrg_rxbuf and VERSION_1,
require a 12-byte virtio net header.

Instead of hardcoding the virtio net header length to 10 bytes,
tpacket_snd, tpacket_rcv, packet_snd and packet_recvmsg now use the
virtio net header size recorded in packet_sock, which reflects the
size the virtio user actually prepends to the packets. This fixes the
incorrect mac header parsing when virtio features that require a
non-10-byte virtio net header are enabled.

Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
---
 net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ab37baf..4f49939 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
 }
 
 static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
-			   size_t *len)
+			   size_t *len, int vnet_hdr_sz)
 {
 	struct virtio_net_hdr vnet_hdr;
+	int ret;
 
-	if (*len < sizeof(vnet_hdr))
+	if (*len < vnet_hdr_sz)
 		return -EINVAL;
-	*len -= sizeof(vnet_hdr);
+	*len -= vnet_hdr_sz;
 
 	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
 		return -EINVAL;
 
-	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
+	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
+
+	/* reserve space for extra info in vnet_hdr if needed */
+	if (ret == 0)
+		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
+
+	return ret;
 }
 
 /*
@@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 				       (maclen < 16 ? 16 : maclen)) +
 				       po->tp_reserve;
 		if (po->has_vnet_hdr) {
-			netoff += sizeof(struct virtio_net_hdr);
+			netoff += po->vnet_hdr_sz;
 			do_vnet = true;
 		}
 		macoff = netoff - maclen;
@@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
 }
 
 static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
-				 struct virtio_net_hdr *vnet_hdr)
+				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
 {
-	if (*len < sizeof(*vnet_hdr))
+	int ret;
+
+	if (*len < vnet_hdr_sz)
 		return -EINVAL;
-	*len -= sizeof(*vnet_hdr);
+	*len -= vnet_hdr_sz;
 
 	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
 		return -EFAULT;
 
-	return __packet_snd_vnet_parse(vnet_hdr, *len);
+	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
+
+	/* move iter to point to the start of mac header */
+	if (ret == 0)
+		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
+	return ret;
 }
 
 static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
@@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int status = TP_STATUS_AVAILABLE;
 	int hlen, tlen, copylen = 0;
 	long timeo = 0;
+	int vnet_hdr_sz;
 
 	mutex_lock(&po->pg_vec_lock);
 
@@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		tlen = dev->needed_tailroom;
 		if (po->has_vnet_hdr) {
 			vnet_hdr = data;
-			data += sizeof(*vnet_hdr);
-			tp_len -= sizeof(*vnet_hdr);
+			vnet_hdr_sz = po->vnet_hdr_sz;
+			data += vnet_hdr_sz;
+			tp_len -= vnet_hdr_sz;
 			if (tp_len < 0 ||
 			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
 				tp_len = -EINVAL;
@@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	int offset = 0;
 	struct packet_sock *po = pkt_sk(sk);
 	bool has_vnet_hdr = false;
+	int vnet_hdr_sz;
 	int hlen, tlen, linear;
 	int extra_len = 0;
 
@@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (sock->type == SOCK_RAW)
 		reserve = dev->hard_header_len;
 	if (po->has_vnet_hdr) {
-		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
+		vnet_hdr_sz = po->vnet_hdr_sz;
+		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
 		if (err)
 			goto out_unlock;
 		has_vnet_hdr = true;
@@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
 		if (err)
 			goto out_free;
-		len += sizeof(vnet_hdr);
+		len += vnet_hdr_sz;
 		virtio_net_hdr_set_proto(skb, &vnet_hdr);
 	}
 
@@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	packet_rcv_try_clear_pressure(pkt_sk(sk));
 
 	if (pkt_sk(sk)->has_vnet_hdr) {
-		err = packet_rcv_vnet(msg, skb, &len);
+		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
+		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
 		if (err)
 			goto out_free;
-		vnet_hdr_len = sizeof(struct virtio_net_hdr);
 	}
 
 	/* You lose any data beyond the buffer you gave. If it worries
-- 
1.8.3.1



* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-09 12:43 ` [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz 沈安琪(凛玥)
@ 2023-02-09 13:07   ` Michael S. Tsirkin
  2023-02-10  4:01     ` 沈安琪(凛玥)
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2023-02-09 13:07 UTC (permalink / raw)
  To: 沈安琪(凛玥)
  Cc: netdev, willemdebruijn.kernel, davem, jasowang, 谈鉴锋

On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> 
> When a raw socket is used as the backend for kernel vhost, it
> currently assumes a 10-byte virtio net header, which is not always
> correct: some virtio features, such as mrg_rxbuf and VERSION_1,
> require a 12-byte virtio net header.
>
> Instead of hardcoding the virtio net header length to 10 bytes,
> tpacket_snd, tpacket_rcv, packet_snd and packet_recvmsg now use the
> virtio net header size recorded in packet_sock, which reflects the
> size the virtio user actually prepends to the packets. This fixes the
> incorrect mac header parsing when virtio features that require a
> non-10-byte virtio net header are enabled.
> 
> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>

Does it handle VERSION_1 though? That one is also LE.
Would it be better to pass a features bitmap instead?


> ---
>  net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 33 insertions(+), 15 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index ab37baf..4f49939 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
>  }
>  
>  static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
> -			   size_t *len)
> +			   size_t *len, int vnet_hdr_sz)
>  {
>  	struct virtio_net_hdr vnet_hdr;
> +	int ret;
>  
> -	if (*len < sizeof(vnet_hdr))
> +	if (*len < vnet_hdr_sz)
>  		return -EINVAL;
> -	*len -= sizeof(vnet_hdr);
> +	*len -= vnet_hdr_sz;
>  
>  	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
>  		return -EINVAL;
>  
> -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> +
> +	/* reserve space for extra info in vnet_hdr if needed */
> +	if (ret == 0)
> +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
> +
> +	return ret;
>  }
>  
>  /*
> @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>  				       (maclen < 16 ? 16 : maclen)) +
>  				       po->tp_reserve;
>  		if (po->has_vnet_hdr) {
> -			netoff += sizeof(struct virtio_net_hdr);
> +			netoff += po->vnet_hdr_sz;
>  			do_vnet = true;
>  		}
>  		macoff = netoff - maclen;
> @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
>  }
>  
>  static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
> -				 struct virtio_net_hdr *vnet_hdr)
> +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
>  {
> -	if (*len < sizeof(*vnet_hdr))
> +	int ret;
> +
> +	if (*len < vnet_hdr_sz)
>  		return -EINVAL;
> -	*len -= sizeof(*vnet_hdr);
> +	*len -= vnet_hdr_sz;
>  
>  	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
>  		return -EFAULT;
>  
> -	return __packet_snd_vnet_parse(vnet_hdr, *len);
> +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
> +
> +	/* move iter to point to the start of mac header */
> +	if (ret == 0)
> +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
> +	return ret;
>  }
>  
>  static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>  	int status = TP_STATUS_AVAILABLE;
>  	int hlen, tlen, copylen = 0;
>  	long timeo = 0;
> +	int vnet_hdr_sz;
>  
>  	mutex_lock(&po->pg_vec_lock);
>  
> @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>  		tlen = dev->needed_tailroom;
>  		if (po->has_vnet_hdr) {
>  			vnet_hdr = data;
> -			data += sizeof(*vnet_hdr);
> -			tp_len -= sizeof(*vnet_hdr);
> +			vnet_hdr_sz = po->vnet_hdr_sz;
> +			data += vnet_hdr_sz;
> +			tp_len -= vnet_hdr_sz;
>  			if (tp_len < 0 ||
>  			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
>  				tp_len = -EINVAL;
> @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	int offset = 0;
>  	struct packet_sock *po = pkt_sk(sk);
>  	bool has_vnet_hdr = false;
> +	int vnet_hdr_sz;
>  	int hlen, tlen, linear;
>  	int extra_len = 0;
>  
> @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	if (sock->type == SOCK_RAW)
>  		reserve = dev->hard_header_len;
>  	if (po->has_vnet_hdr) {
> -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
> +		vnet_hdr_sz = po->vnet_hdr_sz;
> +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
>  		if (err)
>  			goto out_unlock;
>  		has_vnet_hdr = true;
> @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
>  		if (err)
>  			goto out_free;
> -		len += sizeof(vnet_hdr);
> +		len += vnet_hdr_sz;
>  		virtio_net_hdr_set_proto(skb, &vnet_hdr);
>  	}
>  
> @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>  	packet_rcv_try_clear_pressure(pkt_sk(sk));
>  
>  	if (pkt_sk(sk)->has_vnet_hdr) {
> -		err = packet_rcv_vnet(msg, skb, &len);
> +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
> +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
>  		if (err)
>  			goto out_free;
> -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
>  	}
>  
>  	/* You lose any data beyond the buffer you gave. If it worries
> -- 
> 1.8.3.1



* Re: [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz
  2023-02-09 12:43 ` [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz 沈安琪(凛玥)
@ 2023-02-09 14:45   ` Willem de Bruijn
  2023-02-10  4:00     ` 沈安琪(凛玥)
  0 siblings, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2023-02-09 14:45 UTC (permalink / raw)
  To: 沈安琪(凛玥)
  Cc: netdev, willemdebruijn.kernel, mst, davem, jasowang,
	谈鉴锋

On Thu, Feb 9, 2023 at 7:43 AM 沈安琪(凛玥) <amy.saq@antgroup.com> wrote:
>
> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>
> Raw socket can be used as the backend for kernel vhost, like tap.

Please refer to PF_PACKET sockets as packet sockets.

"raw" sockets is ambiguous: it is also used to refer to type SOCK_RAW
of other socket families.

> However, the current raw socket implementation uses a hardcoded virtio
> net header length, which causes incorrect mac header parsing when
> virtio features that require a virtio net header other than 10 bytes
> are used.

This series only adds support for skipping past two extra bytes.

The 2-byte field num_buffers in virtio_net_hdr_mrg_rxbuf is used for
coalesced buffers. That is not a feature packet sockets support.

How do you intend to use this? Only ever with num_buffers == 1?

We have to make ABI changes sparingly. It would take another setsockopt
to signal actual use of this feature.

If adding an extended struct, then this also needs to be documented in
the UAPI headers.

> Add a new field vnet_hdr_sz to packet_sock to record the virtio net
> header size the socket should use, and a new sockopt
> PACKET_VNET_HDR_SZ that lets userspace set that size, so that the raw
> socket knows the exact virtio net header size to use instead of a
> hardcoded one, avoiding incorrect header parsing.
>
> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> ---
>  include/uapi/linux/if_packet.h |  1 +
>  net/packet/af_packet.c         | 34 ++++++++++++++++++++++++++++++++++
>  net/packet/internal.h          |  3 ++-
>  3 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index 78c981d..9efc423 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -59,6 +59,7 @@ struct sockaddr_ll {
>  #define PACKET_ROLLOVER_STATS          21
>  #define PACKET_FANOUT_DATA             22
>  #define PACKET_IGNORE_OUTGOING         23
> +#define PACKET_VNET_HDR_SZ             24
>
>  #define PACKET_FANOUT_HASH             0
>  #define PACKET_FANOUT_LB               1
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 8ffb19c..8389f18 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -3936,11 +3936,42 @@ static void packet_flush_mclist(struct sock *sk)
>                         ret = -EBUSY;
>                 } else {
>                         po->has_vnet_hdr = !!val;
> +                       /* set vnet_hdr_sz to default value */
> +                       if (po->has_vnet_hdr)
> +                               po->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
> +                       else
> +                               po->vnet_hdr_sz = 0;
>                         ret = 0;
>                 }
>                 release_sock(sk);
>                 return ret;
>         }
> +       case PACKET_VNET_HDR_SZ:
> +       {
> +               int val;
> +
> +               if (sock->type != SOCK_RAW)
> +                       return -EINVAL;
> +               if (optlen < sizeof(val))
> +                       return -EINVAL;
> +               if (copy_from_user(&val, optval, sizeof(val)))
> +                       return -EFAULT;

This duplicates the code in PACKET_VNET_HDR. I'd prefer:

        case PACKET_VNET_HDR:
        case PACKET_VNET_HDR_SZ:

        .. sanity checks and copy from user ..

        if (optname == PACKET_VNET_HDR)
                val = sizeof(struct virtio_net_hdr);

And move the check for valid lengths before taking the lock.

> +
> +               lock_sock(sk);
> +               if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) {
> +                       ret = -EBUSY;
> +               } else {
> +                       if (val == sizeof(struct virtio_net_hdr) ||
> +                           val == sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
> +                               po->vnet_hdr_sz = val;
> +                               ret = 0;
> +                       } else {
> +                               ret = -EINVAL;
> +                       }
> +               }
> +               release_sock(sk);
> +               return ret;
> +       }
>         case PACKET_TIMESTAMP:
>         {
>                 int val;
> @@ -4070,6 +4101,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
>         case PACKET_VNET_HDR:
>                 val = po->has_vnet_hdr;
>                 break;
> +       case PACKET_VNET_HDR_SZ:
> +               val = po->vnet_hdr_sz;
> +               break;
>         case PACKET_VERSION:
>                 val = po->tp_version;
>                 break;
> diff --git a/net/packet/internal.h b/net/packet/internal.h
> index 48af35b..e27b47d 100644
> --- a/net/packet/internal.h
> +++ b/net/packet/internal.h
> @@ -121,7 +121,8 @@ struct packet_sock {
>                                 origdev:1,
>                                 has_vnet_hdr:1,
>                                 tp_loss:1,
> -                               tp_tx_has_off:1;
> +                               tp_tx_has_off:1,
> +                               vnet_hdr_sz:8;  /* vnet header size to use */

This location looks fine from the point of view of using holes in the
struct:


        /* --- cacheline 12 boundary (768 bytes) --- */
        struct packet_ring_buffer  rx_ring;              /*   768   200 */
        /* --- cacheline 15 boundary (960 bytes) was 8 bytes ago --- */
        struct packet_ring_buffer  tx_ring;              /*   968   200 */
        /* --- cacheline 18 boundary (1152 bytes) was 16 bytes ago --- */
        int                        copy_thresh;          /*  1168     4 */
        spinlock_t                 bind_lock;            /*  1172     4 */
        struct mutex               pg_vec_lock;          /*  1176    32 */
        unsigned int               running;              /*  1208     4 */
        unsigned int               auxdata:1;            /*  1212: 0  4 */
        unsigned int               origdev:1;            /*  1212: 1  4 */
        unsigned int               has_vnet_hdr:1;       /*  1212: 2  4 */
        unsigned int               tp_loss:1;            /*  1212: 3  4 */
        unsigned int               tp_tx_has_off:1;      /*  1212: 4  4 */

        /* XXX 27 bits hole, try to pack */

        /* --- cacheline 19 boundary (1216 bytes) --- */

>         int                     pressure;
>         int                     ifindex;        /* bound device         */
>         __be16                  num;
> --
> 1.8.3.1
>


* Re: [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz
  2023-02-09 14:45   ` Willem de Bruijn
@ 2023-02-10  4:00     ` 沈安琪(凛玥)
  0 siblings, 0 replies; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-10  4:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, willemdebruijn.kernel, mst, davem, jasowang,
	谈鉴锋


On 2023/2/9 10:45 PM, Willem de Bruijn wrote:
> On Thu, Feb 9, 2023 at 7:43 AM 沈安琪(凛玥) <amy.saq@antgroup.com> wrote:
>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>>
>> Raw socket can be used as the backend for kernel vhost, like tap.
> Please refer to PF_PACKET sockets as packet sockets.
>
> "raw" sockets is ambiguous: it is also used to refer to type SOCK_RAW
> of other socket families.
>
>> However, the current raw socket implementation uses a hardcoded
>> virtio net header length, which causes incorrect mac header parsing
>> when virtio features that require a virtio net header other than 10
>> bytes are used.
> This series only adds support for skipping past two extra bytes.
>
> The 2-byte field num_buffers in virtio_net_hdr_mrg_rxbuf is used for
> coalesced buffers. That is not a feature packet sockets support.
>
> How do you intend to use this? Only ever with num_buffers == 1?


This 2-byte field is later used to record how many buffers are used
when virtio mergeable rx buffers are enabled, in drivers/vhost/net.c:1231.

Here we need to skip this 2-byte field in af_packet; otherwise the
number-of-buffers info would overwrite the mac header info.


>
> We have to make ABI changes sparingly. It would take another setsockopt
> to signal actual use of this feature.
>
> If adding an extended struct, then this also needs to be documented in
> the UAPI headers.
>
>> Add a new field vnet_hdr_sz to packet_sock to record the virtio net
>> header size the socket should use, and a new sockopt
>> PACKET_VNET_HDR_SZ that lets userspace set that size, so that the
>> raw socket knows the exact virtio net header size to use instead of
>> a hardcoded one, avoiding incorrect header parsing.
>>
>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
>> ---
>>   include/uapi/linux/if_packet.h |  1 +
>>   net/packet/af_packet.c         | 34 ++++++++++++++++++++++++++++++++++
>>   net/packet/internal.h          |  3 ++-
>>   3 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
>> index 78c981d..9efc423 100644
>> --- a/include/uapi/linux/if_packet.h
>> +++ b/include/uapi/linux/if_packet.h
>> @@ -59,6 +59,7 @@ struct sockaddr_ll {
>>   #define PACKET_ROLLOVER_STATS          21
>>   #define PACKET_FANOUT_DATA             22
>>   #define PACKET_IGNORE_OUTGOING         23
>> +#define PACKET_VNET_HDR_SZ             24
>>
>>   #define PACKET_FANOUT_HASH             0
>>   #define PACKET_FANOUT_LB               1
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 8ffb19c..8389f18 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -3936,11 +3936,42 @@ static void packet_flush_mclist(struct sock *sk)
>>                          ret = -EBUSY;
>>                  } else {
>>                          po->has_vnet_hdr = !!val;
>> +                       /* set vnet_hdr_sz to default value */
>> +                       if (po->has_vnet_hdr)
>> +                               po->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
>> +                       else
>> +                               po->vnet_hdr_sz = 0;
>>                          ret = 0;
>>                  }
>>                  release_sock(sk);
>>                  return ret;
>>          }
>> +       case PACKET_VNET_HDR_SZ:
>> +       {
>> +               int val;
>> +
>> +               if (sock->type != SOCK_RAW)
>> +                       return -EINVAL;
>> +               if (optlen < sizeof(val))
>> +                       return -EINVAL;
>> +               if (copy_from_user(&val, optval, sizeof(val)))
>> +                       return -EFAULT;
> This duplicates the code in PACKET_VNET_HDR. I'd prefer:
>
>          case PACKET_VNET_HDR:
>          case PACKET_VNET_HDR_SZ:
>
>          .. sanity checks and copy from user ..
>
>          if (optname == PACKET_VNET_HDR)
>                  val = sizeof(struct virtio_net_hdr);
>
> And move the check for valid lengths before taking the lock.


Thanks for pointing that out. It makes sense and we will address it in
the next version of the patch.


>
>> +
>> +               lock_sock(sk);
>> +               if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) {
>> +                       ret = -EBUSY;
>> +               } else {
>> +                       if (val == sizeof(struct virtio_net_hdr) ||
>> +                           val == sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
>> +                               po->vnet_hdr_sz = val;
>> +                               ret = 0;
>> +                       } else {
>> +                               ret = -EINVAL;
>> +                       }
>> +               }
>> +               release_sock(sk);
>> +               return ret;
>> +       }
>>          case PACKET_TIMESTAMP:
>>          {
>>                  int val;
>> @@ -4070,6 +4101,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
>>          case PACKET_VNET_HDR:
>>                  val = po->has_vnet_hdr;
>>                  break;
>> +       case PACKET_VNET_HDR_SZ:
>> +               val = po->vnet_hdr_sz;
>> +               break;
>>          case PACKET_VERSION:
>>                  val = po->tp_version;
>>                  break;
>> diff --git a/net/packet/internal.h b/net/packet/internal.h
>> index 48af35b..e27b47d 100644
>> --- a/net/packet/internal.h
>> +++ b/net/packet/internal.h
>> @@ -121,7 +121,8 @@ struct packet_sock {
>>                                  origdev:1,
>>                                  has_vnet_hdr:1,
>>                                  tp_loss:1,
>> -                               tp_tx_has_off:1;
>> +                               tp_tx_has_off:1,
>> +                               vnet_hdr_sz:8;  /* vnet header size to use */
> This location looks fine from the point of view of using holes in the
> struct:
>
>
>          /* --- cacheline 12 boundary (768 bytes) --- */
>          struct packet_ring_buffer  rx_ring;              /*   768   200 */
>          /* --- cacheline 15 boundary (960 bytes) was 8 bytes ago --- */
>          struct packet_ring_buffer  tx_ring;              /*   968   200 */
>          /* --- cacheline 18 boundary (1152 bytes) was 16 bytes ago --- */
>          int                        copy_thresh;          /*  1168     4 */
>          spinlock_t                 bind_lock;            /*  1172     4 */
>          struct mutex               pg_vec_lock;          /*  1176    32 */
>          unsigned int               running;              /*  1208     4 */
>          unsigned int               auxdata:1;            /*  1212: 0  4 */
>          unsigned int               origdev:1;            /*  1212: 1  4 */
>          unsigned int               has_vnet_hdr:1;       /*  1212: 2  4 */
>          unsigned int               tp_loss:1;            /*  1212: 3  4 */
>          unsigned int               tp_tx_has_off:1;      /*  1212: 4  4 */
>
>          /* XXX 27 bits hole, try to pack */
>
>          /* --- cacheline 19 boundary (1216 bytes) --- */
>
>>          int                     pressure;
>>          int                     ifindex;        /* bound device         */
>>          __be16                  num;
>> --
>> 1.8.3.1
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-09 13:07   ` Michael S. Tsirkin
@ 2023-02-10  4:01     ` 沈安琪(凛玥)
  2023-02-10  8:10       ` Michael S. Tsirkin
  0 siblings, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-10  4:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, willemdebruijn.kernel, davem, jasowang, 谈鉴锋


On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>>
>> When raw socket is used as the backend for kernel vhost, currently it
>> will regard the virtio net header as 10-byte, which is not always the
>> case since some virtio features need virtio net header other than
>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
>> net header.
>>
>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
>> size that is recorded in packet_sock to indicate the exact virtio net
>> header size that virtio user actually prepares in the packets. By doing
>> so, it can fix the issue of incorrect mac header parsing when these
>> virtio features that need virtio net header other than 10-byte are
>> enabled.
>>
>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> Does it handle VERSION_1 though? That one is also LE.
> Would it be better to pass a features bitmap instead?


Thanks for the quick reply!

I am a little confused about what "LE" stands for here?

Regarding passing a features bitmap to af_packet, our concern is that it
would be too complicated for af_packet to interpret the virtio features
bitmap just to derive the vnet header size. For now, all the virtio
feature handling is done by the vhost worker, and af_packet does not
need to know much about virtio features. Would it be better to keep the
virtio feature handling in user level and have user level tell af_packet
how much space it should reserve?

>
>
>> ---
>>   net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 33 insertions(+), 15 deletions(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index ab37baf..4f49939 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
>>   }
>>   
>>   static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
>> -			   size_t *len)
>> +			   size_t *len, int vnet_hdr_sz)
>>   {
>>   	struct virtio_net_hdr vnet_hdr;
>> +	int ret;
>>   
>> -	if (*len < sizeof(vnet_hdr))
>> +	if (*len < vnet_hdr_sz)
>>   		return -EINVAL;
>> -	*len -= sizeof(vnet_hdr);
>> +	*len -= vnet_hdr_sz;
>>   
>>   	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
>>   		return -EINVAL;
>>   
>> -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
>> +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
>> +
>> +	/* reserve space for extra info in vnet_hdr if needed */
>> +	if (ret == 0)
>> +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
>> +
>> +	return ret;
>>   }
>>   
>>   /*
>> @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>   				       (maclen < 16 ? 16 : maclen)) +
>>   				       po->tp_reserve;
>>   		if (po->has_vnet_hdr) {
>> -			netoff += sizeof(struct virtio_net_hdr);
>> +			netoff += po->vnet_hdr_sz;
>>   			do_vnet = true;
>>   		}
>>   		macoff = netoff - maclen;
>> @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
>>   }
>>   
>>   static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
>> -				 struct virtio_net_hdr *vnet_hdr)
>> +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
>>   {
>> -	if (*len < sizeof(*vnet_hdr))
>> +	int ret;
>> +
>> +	if (*len < vnet_hdr_sz)
>>   		return -EINVAL;
>> -	*len -= sizeof(*vnet_hdr);
>> +	*len -= vnet_hdr_sz;
>>   
>>   	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
>>   		return -EFAULT;
>>   
>> -	return __packet_snd_vnet_parse(vnet_hdr, *len);
>> +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
>> +
>> +	/* move iter to point to the start of mac header */
>> +	if (ret == 0)
>> +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
>> +	return ret;
>>   }
>>   
>>   static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
>> @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>   	int status = TP_STATUS_AVAILABLE;
>>   	int hlen, tlen, copylen = 0;
>>   	long timeo = 0;
>> +	int vnet_hdr_sz;
>>   
>>   	mutex_lock(&po->pg_vec_lock);
>>   
>> @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>   		tlen = dev->needed_tailroom;
>>   		if (po->has_vnet_hdr) {
>>   			vnet_hdr = data;
>> -			data += sizeof(*vnet_hdr);
>> -			tp_len -= sizeof(*vnet_hdr);
>> +			vnet_hdr_sz = po->vnet_hdr_sz;
>> +			data += vnet_hdr_sz;
>> +			tp_len -= vnet_hdr_sz;
>>   			if (tp_len < 0 ||
>>   			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
>>   				tp_len = -EINVAL;
>> @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>>   	int offset = 0;
>>   	struct packet_sock *po = pkt_sk(sk);
>>   	bool has_vnet_hdr = false;
>> +	int vnet_hdr_sz;
>>   	int hlen, tlen, linear;
>>   	int extra_len = 0;
>>   
>> @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>>   	if (sock->type == SOCK_RAW)
>>   		reserve = dev->hard_header_len;
>>   	if (po->has_vnet_hdr) {
>> -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
>> +		vnet_hdr_sz = po->vnet_hdr_sz;
>> +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
>>   		if (err)
>>   			goto out_unlock;
>>   		has_vnet_hdr = true;
>> @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>>   		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
>>   		if (err)
>>   			goto out_free;
>> -		len += sizeof(vnet_hdr);
>> +		len += vnet_hdr_sz;
>>   		virtio_net_hdr_set_proto(skb, &vnet_hdr);
>>   	}
>>   
>> @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>>   	packet_rcv_try_clear_pressure(pkt_sk(sk));
>>   
>>   	if (pkt_sk(sk)->has_vnet_hdr) {
>> -		err = packet_rcv_vnet(msg, skb, &len);
>> +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
>> +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
>>   		if (err)
>>   			goto out_free;
>> -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
>>   	}
>>   
>>   	/* You lose any data beyond the buffer you gave. If it worries
>> -- 
>> 1.8.3.1


* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-10  4:01     ` 沈安琪(凛玥)
@ 2023-02-10  8:10       ` Michael S. Tsirkin
  2023-02-10 15:36         ` Willem de Bruijn
  2023-02-10 15:39         ` Willem de Bruijn
  0 siblings, 2 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2023-02-10  8:10 UTC (permalink / raw)
  To: 沈安琪(凛玥)
  Cc: netdev, willemdebruijn.kernel, davem, jasowang, 谈鉴锋

On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> 
> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> > On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> > > From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> > > 
> > > When raw socket is used as the backend for kernel vhost, currently it
> > > will regard the virtio net header as 10-byte, which is not always the
> > > case since some virtio features need virtio net header other than
> > > 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> > > net header.
> > > 
> > > Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> > > tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> > > size that is recorded in packet_sock to indicate the exact virtio net
> > > header size that virtio user actually prepares in the packets. By doing
> > > so, it can fix the issue of incorrect mac header parsing when these
> > > virtio features that need virtio net header other than 10-byte are
> > > enabled.
> > > 
> > > Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> > > Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> > > Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> > Does it handle VERSION_1 though? That one is also LE.
> > Would it be better to pass a features bitmap instead?
> 
> 
> Thanks for quick reply!
> 
> I am a little confused about what "LE" presents here?

LE == little_endian.
Little endian format.

> For passing a features bitmap to af_packet here, our consideration is
> whether it will be too complicated for af_packet to understand the virtio
> features bitmap in order to get the vnet header size. For now, all the
> virtio features stuff is handled by vhost worker and af_packet actually does
> not need to know much about virtio features. Would it be better if we keep
> the virtio feature stuff in user-level and let user-level tell af_packet how
> much space it should reserve?

Presumably, we'd add an API in include/linux/virtio_net.h ?

> > 
> > 
> > > ---
> > >   net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
> > >   1 file changed, 33 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > > index ab37baf..4f49939 100644
> > > --- a/net/packet/af_packet.c
> > > +++ b/net/packet/af_packet.c
> > > @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
> > >   }
> > >   static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
> > > -			   size_t *len)
> > > +			   size_t *len, int vnet_hdr_sz)
> > >   {
> > >   	struct virtio_net_hdr vnet_hdr;
> > > +	int ret;
> > > -	if (*len < sizeof(vnet_hdr))
> > > +	if (*len < vnet_hdr_sz)
> > >   		return -EINVAL;
> > > -	*len -= sizeof(vnet_hdr);
> > > +	*len -= vnet_hdr_sz;
> > >   	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
> > >   		return -EINVAL;
> > > -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > +
> > > +	/* reserve space for extra info in vnet_hdr if needed */
> > > +	if (ret == 0)
> > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
> > > +
> > > +	return ret;
> > >   }
> > >   /*
> > > @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> > >   				       (maclen < 16 ? 16 : maclen)) +
> > >   				       po->tp_reserve;
> > >   		if (po->has_vnet_hdr) {
> > > -			netoff += sizeof(struct virtio_net_hdr);
> > > +			netoff += po->vnet_hdr_sz;
> > >   			do_vnet = true;
> > >   		}
> > >   		macoff = netoff - maclen;
> > > @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
> > >   }
> > >   static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
> > > -				 struct virtio_net_hdr *vnet_hdr)
> > > +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
> > >   {
> > > -	if (*len < sizeof(*vnet_hdr))
> > > +	int ret;
> > > +
> > > +	if (*len < vnet_hdr_sz)
> > >   		return -EINVAL;
> > > -	*len -= sizeof(*vnet_hdr);
> > > +	*len -= vnet_hdr_sz;
> > >   	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
> > >   		return -EFAULT;
> > > -	return __packet_snd_vnet_parse(vnet_hdr, *len);
> > > +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
> > > +
> > > +	/* move iter to point to the start of mac header */
> > > +	if (ret == 0)
> > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
> > > +	return ret;
> > >   }
> > >   static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> > > @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > >   	int status = TP_STATUS_AVAILABLE;
> > >   	int hlen, tlen, copylen = 0;
> > >   	long timeo = 0;
> > > +	int vnet_hdr_sz;
> > >   	mutex_lock(&po->pg_vec_lock);
> > > @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > >   		tlen = dev->needed_tailroom;
> > >   		if (po->has_vnet_hdr) {
> > >   			vnet_hdr = data;
> > > -			data += sizeof(*vnet_hdr);
> > > -			tp_len -= sizeof(*vnet_hdr);
> > > +			vnet_hdr_sz = po->vnet_hdr_sz;
> > > +			data += vnet_hdr_sz;
> > > +			tp_len -= vnet_hdr_sz;
> > >   			if (tp_len < 0 ||
> > >   			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
> > >   				tp_len = -EINVAL;
> > > @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > >   	int offset = 0;
> > >   	struct packet_sock *po = pkt_sk(sk);
> > >   	bool has_vnet_hdr = false;
> > > +	int vnet_hdr_sz;
> > >   	int hlen, tlen, linear;
> > >   	int extra_len = 0;
> > > @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > >   	if (sock->type == SOCK_RAW)
> > >   		reserve = dev->hard_header_len;
> > >   	if (po->has_vnet_hdr) {
> > > -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
> > > +		vnet_hdr_sz = po->vnet_hdr_sz;
> > > +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
> > >   		if (err)
> > >   			goto out_unlock;
> > >   		has_vnet_hdr = true;
> > > @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > >   		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
> > >   		if (err)
> > >   			goto out_free;
> > > -		len += sizeof(vnet_hdr);
> > > +		len += vnet_hdr_sz;
> > >   		virtio_net_hdr_set_proto(skb, &vnet_hdr);
> > >   	}
> > > @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> > >   	packet_rcv_try_clear_pressure(pkt_sk(sk));
> > >   	if (pkt_sk(sk)->has_vnet_hdr) {
> > > -		err = packet_rcv_vnet(msg, skb, &len);
> > > +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
> > > +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
> > >   		if (err)
> > >   			goto out_free;
> > > -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
> > >   	}
> > >   	/* You lose any data beyond the buffer you gave. If it worries
> > > -- 
> > > 1.8.3.1



* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-10  8:10       ` Michael S. Tsirkin
@ 2023-02-10 15:36         ` Willem de Bruijn
  2023-02-12  9:54           ` Michael S. Tsirkin
  2023-02-10 15:39         ` Willem de Bruijn
  1 sibling, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2023-02-10 15:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, 沈安琪(凛玥)
  Cc: netdev, willemdebruijn.kernel, davem, jasowang, 谈鉴锋

Michael S. Tsirkin wrote:
> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> > 
> > On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> > > On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> > > > From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> > > > 
> > > > When raw socket is used as the backend for kernel vhost, currently it
> > > > will regard the virtio net header as 10-byte, which is not always the
> > > > case since some virtio features need virtio net header other than
> > > > 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> > > > net header.
> > > > 
> > > > Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> > > > tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> > > > size that is recorded in packet_sock to indicate the exact virtio net
> > > > header size that virtio user actually prepares in the packets. By doing
> > > > so, it can fix the issue of incorrect mac header parsing when these
> > > > virtio features that need virtio net header other than 10-byte are
> > > > enabled.
> > > > 
> > > > Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> > > > Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> > > > Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> > > Does it handle VERSION_1 though? That one is also LE.
> > > Would it be better to pass a features bitmap instead?
> > 
> > 
> > Thanks for quick reply!
> > 
> > I am a little confused about what "LE" presents here?
> 
> LE == little_endian.
> Little endian format.
> 
> > For passing a features bitmap to af_packet here, our consideration is
> > whether it will be too complicated for af_packet to understand the virtio
> > features bitmap in order to get the vnet header size. For now, all the
> > virtio features stuff is handled by vhost worker and af_packet actually does
> > not need to know much about virtio features. Would it be better if we keep
> > the virtio feature stuff in user-level and let user-level tell af_packet how
> > much space it should reserve?
> 
> Presumably, we'd add an API in include/linux/virtio_net.h ?

If packet sockets do not act on the contents of these extended fields,
it's probably better to leave them opaque.

This patch series probably should be one patch. The new option in the
first patch modifies the data path. Now there is one SHA1 at which its
behavior would not work.

> 
> > > 
> > > 
> > > > ---
> > > >   net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
> > > >   1 file changed, 33 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > > > index ab37baf..4f49939 100644
> > > > --- a/net/packet/af_packet.c
> > > > +++ b/net/packet/af_packet.c
> > > > @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
> > > >   }
> > > >   static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
> > > > -			   size_t *len)
> > > > +			   size_t *len, int vnet_hdr_sz)
> > > >   {
> > > >   	struct virtio_net_hdr vnet_hdr;
> > > > +	int ret;
> > > > -	if (*len < sizeof(vnet_hdr))
> > > > +	if (*len < vnet_hdr_sz)
> > > >   		return -EINVAL;
> > > > -	*len -= sizeof(vnet_hdr);
> > > > +	*len -= vnet_hdr_sz;
> > > >   	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
> > > >   		return -EINVAL;
> > > > -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > > +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > > +
> > > > +	/* reserve space for extra info in vnet_hdr if needed */
> > > > +	if (ret == 0)
> > > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
> > > > +

How about

    struct virtio_net_hdr_mrg_rxbuf vnet_hdr = { .num_buffers = 0 };

    ..

    ret = memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz);

To avoid the iov_iter_advance and properly initialize those bytes.

> > > > +	return ret;
> > > >   }
> > > >   /*
> > > > @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> > > >   				       (maclen < 16 ? 16 : maclen)) +
> > > >   				       po->tp_reserve;
> > > >   		if (po->has_vnet_hdr) {
> > > > -			netoff += sizeof(struct virtio_net_hdr);
> > > > +			netoff += po->vnet_hdr_sz;
> > > >   			do_vnet = true;
> > > >   		}
> > > >   		macoff = netoff - maclen;
> > > > @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
> > > >   }
> > > >   static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
> > > > -				 struct virtio_net_hdr *vnet_hdr)
> > > > +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
> > > >   {
> > > > -	if (*len < sizeof(*vnet_hdr))
> > > > +	int ret;
> > > > +
> > > > +	if (*len < vnet_hdr_sz)
> > > >   		return -EINVAL;
> > > > -	*len -= sizeof(*vnet_hdr);
> > > > +	*len -= vnet_hdr_sz;
> > > >   	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
> > > >   		return -EFAULT;
> > > > -	return __packet_snd_vnet_parse(vnet_hdr, *len);
> > > > +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
> > > > +
> > > > +	/* move iter to point to the start of mac header */
> > > > +	if (ret == 0)
> > > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
> > > > +	return ret;
> > > >   }
> > > >   static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> > > > @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > >   	int status = TP_STATUS_AVAILABLE;
> > > >   	int hlen, tlen, copylen = 0;
> > > >   	long timeo = 0;
> > > > +	int vnet_hdr_sz;
> > > >   	mutex_lock(&po->pg_vec_lock);
> > > > @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > >   		tlen = dev->needed_tailroom;
> > > >   		if (po->has_vnet_hdr) {
> > > >   			vnet_hdr = data;
> > > > -			data += sizeof(*vnet_hdr);
> > > > -			tp_len -= sizeof(*vnet_hdr);
> > > > +			vnet_hdr_sz = po->vnet_hdr_sz;
> > > > +			data += vnet_hdr_sz;
> > > > +			tp_len -= vnet_hdr_sz;
> > > >   			if (tp_len < 0 ||
> > > >   			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
> > > >   				tp_len = -EINVAL;
> > > > @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > >   	int offset = 0;
> > > >   	struct packet_sock *po = pkt_sk(sk);
> > > >   	bool has_vnet_hdr = false;
> > > > +	int vnet_hdr_sz;
> > > >   	int hlen, tlen, linear;
> > > >   	int extra_len = 0;
> > > > @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > >   	if (sock->type == SOCK_RAW)
> > > >   		reserve = dev->hard_header_len;
> > > >   	if (po->has_vnet_hdr) {
> > > > -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
> > > > +		vnet_hdr_sz = po->vnet_hdr_sz;
> > > > +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
> > > >   		if (err)
> > > >   			goto out_unlock;
> > > >   		has_vnet_hdr = true;
> > > > @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > >   		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
> > > >   		if (err)
> > > >   			goto out_free;
> > > > -		len += sizeof(vnet_hdr);
> > > > +		len += vnet_hdr_sz;
> > > >   		virtio_net_hdr_set_proto(skb, &vnet_hdr);
> > > >   	}
> > > > @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> > > >   	packet_rcv_try_clear_pressure(pkt_sk(sk));
> > > >   	if (pkt_sk(sk)->has_vnet_hdr) {
> > > > -		err = packet_rcv_vnet(msg, skb, &len);
> > > > +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
> > > > +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
> > > >   		if (err)
> > > >   			goto out_free;
> > > > -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
> > > >   	}
> > > >   	/* You lose any data beyond the buffer you gave. If it worries
> > > > -- 
> > > > 1.8.3.1
> 




* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-10  8:10       ` Michael S. Tsirkin
  2023-02-10 15:36         ` Willem de Bruijn
@ 2023-02-10 15:39         ` Willem de Bruijn
  2023-02-13 11:06           ` 沈安琪(凛玥)
  1 sibling, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2023-02-10 15:39 UTC (permalink / raw)
  To: Michael S. Tsirkin, 沈安琪(凛玥)
  Cc: netdev, willemdebruijn.kernel, davem, jasowang, 谈鉴锋

Michael S. Tsirkin wrote:
> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> > 
> > On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> > > On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> > > > From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> > > > 
> > > > When raw socket is used as the backend for kernel vhost, currently it
> > > > will regard the virtio net header as 10-byte, which is not always the
> > > > case since some virtio features need virtio net header other than
> > > > 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> > > > net header.
> > > > 
> > > > Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> > > > tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> > > > size that is recorded in packet_sock to indicate the exact virtio net
> > > > header size that virtio user actually prepares in the packets. By doing
> > > > so, it can fix the issue of incorrect mac header parsing when these
> > > > virtio features that need virtio net header other than 10-byte are
> > > > enabled.
> > > > 
> > > > Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> > > > Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> > > > Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> > > Does it handle VERSION_1 though? That one is also LE.
> > > Would it be better to pass a features bitmap instead?
> > 
> > 
> > Thanks for quick reply!
> > 
> > I am a little confused about what "LE" presents here?
> 
> LE == little_endian.
> Little endian format.
> 
> > For passing a features bitmap to af_packet here, our consideration is
> > whether it will be too complicated for af_packet to understand the virtio
> > features bitmap in order to get the vnet header size. For now, all the
> > virtio features stuff is handled by vhost worker and af_packet actually does
> > not need to know much about virtio features. Would it be better if we keep
> > the virtio feature stuff in user-level and let user-level tell af_packet how
> > much space it should reserve?
> 
> Presumably, we'd add an API in include/linux/virtio_net.h ?

Better leave this opaque to packet sockets if they won't act on this
type info.
 
This patch series probably should be a single patch, by the way.
Otherwise the socket option introduced in the first patch is broken at
that commit, since the behavior is only added in patch 2.

> > > 
> > > 
> > > > ---
> > > >   net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
> > > >   1 file changed, 33 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > > > index ab37baf..4f49939 100644
> > > > --- a/net/packet/af_packet.c
> > > > +++ b/net/packet/af_packet.c
> > > > @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
> > > >   }
> > > >   static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
> > > > -			   size_t *len)
> > > > +			   size_t *len, int vnet_hdr_sz)
> > > >   {
> > > >   	struct virtio_net_hdr vnet_hdr;
> > > > +	int ret;
> > > > -	if (*len < sizeof(vnet_hdr))
> > > > +	if (*len < vnet_hdr_sz)
> > > >   		return -EINVAL;
> > > > -	*len -= sizeof(vnet_hdr);
> > > > +	*len -= vnet_hdr_sz;
> > > >   	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
> > > >   		return -EINVAL;
> > > > -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > > +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > > +
> > > > +	/* reserve space for extra info in vnet_hdr if needed */
> > > > +	if (ret == 0)
> > > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
> > > > +

How about

    struct virtio_net_hdr_mrg_rxbuf vnet_hdr = { .num_buffers = 0 };

    ..

    ret = memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz);

To initialize data correctly and avoid the extra function call.

> > > > +	return ret;
> > > >   }
> > > >   /*
> > > > @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> > > >   				       (maclen < 16 ? 16 : maclen)) +
> > > >   				       po->tp_reserve;
> > > >   		if (po->has_vnet_hdr) {
> > > > -			netoff += sizeof(struct virtio_net_hdr);
> > > > +			netoff += po->vnet_hdr_sz;
> > > >   			do_vnet = true;
> > > >   		}
> > > >   		macoff = netoff - maclen;
> > > > @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
> > > >   }
> > > >   static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
> > > > -				 struct virtio_net_hdr *vnet_hdr)
> > > > +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
> > > >   {
> > > > -	if (*len < sizeof(*vnet_hdr))
> > > > +	int ret;
> > > > +
> > > > +	if (*len < vnet_hdr_sz)
> > > >   		return -EINVAL;
> > > > -	*len -= sizeof(*vnet_hdr);
> > > > +	*len -= vnet_hdr_sz;
> > > >   	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
> > > >   		return -EFAULT;
> > > > -	return __packet_snd_vnet_parse(vnet_hdr, *len);
> > > > +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
> > > > +
> > > > +	/* move iter to point to the start of mac header */
> > > > +	if (ret == 0)
> > > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
> > > > +	return ret;
> > > >   }
> > > >   static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> > > > @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > >   	int status = TP_STATUS_AVAILABLE;
> > > >   	int hlen, tlen, copylen = 0;
> > > >   	long timeo = 0;
> > > > +	int vnet_hdr_sz;
> > > >   	mutex_lock(&po->pg_vec_lock);
> > > > @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > >   		tlen = dev->needed_tailroom;
> > > >   		if (po->has_vnet_hdr) {
> > > >   			vnet_hdr = data;
> > > > -			data += sizeof(*vnet_hdr);
> > > > -			tp_len -= sizeof(*vnet_hdr);
> > > > +			vnet_hdr_sz = po->vnet_hdr_sz;
> > > > +			data += vnet_hdr_sz;
> > > > +			tp_len -= vnet_hdr_sz;
> > > >   			if (tp_len < 0 ||
> > > >   			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
> > > >   				tp_len = -EINVAL;
> > > > @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > >   	int offset = 0;
> > > >   	struct packet_sock *po = pkt_sk(sk);
> > > >   	bool has_vnet_hdr = false;
> > > > +	int vnet_hdr_sz;
> > > >   	int hlen, tlen, linear;
> > > >   	int extra_len = 0;
> > > > @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > >   	if (sock->type == SOCK_RAW)
> > > >   		reserve = dev->hard_header_len;
> > > >   	if (po->has_vnet_hdr) {
> > > > -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
> > > > +		vnet_hdr_sz = po->vnet_hdr_sz;
> > > > +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
> > > >   		if (err)
> > > >   			goto out_unlock;
> > > >   		has_vnet_hdr = true;
> > > > @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > >   		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
> > > >   		if (err)
> > > >   			goto out_free;
> > > > -		len += sizeof(vnet_hdr);
> > > > +		len += vnet_hdr_sz;
> > > >   		virtio_net_hdr_set_proto(skb, &vnet_hdr);
> > > >   	}
> > > > @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> > > >   	packet_rcv_try_clear_pressure(pkt_sk(sk));
> > > >   	if (pkt_sk(sk)->has_vnet_hdr) {
> > > > -		err = packet_rcv_vnet(msg, skb, &len);
> > > > +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
> > > > +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
> > > >   		if (err)
> > > >   			goto out_free;
> > > > -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
> > > >   	}
> > > >   	/* You lose any data beyond the buffer you gave. If it worries
> > > > -- 
> > > > 1.8.3.1
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-10 15:36         ` Willem de Bruijn
@ 2023-02-12  9:54           ` Michael S. Tsirkin
  0 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2023-02-12  9:54 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: 沈安琪(凛玥),
	netdev, davem, jasowang, 谈鉴锋

On Fri, Feb 10, 2023 at 10:36:16AM -0500, Willem de Bruijn wrote:
> Michael S. Tsirkin wrote:
> > On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> > > 
> > > On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> > > > > From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> > > > > 
> > > > > When raw socket is used as the backend for kernel vhost, currently it
> > > > > will regard the virtio net header as 10-byte, which is not always the
> > > > > case since some virtio features need virtio net header other than
> > > > > 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> > > > > net header.
> > > > > 
> > > > > Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> > > > > tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> > > > > size that is recorded in packet_sock to indicate the exact virtio net
> > > > > header size that virtio user actually prepares in the packets. By doing
> > > > > so, it can fix the issue of incorrect mac header parsing when these
> > > > > virtio features that need virtio net header other than 10-byte are
> > > > > enabled.
> > > > > 
> > > > > Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> > > > > Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> > > > > Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> > > > Does it handle VERSION_1 though? That one is also LE.
> > > > Would it be better to pass a features bitmap instead?
> > > 
> > > 
> > > Thanks for quick reply!
> > > 
> > > I am a little confused about what "LE" represents here?
> > 
> > LE == little_endian.
> > Little endian format.
> > 
> > > For passing a features bitmap to af_packet here, our consideration is
> > > whether it will be too complicated for af_packet to understand the virtio
> > > features bitmap in order to get the vnet header size. For now, all the
> > > virtio features stuff is handled by vhost worker and af_packet actually does
> > > not need to know much about virtio features. Would it be better if we keep
> > > the virtio feature stuff in user-level and let user-level tell af_packet how
> > > much space it should reserve?
> > 
> > Presumably, we'd add an API in include/linux/virtio_net.h ?
> 
> If packet sockets do not act on the contents of these extended fields,
> it's probably better to leave them opaque.

Well, if the point is to support VERSION_1, which the commit log
says it is, then the switch to LE format does affect all fields,
both those used and those unused by the packet socket.



> This patch series probably should be one patch. The new option in the
> first patch modifies the data path. Now there is one SHA1 at which its
> behavior would not work.
> 
> > 
> > > > 
> > > > 
> > > > > ---
> > > > >   net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
> > > > >   1 file changed, 33 insertions(+), 15 deletions(-)
> > > > > 
> > > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > > > > index ab37baf..4f49939 100644
> > > > > --- a/net/packet/af_packet.c
> > > > > +++ b/net/packet/af_packet.c
> > > > > @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
> > > > >   }
> > > > >   static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
> > > > > -			   size_t *len)
> > > > > +			   size_t *len, int vnet_hdr_sz)
> > > > >   {
> > > > >   	struct virtio_net_hdr vnet_hdr;
> > > > > +	int ret;
> > > > > -	if (*len < sizeof(vnet_hdr))
> > > > > +	if (*len < vnet_hdr_sz)
> > > > >   		return -EINVAL;
> > > > > -	*len -= sizeof(vnet_hdr);
> > > > > +	*len -= vnet_hdr_sz;
> > > > >   	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
> > > > >   		return -EINVAL;
> > > > > -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > > > +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
> > > > > +
> > > > > +	/* reserve space for extra info in vnet_hdr if needed */
> > > > > +	if (ret == 0)
> > > > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
> > > > > +
> 
> How about
> 
>     struct virtio_net_hdr_mrg_rxbuf vnet_hdr = { .num_buffers = 0 };
> 
>     ..
> 
>     ret = memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz);
> 
> To avoid the iov_iter_advance and properly initialize those bytes.
> 
> > > > > +	return ret;
> > > > >   }
> > > > >   /*
> > > > > @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> > > > >   				       (maclen < 16 ? 16 : maclen)) +
> > > > >   				       po->tp_reserve;
> > > > >   		if (po->has_vnet_hdr) {
> > > > > -			netoff += sizeof(struct virtio_net_hdr);
> > > > > +			netoff += po->vnet_hdr_sz;
> > > > >   			do_vnet = true;
> > > > >   		}
> > > > >   		macoff = netoff - maclen;
> > > > > @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
> > > > >   }
> > > > >   static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
> > > > > -				 struct virtio_net_hdr *vnet_hdr)
> > > > > +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
> > > > >   {
> > > > > -	if (*len < sizeof(*vnet_hdr))
> > > > > +	int ret;
> > > > > +
> > > > > +	if (*len < vnet_hdr_sz)
> > > > >   		return -EINVAL;
> > > > > -	*len -= sizeof(*vnet_hdr);
> > > > > +	*len -= vnet_hdr_sz;
> > > > >   	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
> > > > >   		return -EFAULT;
> > > > > -	return __packet_snd_vnet_parse(vnet_hdr, *len);
> > > > > +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
> > > > > +
> > > > > +	/* move iter to point to the start of mac header */
> > > > > +	if (ret == 0)
> > > > > +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
> > > > > +	return ret;
> > > > >   }
> > > > >   static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> > > > > @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > > >   	int status = TP_STATUS_AVAILABLE;
> > > > >   	int hlen, tlen, copylen = 0;
> > > > >   	long timeo = 0;
> > > > > +	int vnet_hdr_sz;
> > > > >   	mutex_lock(&po->pg_vec_lock);
> > > > > @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > > >   		tlen = dev->needed_tailroom;
> > > > >   		if (po->has_vnet_hdr) {
> > > > >   			vnet_hdr = data;
> > > > > -			data += sizeof(*vnet_hdr);
> > > > > -			tp_len -= sizeof(*vnet_hdr);
> > > > > +			vnet_hdr_sz = po->vnet_hdr_sz;
> > > > > +			data += vnet_hdr_sz;
> > > > > +			tp_len -= vnet_hdr_sz;
> > > > >   			if (tp_len < 0 ||
> > > > >   			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
> > > > >   				tp_len = -EINVAL;
> > > > > @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > > >   	int offset = 0;
> > > > >   	struct packet_sock *po = pkt_sk(sk);
> > > > >   	bool has_vnet_hdr = false;
> > > > > +	int vnet_hdr_sz;
> > > > >   	int hlen, tlen, linear;
> > > > >   	int extra_len = 0;
> > > > > @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > > >   	if (sock->type == SOCK_RAW)
> > > > >   		reserve = dev->hard_header_len;
> > > > >   	if (po->has_vnet_hdr) {
> > > > > -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
> > > > > +		vnet_hdr_sz = po->vnet_hdr_sz;
> > > > > +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
> > > > >   		if (err)
> > > > >   			goto out_unlock;
> > > > >   		has_vnet_hdr = true;
> > > > > @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> > > > >   		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
> > > > >   		if (err)
> > > > >   			goto out_free;
> > > > > -		len += sizeof(vnet_hdr);
> > > > > +		len += vnet_hdr_sz;
> > > > >   		virtio_net_hdr_set_proto(skb, &vnet_hdr);
> > > > >   	}
> > > > > @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> > > > >   	packet_rcv_try_clear_pressure(pkt_sk(sk));
> > > > >   	if (pkt_sk(sk)->has_vnet_hdr) {
> > > > > -		err = packet_rcv_vnet(msg, skb, &len);
> > > > > +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
> > > > > +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
> > > > >   		if (err)
> > > > >   			goto out_free;
> > > > > -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
> > > > >   	}
> > > > >   	/* You lose any data beyond the buffer you gave. If it worries
> > > > > -- 
> > > > > 1.8.3.1
> > 
> 



* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-10 15:39         ` Willem de Bruijn
@ 2023-02-13 11:06           ` 沈安琪(凛玥)
  2023-02-14 14:28             ` Willem de Bruijn
  0 siblings, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-13 11:06 UTC (permalink / raw)
  To: Willem de Bruijn, Michael S. Tsirkin
  Cc: netdev, davem, jasowang, 谈鉴锋


On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
> Michael S. Tsirkin wrote:
>> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
>>> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
>>>> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
>>>>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>>>>>
>>>>> When raw socket is used as the backend for kernel vhost, currently it
>>>>> will regard the virtio net header as 10-byte, which is not always the
>>>>> case since some virtio features need virtio net header other than
>>>>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
>>>>> net header.
>>>>>
>>>>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
>>>>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
>>>>> size that is recorded in packet_sock to indicate the exact virtio net
>>>>> header size that virtio user actually prepares in the packets. By doing
>>>>> so, it can fix the issue of incorrect mac header parsing when these
>>>>> virtio features that need virtio net header other than 10-byte are
>>>>> enabled.
>>>>>
>>>>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
>>>>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
>>>>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
>>>> Does it handle VERSION_1 though? That one is also LE.
>>>> Would it be better to pass a features bitmap instead?
>>>
>>> Thanks for quick reply!
>>>
>>> I am a little confused about what "LE" represents here?
>> LE == little_endian.
>> Little endian format.
>>
>>> For passing a features bitmap to af_packet here, our consideration is
>>> whether it will be too complicated for af_packet to understand the virtio
>>> features bitmap in order to get the vnet header size. For now, all the
>>> virtio features stuff is handled by vhost worker and af_packet actually does
>>> not need to know much about virtio features. Would it be better if we keep
>>> the virtio feature stuff in user-level and let user-level tell af_packet how
>>> much space it should reserve?
>> Presumably, we'd add an API in include/linux/virtio_net.h ?
> Better leave this opaque to packet sockets if they won't act on this
> type info.
>   
> This patch series probably should be a single patch btw. As else the
> socket option introduced in the first is broken at that commit, since
> the behavior is only introduced in patch 2.


Good point, will merge this patch series into one patch.


Thanks for Michael's enlightening advice. We plan to change the new 
UAPI from a socket option that sets only the vnet header size to one 
that sets a bitmap of virtio features, and to implement a helper 
function in include/linux/virtio_net.h that parses the feature bitmap. 
Packet sockets then have no need to understand the feature bitmap; they 
only pass it to the virtio_net helper and get back the information they 
need, such as the vnet header size.

This change will make the new UAPI more general and avoid further 
modification if more virtio features need to be supported in the future.


>
>>>>
>>>>> ---
>>>>>    net/packet/af_packet.c | 48 +++++++++++++++++++++++++++++++++---------------
>>>>>    1 file changed, 33 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>>>>> index ab37baf..4f49939 100644
>>>>> --- a/net/packet/af_packet.c
>>>>> +++ b/net/packet/af_packet.c
>>>>> @@ -2092,18 +2092,25 @@ static unsigned int run_filter(struct sk_buff *skb,
>>>>>    }
>>>>>    static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
>>>>> -			   size_t *len)
>>>>> +			   size_t *len, int vnet_hdr_sz)
>>>>>    {
>>>>>    	struct virtio_net_hdr vnet_hdr;
>>>>> +	int ret;
>>>>> -	if (*len < sizeof(vnet_hdr))
>>>>> +	if (*len < vnet_hdr_sz)
>>>>>    		return -EINVAL;
>>>>> -	*len -= sizeof(vnet_hdr);
>>>>> +	*len -= vnet_hdr_sz;
>>>>>    	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
>>>>>    		return -EINVAL;
>>>>> -	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
>>>>> +	ret = memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
>>>>> +
>>>>> +	/* reserve space for extra info in vnet_hdr if needed */
>>>>> +	if (ret == 0)
>>>>> +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(vnet_hdr));
>>>>> +
> How about
>
>      struct virtio_net_hdr_mrg_rxbuf vnet_hdr = { .num_buffers = 0 };
>
>      ..
>
>      ret = memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz);
>
> To initialize data correctly and avoid the extra function call.


It makes sense. Thanks for pointing out and we will address it in the 
next version of this patch.


>
>>>>> +	return ret;
>>>>>    }
>>>>>    /*
>>>>> @@ -2311,7 +2318,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>>>    				       (maclen < 16 ? 16 : maclen)) +
>>>>>    				       po->tp_reserve;
>>>>>    		if (po->has_vnet_hdr) {
>>>>> -			netoff += sizeof(struct virtio_net_hdr);
>>>>> +			netoff += po->vnet_hdr_sz;
>>>>>    			do_vnet = true;
>>>>>    		}
>>>>>    		macoff = netoff - maclen;
>>>>> @@ -2552,16 +2559,23 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
>>>>>    }
>>>>>    static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
>>>>> -				 struct virtio_net_hdr *vnet_hdr)
>>>>> +				 struct virtio_net_hdr *vnet_hdr, int vnet_hdr_sz)
>>>>>    {
>>>>> -	if (*len < sizeof(*vnet_hdr))
>>>>> +	int ret;
>>>>> +
>>>>> +	if (*len < vnet_hdr_sz)
>>>>>    		return -EINVAL;
>>>>> -	*len -= sizeof(*vnet_hdr);
>>>>> +	*len -= vnet_hdr_sz;
>>>>>    	if (!copy_from_iter_full(vnet_hdr, sizeof(*vnet_hdr), &msg->msg_iter))
>>>>>    		return -EFAULT;
>>>>> -	return __packet_snd_vnet_parse(vnet_hdr, *len);
>>>>> +	ret = __packet_snd_vnet_parse(vnet_hdr, *len);
>>>>> +
>>>>> +	/* move iter to point to the start of mac header */
>>>>> +	if (ret == 0)
>>>>> +		iov_iter_advance(&msg->msg_iter, vnet_hdr_sz - sizeof(struct virtio_net_hdr));
>>>>> +	return ret;
>>>>>    }
>>>>>    static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
>>>>> @@ -2730,6 +2744,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>>>>    	int status = TP_STATUS_AVAILABLE;
>>>>>    	int hlen, tlen, copylen = 0;
>>>>>    	long timeo = 0;
>>>>> +	int vnet_hdr_sz;
>>>>>    	mutex_lock(&po->pg_vec_lock);
>>>>> @@ -2811,8 +2826,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>>>>    		tlen = dev->needed_tailroom;
>>>>>    		if (po->has_vnet_hdr) {
>>>>>    			vnet_hdr = data;
>>>>> -			data += sizeof(*vnet_hdr);
>>>>> -			tp_len -= sizeof(*vnet_hdr);
>>>>> +			vnet_hdr_sz = po->vnet_hdr_sz;
>>>>> +			data += vnet_hdr_sz;
>>>>> +			tp_len -= vnet_hdr_sz;
>>>>>    			if (tp_len < 0 ||
>>>>>    			    __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
>>>>>    				tp_len = -EINVAL;
>>>>> @@ -2947,6 +2963,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>>>>>    	int offset = 0;
>>>>>    	struct packet_sock *po = pkt_sk(sk);
>>>>>    	bool has_vnet_hdr = false;
>>>>> +	int vnet_hdr_sz;
>>>>>    	int hlen, tlen, linear;
>>>>>    	int extra_len = 0;
>>>>> @@ -2991,7 +3008,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>>>>>    	if (sock->type == SOCK_RAW)
>>>>>    		reserve = dev->hard_header_len;
>>>>>    	if (po->has_vnet_hdr) {
>>>>> -		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr);
>>>>> +		vnet_hdr_sz = po->vnet_hdr_sz;
>>>>> +		err = packet_snd_vnet_parse(msg, &len, &vnet_hdr, vnet_hdr_sz);
>>>>>    		if (err)
>>>>>    			goto out_unlock;
>>>>>    		has_vnet_hdr = true;
>>>>> @@ -3068,7 +3086,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>>>>>    		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
>>>>>    		if (err)
>>>>>    			goto out_free;
>>>>> -		len += sizeof(vnet_hdr);
>>>>> +		len += vnet_hdr_sz;
>>>>>    		virtio_net_hdr_set_proto(skb, &vnet_hdr);
>>>>>    	}
>>>>> @@ -3452,10 +3470,10 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>>>>>    	packet_rcv_try_clear_pressure(pkt_sk(sk));
>>>>>    	if (pkt_sk(sk)->has_vnet_hdr) {
>>>>> -		err = packet_rcv_vnet(msg, skb, &len);
>>>>> +		vnet_hdr_len = pkt_sk(sk)->vnet_hdr_sz;
>>>>> +		err = packet_rcv_vnet(msg, skb, &len, vnet_hdr_len);
>>>>>    		if (err)
>>>>>    			goto out_free;
>>>>> -		vnet_hdr_len = sizeof(struct virtio_net_hdr);
>>>>>    	}
>>>>>    	/* You lose any data beyond the buffer you gave. If it worries
>>>>> -- 
>>>>> 1.8.3.1


* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-13 11:06           ` 沈安琪(凛玥)
@ 2023-02-14 14:28             ` Willem de Bruijn
  2023-02-21  9:40               ` 沈安琪(凛玥)
  0 siblings, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2023-02-14 14:28 UTC (permalink / raw)
  To: 沈安琪(凛玥),
	Willem de Bruijn, Michael S. Tsirkin
  Cc: netdev, davem, jasowang, 谈鉴锋

沈安琪(凛玥) wrote:
> 
> On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
> > Michael S. Tsirkin wrote:
> >> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> >>> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> >>>> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> >>>>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> >>>>>
> >>>>> When raw socket is used as the backend for kernel vhost, currently it
> >>>>> will regard the virtio net header as 10-byte, which is not always the
> >>>>> case since some virtio features need virtio net header other than
> >>>>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> >>>>> net header.
> >>>>>
> >>>>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> >>>>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> >>>>> size that is recorded in packet_sock to indicate the exact virtio net
> >>>>> header size that virtio user actually prepares in the packets. By doing
> >>>>> so, it can fix the issue of incorrect mac header parsing when these
> >>>>> virtio features that need virtio net header other than 10-byte are
> >>>>> enabled.
> >>>>>
> >>>>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> >>>>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> >>>>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> >>>> Does it handle VERSION_1 though? That one is also LE.
> >>>> Would it be better to pass a features bitmap instead?
> >>>
> >>> Thanks for quick reply!
> >>>
> >>> I am a little confused about what "LE" represents here?
> >> LE == little_endian.
> >> Little endian format.
> >>
> >>> For passing a features bitmap to af_packet here, our consideration is
> >>> whether it will be too complicated for af_packet to understand the virtio
> >>> features bitmap in order to get the vnet header size. For now, all the
> >>> virtio features stuff is handled by vhost worker and af_packet actually does
> >>> not need to know much about virtio features. Would it be better if we keep
> >>> the virtio feature stuff in user-level and let user-level tell af_packet how
> >>> much space it should reserve?
> >> Presumably, we'd add an API in include/linux/virtio_net.h ?
> > Better leave this opaque to packet sockets if they won't act on this
> > type info.
> >   
> > This patch series probably should be a single patch btw. As else the
> > socket option introduced in the first is broken at that commit, since
> > the behavior is only introduced in patch 2.
> 
> 
> Good point, will merge this patch series into one patch.
> 
> 
> Thanks for Michael's enlightening advice, we plan to modify current UAPI 
> change of adding an extra socketopt from only setting vnet header size 
> only to setting a bit-map of virtio features, and implement another 
> helper function in include/linux/virtio_net.h to parse the feature 
> bit-map. In this case, packet sockets have no need to understand the 
> feature bit-map but only pass this bit-map to virtio_net helper and get 
> back the information, such as vnet header size, it needs.
> 
> This change will make the new UAPI more general and avoid further 
> modification if there are more virtio features to support in the future.
>

Please also comment on how these UAPI extensions are intended to be
used, as that use is not included in this initial patch series.

If the only intended user is vhost-net, we can consider not exposing
outside the kernel at all. That makes it easier to iterate if
necessary (no stable ABI) and avoids accidentally opening up new
avenues for bugs and exploits (syzkaller has a history with
virtio_net_header options).


* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-14 14:28             ` Willem de Bruijn
@ 2023-02-21  9:40               ` 沈安琪(凛玥)
  2023-02-21 15:03                 ` Willem de Bruijn
  0 siblings, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-21  9:40 UTC (permalink / raw)
  To: Willem de Bruijn, Michael S. Tsirkin
  Cc: netdev, davem, jasowang, 谈鉴锋


On 2023/2/14 10:28 PM, Willem de Bruijn wrote:
> 沈安琪(凛玥) wrote:
>> On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
>>> Michael S. Tsirkin wrote:
>>>> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
>>>>> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
>>>>>> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
>>>>>>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>>>>>>>
>>>>>>> When raw socket is used as the backend for kernel vhost, currently it
>>>>>>> will regard the virtio net header as 10-byte, which is not always the
>>>>>>> case since some virtio features need virtio net header other than
>>>>>>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
>>>>>>> net header.
>>>>>>>
>>>>>>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
>>>>>>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
>>>>>>> size that is recorded in packet_sock to indicate the exact virtio net
>>>>>>> header size that virtio user actually prepares in the packets. By doing
>>>>>>> so, it can fix the issue of incorrect mac header parsing when these
>>>>>>> virtio features that need virtio net header other than 10-byte are
>>>>>>> enabled.
>>>>>>>
>>>>>>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
>>>>>>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
>>>>>>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
>>>>>> Does it handle VERSION_1 though? That one is also LE.
>>>>>> Would it be better to pass a features bitmap instead?
>>>>> Thanks for quick reply!
>>>>>
>>>>> I am a little confused about what "LE" represents here?
>>>> LE == little_endian.
>>>> Little endian format.
>>>>
>>>>> For passing a features bitmap to af_packet here, our consideration is
>>>>> whether it will be too complicated for af_packet to understand the virtio
>>>>> features bitmap in order to get the vnet header size. For now, all the
>>>>> virtio features stuff is handled by vhost worker and af_packet actually does
>>>>> not need to know much about virtio features. Would it be better if we keep
>>>>> the virtio feature stuff in user-level and let user-level tell af_packet how
>>>>> much space it should reserve?
>>>> Presumably, we'd add an API in include/linux/virtio_net.h ?
>>> Better leave this opaque to packet sockets if they won't act on this
>>> type info.
>>>    
>>> This patch series probably should be a single patch btw. As else the
>>> socket option introduced in the first is broken at that commit, since
>>> the behavior is only introduced in patch 2.
>>
>> Good point, will merge this patch series into one patch.
>>
>>
>> Thanks for Michael's enlightening advice, we plan to modify current UAPI
>> change of adding an extra socketopt from only setting vnet header size
>> only to setting a bit-map of virtio features, and implement another
>> helper function in include/linux/virtio_net.h to parse the feature
>> bit-map. In this case, packet sockets have no need to understand the
>> feature bit-map but only pass this bit-map to virtio_net helper and get
>> back the information, such as vnet header size, it needs.
>>
>> This change will make the new UAPI more general and avoid further
>> modification if there are more virtio features to support in the future.
>>
> Please also comment how these UAPI extension are intended to be used.
> As that use is not included in this initial patch series.
>
> If the only intended user is vhost-net, we can consider not exposing
> outside the kernel at all. That makes it easier to iterate if
> necessary (no stable ABI) and avoids accidentally opening up new
> avenues for bugs and exploits (syzkaller has a history with
> virtio_net_header options).


Our concern is that there seems to be no solution other than a UAPI to 
let packet sockets know the vnet header size they should use.

The vhost receive path, handle_rx() at drivers/vhost/net.c:1109, 
abstracts away the backend device it uses and directly invokes the 
corresponding socket ops, with no extra information indicating that the 
call comes from the vhost worker. The vhost worker does not actually 
know the type of backend device it is using; only virtio-user knows 
which backend device it uses. Therefore, it seems impossible for vhost 
to pass the vnet header information to the target backend device.

Tap, another kind of backend device vhost may use, lets virtio-user set 
whether it needs a vnet header and how long that header is through an 
ioctl (implemented at drivers/net/tap.c:1066).

Given this, we wonder whether we should align with what tap does and 
set the vnet header size through setsockopt for packet sockets.

We would really appreciate suggestions on any other potential 
approaches to pass this vnet header size information from virtio-user 
to the packet socket.




* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-21  9:40               ` 沈安琪(凛玥)
@ 2023-02-21 15:03                 ` Willem de Bruijn
  2023-02-22  8:04                   ` 沈安琪(凛玥)
  0 siblings, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2023-02-21 15:03 UTC (permalink / raw)
  To: 沈安琪(凛玥),
	Willem de Bruijn, Michael S. Tsirkin
  Cc: netdev, davem, jasowang, 谈鉴锋

沈安琪(凛玥) wrote:
> 
> On 2023/2/14 10:28 PM, Willem de Bruijn wrote:
> > 沈安琪(凛玥) wrote:
> >> On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
> >>> Michael S. Tsirkin wrote:
> >>>> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> >>>>> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> >>>>>> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> >>>>>>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> >>>>>>>
> >>>>>>> When raw socket is used as the backend for kernel vhost, currently it
> >>>>>>> will regard the virtio net header as 10-byte, which is not always the
> >>>>>>> case since some virtio features need virtio net header other than
> >>>>>>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> >>>>>>> net header.
> >>>>>>>
> >>>>>>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> >>>>>>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> >>>>>>> size that is recorded in packet_sock to indicate the exact virtio net
> >>>>>>> header size that virtio user actually prepares in the packets. By doing
> >>>>>>> so, it can fix the issue of incorrect mac header parsing when these
> >>>>>>> virtio features that need virtio net header other than 10-byte are
> >>>>>>> enabled.
> >>>>>>>
> >>>>>>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> >>>>>>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> >>>>>>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> >>>>>> Does it handle VERSION_1 though? That one is also LE.
> >>>>>> Would it be better to pass a features bitmap instead?
> >>>>> Thanks for quick reply!
> >>>>>
> >>>>> I am a little confused about what "LE" represents here?
> >>>> LE == little_endian.
> >>>> Little endian format.
> >>>>
> >>>>> For passing a features bitmap to af_packet here, our consideration is
> >>>>> whether it will be too complicated for af_packet to understand the virtio
> >>>>> features bitmap in order to get the vnet header size. For now, all the
> >>>>> virtio features stuff is handled by vhost worker and af_packet actually does
> >>>>> not need to know much about virtio features. Would it be better if we keep
> >>>>> the virtio feature stuff in user-level and let user-level tell af_packet how
> >>>>> much space it should reserve?
> >>>> Presumably, we'd add an API in include/linux/virtio_net.h ?
> >>> Better leave this opaque to packet sockets if they won't act on this
> >>> type info.
> >>>    
> >>> This patch series probably should be a single patch btw. As else the
> >>> socket option introduced in the first is broken at that commit, since
> >>> the behavior is only introduced in patch 2.
> >>
> >> Good point, will merge this patch series into one patch.
> >>
> >>
> >> Thanks for Michael's enlightening advice, we plan to modify current UAPI
> >> change of adding an extra socketopt from only setting vnet header size
> >> only to setting a bit-map of virtio features, and implement another
> >> helper function in include/linux/virtio_net.h to parse the feature
> >> bit-map. In this case, packet sockets have no need to understand the
> >> feature bit-map but only pass this bit-map to virtio_net helper and get
> >> back the information, such as vnet header size, it needs.
> >>
> >> This change will make the new UAPI more general and avoid further
> >> modification if there are more virtio features to support in the future.
> >>
> > Please also comment how these UAPI extension are intended to be used.
> > As that use is not included in this initial patch series.
> >
> > If the only intended user is vhost-net, we can consider not exposing
> > outside the kernel at all. That makes it easier to iterate if
> > necessary (no stable ABI) and avoids accidentally opening up new
> > avenues for bugs and exploits (syzkaller has a history with
> > virtio_net_header options).
> 
> 
> Our concern is that there seems to be no solution other than UAPI to let
> packet sockets know the vnet header size they should use.
> 
> Receiving packets in vhost driver, implemented in drivers/vhost/net.c: 
> 1109 handle_rx(), will abstract the backend device it uses and directly 
> invoke the corresponding socket ops with no extra information indicating 
> it is invoked by vhost worker. Vhost worker actually does not know the 
> type of backend device it is using; only virtio-user knows what type of 
> backend device it uses. Therefore, it seems impossible to let vhost set 
> the vnet header information to the target backend device.
> 
> Tap, another kind of backend device vhost may use, lets virtio-user set 
> whether it needs vnet header and how long the vnet header is through 
> ioctl. (implemented in drivers/net/tap.c:1066)
> 
> In this case, we wonder whether we should align with what tap does and 
> set vnet hdr size through setsockopt for packet_sockets.
> 
> We really appreciate suggestions on any potential approaches to pass 
> this vnet header size information from virtio-user to the packet socket.

You're right. This is configured from userspace before the FD is passed
to vhost-net, so indeed this will require packet socket UAPI support.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-21 15:03                 ` Willem de Bruijn
@ 2023-02-22  8:04                   ` 沈安琪(凛玥)
  2023-02-22 11:37                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-22  8:04 UTC (permalink / raw)
  To: Willem de Bruijn, Michael S. Tsirkin
  Cc: netdev, davem, jasowang, 谈鉴锋


On 2023/2/21 11:03 PM, Willem de Bruijn wrote:
> 沈安琪(凛玥) wrote:
>> On 2023/2/14 10:28 PM, Willem de Bruijn wrote:
>>> 沈安琪(凛玥) wrote:
>>>> On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
>>>>> Michael S. Tsirkin wrote:
>>>>>> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
>>>>>>> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
>>>>>>>>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>>>>>>>>>
>>>>>>>>> When raw socket is used as the backend for kernel vhost, currently it
>>>>>>>>> will regard the virtio net header as 10-byte, which is not always the
>>>>>>>>> case since some virtio features need virtio net header other than
>>>>>>>>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
>>>>>>>>> net header.
>>>>>>>>>
>>>>>>>>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
>>>>>>>>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
>>>>>>>>> size that is recorded in packet_sock to indicate the exact virtio net
>>>>>>>>> header size that virtio user actually prepares in the packets. By doing
>>>>>>>>> so, it can fix the issue of incorrect mac header parsing when these
>>>>>>>>> virtio features that need virtio net header other than 10-byte are
>>>>>>>>> enabled.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
>>>>>>>>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
>>>>>>>>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
>>>>>>>> Does it handle VERSION_1 though? That one is also LE.
>>>>>>>> Would it be better to pass a features bitmap instead?
>>>>>>> Thanks for quick reply!
>>>>>>>
>>>>>>> I am a little confused about what "LE" represents here?
>>>>>> LE == little_endian.
>>>>>> Little endian format.
>>>>>>
>>>>>>> For passing a features bitmap to af_packet here, our consideration is
>>>>>>> whether it will be too complicated for af_packet to understand the virtio
>>>>>>> features bitmap in order to get the vnet header size. For now, all the
>>>>>>> virtio features stuff is handled by vhost worker and af_packet actually does
>>>>>>> not need to know much about virtio features. Would it be better if we keep
>>>>>>> the virtio feature stuff in user-level and let user-level tell af_packet how
>>>>>>> much space it should reserve?
>>>>>> Presumably, we'd add an API in include/linux/virtio_net.h ?
>>>>> Better leave this opaque to packet sockets if they won't act on this
>>>>> type info.
>>>>>     
>>>>> This patch series probably should be a single patch btw. As else the
>>>>> socket option introduced in the first is broken at that commit, since
>>>>> the behavior is only introduced in patch 2.
>>>> Good point, will merge this patch series into one patch.
>>>>
>>>>
>>>> Thanks for Michael's enlightening advice, we plan to modify current UAPI
>>>> change of adding an extra socketopt from setting only the vnet header size
>>>> to setting a bit-map of virtio features, and implement another
>>>> helper function in include/linux/virtio_net.h to parse the feature
>>>> bit-map. In this case, packet sockets have no need to understand the
>>>> feature bit-map but only pass this bit-map to virtio_net helper and get
>>>> back the information, such as vnet header size, it needs.
>>>>
>>>> This change will make the new UAPI more general and avoid further
>>>> modification if there are more virtio features to support in the future.
>>>>
>>> Please also comment how these UAPI extensions are intended to be used.
>>> As that use is not included in this initial patch series.
>>>
>>> If the only intended user is vhost-net, we can consider not exposing
>>> outside the kernel at all. That makes it easier to iterate if
>>> necessary (no stable ABI) and avoids accidentally opening up new
>>> avenues for bugs and exploits (syzkaller has a history with
>>> virtio_net_header options).
>>
>> Our concern is, it seems there is no other solution than uapi to let
>> packet sockets know the vnet header size they should use.
>>
>> Receiving packets in vhost driver, implemented in drivers/vhost/net.c:
>> 1109 handle_rx(), will abstract the backend device it uses and directly
>> invoke the corresponding socket ops with no extra information indicating
>> it is invoked by vhost worker. Vhost worker actually does not know the
>> type of backend device it is using; only virtio-user knows what type of
>> backend device it uses. Therefore, it seems impossible to let vhost set
>> the vnet header information to the target backend device.
>>
>> Tap, another kind of backend device vhost may use, lets virtio-user set
>> whether it needs vnet header and how long the vnet header is through
>> ioctl. (implemented in drivers/net/tap.c:1066)
>>
>> In this case, we wonder whether we should align with what tap does and
>> set vnet hdr size through setsockopt for packet_sockets.
>>
>> We really appreciate suggestions on any potential approaches to pass
>> this vnet header size information from virtio-user to the packet socket.
> You're right. This is configured from userspace before the FD is passed
> to vhost-net, so indeed this will require packet socket UAPI support.


Thanks for the quick reply. We will go with adding an extra UAPI here, then.


Another design question for this UAPI is whether it is better to support 
setting only the vnet header size, just as TAP does in its ioctl, or to 
support setting a virtio feature bit-map.


UAPI setting only vnet header size

Pros:

1. It aligns with how other virtio backend devices communicate with 
virtio-user.

2. We can use the holes in struct packet_sock 
(net/packet/internal.h:120) to record the extra information, since the 
size info only takes 8 bits.

Cons:

1. There may be more information that virtio-user needs to communicate 
to the packet socket in the future, which would require further UAPI 
additions here.

To Michael: Is there any other information that the backend device needs 
and that virtio-user would provide?


UAPI setting a virtio feature bit-map

Pros:

1. It is more general and may reduce future UAPI changes.

Cons:

1. A virtio feature bit-map needs 64 bits, which requires adding an 
extra field to struct packet_sock.

2. Virtio-user needs to be aware that a packet-socket backend uses a 
different approach to negotiate the vnet header size.


We really appreciate any suggestion or discussion on this UAPI design 
choice.



* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-22  8:04                   ` 沈安琪(凛玥)
@ 2023-02-22 11:37                     ` Michael S. Tsirkin
  2023-02-22 11:43                       ` 沈安琪(凛玥)
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2023-02-22 11:37 UTC (permalink / raw)
  To: 沈安琪(凛玥)
  Cc: Willem de Bruijn, netdev, davem, jasowang, 谈鉴锋

On Wed, Feb 22, 2023 at 04:04:34PM +0800, 沈安琪(凛玥) wrote:
> 
> On 2023/2/21 11:03 PM, Willem de Bruijn wrote:
> > 沈安琪(凛玥) wrote:
> > > On 2023/2/14 10:28 PM, Willem de Bruijn wrote:
> > > > 沈安琪(凛玥) wrote:
> > > > > On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
> > > > > > Michael S. Tsirkin wrote:
> > > > > > > On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
> > > > > > > > On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
> > > > > > > > > > From: "Jianfeng Tan" <henry.tjf@antgroup.com>
> > > > > > > > > > 
> > > > > > > > > > When raw socket is used as the backend for kernel vhost, currently it
> > > > > > > > > > will regard the virtio net header as 10-byte, which is not always the
> > > > > > > > > > case since some virtio features need virtio net header other than
> > > > > > > > > > 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
> > > > > > > > > > net header.
> > > > > > > > > > 
> > > > > > > > > > Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
> > > > > > > > > > tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
> > > > > > > > > > size that is recorded in packet_sock to indicate the exact virtio net
> > > > > > > > > > header size that virtio user actually prepares in the packets. By doing
> > > > > > > > > > so, it can fix the issue of incorrect mac header parsing when these
> > > > > > > > > > virtio features that need virtio net header other than 10-byte are
> > > > > > > > > > enabled.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
> > > > > > > > > > Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
> > > > > > > > > > Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
> > > > > > > > > Does it handle VERSION_1 though? That one is also LE.
> > > > > > > > > Would it be better to pass a features bitmap instead?
> > > > > > > > Thanks for quick reply!
> > > > > > > > 
> > > > > > > > I am a little confused about what "LE" represents here?
> > > > > > > LE == little_endian.
> > > > > > > Little endian format.
> > > > > > > 
> > > > > > > > For passing a features bitmap to af_packet here, our consideration is
> > > > > > > > whether it will be too complicated for af_packet to understand the virtio
> > > > > > > > features bitmap in order to get the vnet header size. For now, all the
> > > > > > > > virtio features stuff is handled by vhost worker and af_packet actually does
> > > > > > > > not need to know much about virtio features. Would it be better if we keep
> > > > > > > > the virtio feature stuff in user-level and let user-level tell af_packet how
> > > > > > > > much space it should reserve?
> > > > > > > Presumably, we'd add an API in include/linux/virtio_net.h ?
> > > > > > Better leave this opaque to packet sockets if they won't act on this
> > > > > > type info.
> > > > > > This patch series probably should be a single patch btw. As else the
> > > > > > socket option introduced in the first is broken at that commit, since
> > > > > > the behavior is only introduced in patch 2.
> > > > > Good point, will merge this patch series into one patch.
> > > > > 
> > > > > 
> > > > > Thanks for Michael's enlightening advice, we plan to modify current UAPI
> > > > > change of adding an extra socketopt from setting only the vnet header size
> > > > > to setting a bit-map of virtio features, and implement another
> > > > > helper function in include/linux/virtio_net.h to parse the feature
> > > > > bit-map. In this case, packet sockets have no need to understand the
> > > > > feature bit-map but only pass this bit-map to virtio_net helper and get
> > > > > back the information, such as vnet header size, it needs.
> > > > > 
> > > > > This change will make the new UAPI more general and avoid further
> > > > > modification if there are more virtio features to support in the future.
> > > > > 
> > > > Please also comment how these UAPI extensions are intended to be used.
> > > > As that use is not included in this initial patch series.
> > > > 
> > > > If the only intended user is vhost-net, we can consider not exposing
> > > > outside the kernel at all. That makes it easier to iterate if
> > > > necessary (no stable ABI) and avoids accidentally opening up new
> > > > avenues for bugs and exploits (syzkaller has a history with
> > > > virtio_net_header options).
> > > 
> > > Our concern is, it seems there is no other solution than uapi to let
> > > packet sockets know the vnet header size they should use.
> > > 
> > > Receiving packets in vhost driver, implemented in drivers/vhost/net.c:
> > > 1109 handle_rx(), will abstract the backend device it uses and directly
> > > invoke the corresponding socket ops with no extra information indicating
> > > it is invoked by vhost worker. Vhost worker actually does not know the
> > > type of backend device it is using; only virtio-user knows what type of
> > > backend device it uses. Therefore, it seems impossible to let vhost set
> > > the vnet header information to the target backend device.
> > > 
> > > Tap, another kind of backend device vhost may use, lets virtio-user set
> > > whether it needs vnet header and how long the vnet header is through
> > > ioctl. (implemented in drivers/net/tap.c:1066)
> > > 
> > > In this case, we wonder whether we should align with what tap does and
> > > set vnet hdr size through setsockopt for packet_sockets.
> > > 
> > > We really appreciate suggestions on any potential approaches to pass
> > > this vnet header size information from virtio-user to the packet socket.
> > You're right. This is configured from userspace before the FD is passed
> > to vhost-net, so indeed this will require packet socket UAPI support.
> 
> 
> Thanks for quick reply. We will go with adding an extra UAPI here then.
> 
> 
> Another discussion for designing this UAPI is, whether it will be better to
> support setting only vnet header size, just like what TAP does in its ioctl,
> or to support setting a virtio feature bit-map.
> 
> 
> UAPI setting only vnet header size
> 
> Pros:
> 
> 1. It aligns with how other virtio backend devices communicate with
> virtio-user
> 
> 2. We can use the holes in struct packet_sock (net/packet/internal.h:120)
> to record the extra information since the size info only takes 8 bits.
> 
> Cons:
> 
> 1. It may have more information that virtio-user needs to communicate with
> packet socket in the future and needs to add more UAPI supports here.
> 
> To Michael: Is there any other information that backend device needs and
> will be given from virtio-user?


Yes, e.g. I already mentioned virtio 1.0 wrt LE versus native-endian
format.


> 
> UAPI setting a virtio feature bit-map
> 
> Pros:
> 
> 1. It is more general and may reduce future UAPI changes.
> 
> Cons:
> 
> 1. A virtio feature bit-map needs 64 bits, which needs to add an extra field
> in packet_sock struct
> 
> 2. Virtio-user needs to be aware that using packet socket as backend supports
> different approach to negotiate the vnet header size.
> 
> 
> We really appreciate any suggestion or discussion on this design choice of
> UAPI.

In the end it's ok with just size too, you just probably shouldn't say
you support VERSION_1 if you are not passing that bit.

-- 
MST



* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-22 11:37                     ` Michael S. Tsirkin
@ 2023-02-22 11:43                       ` 沈安琪(凛玥)
  2023-02-22 15:04                         ` Willem de Bruijn
  0 siblings, 1 reply; 19+ messages in thread
From: 沈安琪(凛玥) @ 2023-02-22 11:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Willem de Bruijn, netdev, davem, jasowang, 谈鉴锋


On 2023/2/22 7:37 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 22, 2023 at 04:04:34PM +0800, 沈安琪(凛玥) wrote:
>> On 2023/2/21 11:03 PM, Willem de Bruijn wrote:
>>> 沈安琪(凛玥) wrote:
>>>> On 2023/2/14 10:28 PM, Willem de Bruijn wrote:
>>>>> 沈安琪(凛玥) wrote:
>>>>>> On 2023/2/10 11:39 PM, Willem de Bruijn wrote:
>>>>>>> Michael S. Tsirkin wrote:
>>>>>>>> On Fri, Feb 10, 2023 at 12:01:03PM +0800, 沈安琪(凛玥) wrote:
>>>>>>>>> On 2023/2/9 9:07 PM, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Feb 09, 2023 at 08:43:15PM +0800, 沈安琪(凛玥) wrote:
>>>>>>>>>>> From: "Jianfeng Tan" <henry.tjf@antgroup.com>
>>>>>>>>>>>
>>>>>>>>>>> When raw socket is used as the backend for kernel vhost, currently it
>>>>>>>>>>> will regard the virtio net header as 10-byte, which is not always the
>>>>>>>>>>> case since some virtio features need virtio net header other than
>>>>>>>>>>> 10-byte, such as mrg_rxbuf and VERSION_1 that both need 12-byte virtio
>>>>>>>>>>> net header.
>>>>>>>>>>>
>>>>>>>>>>> Instead of hardcoding virtio net header length to 10 bytes, tpacket_snd,
>>>>>>>>>>> tpacket_rcv, packet_snd and packet_recvmsg now get the virtio net header
>>>>>>>>>>> size that is recorded in packet_sock to indicate the exact virtio net
>>>>>>>>>>> header size that virtio user actually prepares in the packets. By doing
>>>>>>>>>>> so, it can fix the issue of incorrect mac header parsing when these
>>>>>>>>>>> virtio features that need virtio net header other than 10-byte are
>>>>>>>>>>> enabled.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
>>>>>>>>>>> Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
>>>>>>>>>>> Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
>>>>>>>>>> Does it handle VERSION_1 though? That one is also LE.
>>>>>>>>>> Would it be better to pass a features bitmap instead?
>>>>>>>>> Thanks for quick reply!
>>>>>>>>>
>>>>>>>>> I am a little confused about what "LE" represents here?
>>>>>>>> LE == little_endian.
>>>>>>>> Little endian format.
>>>>>>>>
>>>>>>>>> For passing a features bitmap to af_packet here, our consideration is
>>>>>>>>> whether it will be too complicated for af_packet to understand the virtio
>>>>>>>>> features bitmap in order to get the vnet header size. For now, all the
>>>>>>>>> virtio features stuff is handled by vhost worker and af_packet actually does
>>>>>>>>> not need to know much about virtio features. Would it be better if we keep
>>>>>>>>> the virtio feature stuff in user-level and let user-level tell af_packet how
>>>>>>>>> much space it should reserve?
>>>>>>>> Presumably, we'd add an API in include/linux/virtio_net.h ?
>>>>>>> Better leave this opaque to packet sockets if they won't act on this
>>>>>>> type info.
>>>>>>> This patch series probably should be a single patch btw. As else the
>>>>>>> socket option introduced in the first is broken at that commit, since
>>>>>>> the behavior is only introduced in patch 2.
>>>>>> Good point, will merge this patch series into one patch.
>>>>>>
>>>>>>
>>>>>> Thanks for Michael's enlightening advice, we plan to modify current UAPI
>>>>>> change of adding an extra socketopt from setting only the vnet header size
>>>>>> to setting a bit-map of virtio features, and implement another
>>>>>> helper function in include/linux/virtio_net.h to parse the feature
>>>>>> bit-map. In this case, packet sockets have no need to understand the
>>>>>> feature bit-map but only pass this bit-map to virtio_net helper and get
>>>>>> back the information, such as vnet header size, it needs.
>>>>>>
>>>>>> This change will make the new UAPI more general and avoid further
>>>>>> modification if there are more virtio features to support in the future.
>>>>>>
>>>>> Please also comment how these UAPI extensions are intended to be used.
>>>>> As that use is not included in this initial patch series.
>>>>>
>>>>> If the only intended user is vhost-net, we can consider not exposing
>>>>> outside the kernel at all. That makes it easier to iterate if
>>>>> necessary (no stable ABI) and avoids accidentally opening up new
>>>>> avenues for bugs and exploits (syzkaller has a history with
>>>>> virtio_net_header options).
>>>> Our concern is, it seems there is no other solution than uapi to let
>>>> packet sockets know the vnet header size they should use.
>>>>
>>>> Receiving packets in vhost driver, implemented in drivers/vhost/net.c:
>>>> 1109 handle_rx(), will abstract the backend device it uses and directly
>>>> invoke the corresponding socket ops with no extra information indicating
>>>> it is invoked by vhost worker. Vhost worker actually does not know the
>>>> type of backend device it is using; only virtio-user knows what type of
>>>> backend device it uses. Therefore, it seems impossible to let vhost set
>>>> the vnet header information to the target backend device.
>>>>
>>>> Tap, another kind of backend device vhost may use, lets virtio-user set
>>>> whether it needs vnet header and how long the vnet header is through
>>>> ioctl. (implemented in drivers/net/tap.c:1066)
>>>>
>>>> In this case, we wonder whether we should align with what tap does and
>>>> set vnet hdr size through setsockopt for packet_sockets.
>>>>
>>>> We really appreciate suggestions on any potential approaches to pass
>>>> this vnet header size information from virtio-user to the packet socket.
>>> You're right. This is configured from userspace before the FD is passed
>>> to vhost-net, so indeed this will require packet socket UAPI support.
>>
>> Thanks for quick reply. We will go with adding an extra UAPI here then.
>>
>>
>> Another discussion for designing this UAPI is, whether it will be better to
>> support setting only vnet header size, just like what TAP does in its ioctl,
>> or to support setting a virtio feature bit-map.
>>
>>
>> UAPI setting only vnet header size
>>
>> Pros:
>>
>> 1. It aligns with how other virtio backend devices communicate with
>> virtio-user
>>
>> 2. We can use the holes in struct packet_sock (net/packet/internal.h:120)
>> to record the extra information since the size info only takes 8 bits.
>>
>> Cons:
>>
>> 1. It may have more information that virtio-user needs to communicate with
>> packet socket in the future and needs to add more UAPI supports here.
>>
>> To Michael: Is there any other information that backend device needs and
>> will be given from virtio-user?
>
> Yes e.g. I already mentioned virtio 1.0 wrt LE versus native endian
> format.
>
>
>> UAPI setting a virtio feature bit-map
>>
>> Pros:
>>
>> 1. It is more general and may reduce future UAPI changes.
>>
>> Cons:
>>
>> 1. A virtio feature bit-map needs 64 bits, which needs to add an extra field
>> in packet_sock struct
>>
>> 2. Virtio-user needs to be aware that using packet socket as backend supports
>> different approach to negotiate the vnet header size.
>>
>>
>> We really appreciate any suggestion or discussion on this design choice of
>> UAPI.
> In the end it's ok with just size too, you just probably shouldn't say
> you support VERSION_1 if you are not passing that bit.
>

Sorry for the confusion caused by mentioning VERSION_1 in the commit 
log. We actually just intended to give an example of features that may 
need a 12-byte vnet header. We will remove it from the commit log in 
patch v2 to avoid confusion. Thanks a lot for your suggestions.



* Re: [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz
  2023-02-22 11:43                       ` 沈安琪(凛玥)
@ 2023-02-22 15:04                         ` Willem de Bruijn
  0 siblings, 0 replies; 19+ messages in thread
From: Willem de Bruijn @ 2023-02-22 15:04 UTC (permalink / raw)
  To: 沈安琪(凛玥), Michael S. Tsirkin
  Cc: Willem de Bruijn, netdev, davem, jasowang, 谈鉴锋

> >>> You're right. This is configured from userspace before the FD is passed
> >>> to vhost-net, so indeed this will require packet socket UAPI support.
> >>
> >> Thanks for quick reply. We will go with adding an extra UAPI here then.
> >>
> >>
> >> Another discussion for designing this UAPI is, whether it will be better to
> >> support setting only vnet header size, just like what TAP does in its ioctl,
> >> or to support setting a virtio feature bit-map.
> >>
> >>
> >> UAPI setting only vnet header size
> >>
> >> Pros:
> >>
> >> 1. It aligns with how other virtio backend devices communicate with
> >> virtio-user
> >>
> >> 2. We can use the holes in struct packet_sock (net/packet/internal.h:120)
> >> to record the extra information since the size info only takes 8 bits.
> >>
> >> Cons:
> >>
> >> 1. It may have more information that virtio-user needs to communicate with
> >> packet socket in the future and needs to add more UAPI supports here.
> >>
> >> To Michael: Is there any other information that backend device needs and
> >> will be given from virtio-user?
> >
> > Yes e.g. I already mentioned virtio 1.0 wrt LE versus native endian
> > format.
> >
> >
> >> UAPI setting a virtio feature bit-map
> >>
> >> Pros:
> >>
> >> 1. It is more general and may reduce future UAPI changes.
> >>
> >> Cons:
> >>
> >> 1. A virtio feature bit-map needs 64 bits, which needs to add an extra field
> >> in packet_sock struct

Accepting a bitmap in the ABI does not have to imply storing a bitmap.

> >>
> >> 2. Virtio-user needs to be aware that using packet socket as backend supports
> >> different approach to negotiate the vnet header size.
> >>
> >>
> >> We really appreciate any suggestion or discussion on this design choice of
> >> UAPI.
> > In the end it's ok with just size too, you just probably shouldn't say
> > you support VERSION_1 if you are not passing that bit.
> >
> 
> Sorry for the confusion here that we mentioned VERSION_1 in the commit 
> log. We actually just intended to give an example of what features that 
> may need 12-byte vnet header. We will remove it from the commit log in 
> patch v2 to avoid confusion here. Thanks a lot for your suggestions.

The question hinges on which features are expected to need support in
the future. So far we have

- extra num_buffers field
- little endian (V1)

Given the rate of change in the spec, I don't think this should
be overdesigned. If V1 is not planned to be supported, just
configure the header length. If it is, then perhaps use a feature
bitmap instead.



end of thread, other threads:[~2023-02-22 15:05 UTC | newest]

Thread overview: 19+ messages
2023-02-09 12:43 [PATCH 0/2] net/packet: support of specifying virtio net header size 沈安琪(凛玥)
2023-02-09 12:43 ` [PATCH 1/2] net/packet: add socketopt to set/get vnet_hdr_sz 沈安琪(凛玥)
2023-02-09 14:45   ` Willem de Bruijn
2023-02-10  4:00     ` 沈安琪(凛玥)
2023-02-09 12:43 ` [PATCH 2/2] net/packet: send and receive pkt with given vnet_hdr_sz 沈安琪(凛玥)
2023-02-09 13:07   ` Michael S. Tsirkin
2023-02-10  4:01     ` 沈安琪(凛玥)
2023-02-10  8:10       ` Michael S. Tsirkin
2023-02-10 15:36         ` Willem de Bruijn
2023-02-12  9:54           ` Michael S. Tsirkin
2023-02-10 15:39         ` Willem de Bruijn
2023-02-13 11:06           ` 沈安琪(凛玥)
2023-02-14 14:28             ` Willem de Bruijn
2023-02-21  9:40               ` 沈安琪(凛玥)
2023-02-21 15:03                 ` Willem de Bruijn
2023-02-22  8:04                   ` 沈安琪(凛玥)
2023-02-22 11:37                     ` Michael S. Tsirkin
2023-02-22 11:43                       ` 沈安琪(凛玥)
2023-02-22 15:04                         ` Willem de Bruijn
