* [PATCH net-next v5 00/15] virtio-net: support xdp socket zero copy
@ 2021-06-10  8:21 ` Xuan Zhuo
  0 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:21 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

XDP socket (xsk) is an excellent kernel-bypass network transmission
framework. Its zero-copy feature needs support from the driver, and the
zero-copy performance is very good. mlx5 and Intel ixgbe already support
this feature. This patch set allows virtio-net to support xsk's zero-copy
xmit feature as well.

Compared with other drivers, the difficulty with virtio-net is that when
we bind a channel to xsk, we cannot directly disable/enable that channel.
So we have to deal with the bufs that are still sitting in the vq after
the upper layer has released the xsk.

My solution is to attach a ctx to each buf placed in the vq, recording
the page it uses and taking a reference on that page. When the upper-layer
xsk is released, these pages remain safe in the vq; the bufs are then
processed when we recycle bufs or receive new data.

The rx case is more complicated, because a buf pulled from the vq may be
either an xsk buf or a normal buf. Especially in the mergeable case, we
may receive multiple bufs, and xsk bufs and normal bufs may be mixed
together.
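
As a rough illustration of the idea (not code from this series; the names
below are made up), the per-buf ctx boils down to pinning the page a buf
uses so it stays valid even after the xsk pool is torn down:

    /* illustrative sketch only */
    struct virtnet_xsk_ctx {
            struct page *page;      /* page backing the buf placed in the vq */
    };

    static void *virtnet_xsk_ctx_get(struct page *page)
    {
            struct virtnet_xsk_ctx *ctx;

            ctx = kmalloc(sizeof(*ctx), GFP_ATOMIC);
            if (!ctx)
                    return NULL;

            get_page(page);         /* extra reference keeps the page alive */
            ctx->page = page;
            return ctx;
    }

    static void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
    {
            put_page(ctx->page);    /* drop the reference taken above */
            kfree(ctx);
    }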

v5:
   support rx

v4:
    1. add priv_flags IFF_NOT_USE_DMA_ADDR
    2. more reasonable patch split

Xuan Zhuo (15):
  netdevice: priv_flags extend to 64bit
  netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
  virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
  xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
  virtio: support virtqueue_detach_unused_buf_ctx
  virtio-net: unify the code for recycling the xmit ptr
  virtio-net: standalone virtnet_alloc_frag function
  virtio-net: split the receive_mergeable function
  virtio-net: virtnet_poll_tx support budget check
  virtio-net: independent directory
  virtio-net: move to virtio_net.h
  virtio-net: support AF_XDP zc tx
  virtio-net: support AF_XDP zc rx
  virtio-net: xsk direct xmit inside xsk wakeup
  virtio-net: xsk zero copy xmit kick by threshold

 MAINTAINERS                           |   2 +-
 drivers/net/Kconfig                   |   8 +-
 drivers/net/Makefile                  |   2 +-
 drivers/net/virtio/Kconfig            |  11 +
 drivers/net/virtio/Makefile           |   7 +
 drivers/net/{ => virtio}/virtio_net.c | 670 +++++++++++-----------
 drivers/net/virtio/virtio_net.h       | 288 ++++++++++
 drivers/net/virtio/xsk.c              | 766 ++++++++++++++++++++++++++
 drivers/net/virtio/xsk.h              | 176 ++++++
 drivers/virtio/virtio_ring.c          |  22 +-
 include/linux/netdevice.h             | 143 ++---
 include/linux/virtio.h                |   2 +
 net/8021q/vlanproc.c                  |   2 +-
 net/xdp/xsk_buff_pool.c               |   2 +-
 14 files changed, 1664 insertions(+), 437 deletions(-)
 create mode 100644 drivers/net/virtio/Kconfig
 create mode 100644 drivers/net/virtio/Makefile
 rename drivers/net/{ => virtio}/virtio_net.c (92%)
 create mode 100644 drivers/net/virtio/virtio_net.h
 create mode 100644 drivers/net/virtio/xsk.c
 create mode 100644 drivers/net/virtio/xsk.h

--
2.31.0


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 01/15] netdevice: priv_flags extend to 64bit
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:21   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:21 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

The size of priv_flags is 32 bits, and the number of flags already in use
has reached 32. It is time to extend priv_flags to 64 bits.

Here priv_flags becomes 8 bytes, but the size of struct net_device does
not change; it is still 2176 bytes, because _tx is aligned on a cache
line. A 4-byte hole is left behind, though.

Since the fields before and after priv_flags are read-mostly, I did not
reorder the fields here.
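
Existing users need no source changes; the flag constants simply become
64-bit (netdev_priv_flags_t) values. For example (illustrative only):

    dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;

    if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR)
            netdev_dbg(dev, "can xmit skbs with empty linear area\n");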

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/netdevice.h | 140 ++++++++++++++++++++------------------
 net/8021q/vlanproc.c      |   2 +-
 2 files changed, 73 insertions(+), 69 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be1dcceda5e4..3202e055b305 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1549,9 +1549,9 @@ struct net_device_ops {
                                                          struct net_device_path *path);
 };
 
+typedef u64 netdev_priv_flags_t;
+
 /**
- * enum netdev_priv_flags - &struct net_device priv_flags
- *
  * These are the &struct net_device, they are only set internally
  * by drivers and used in the kernel. These flags are invisible to
  * userspace; this means that the order of these flags can change
@@ -1597,73 +1597,76 @@ struct net_device_ops {
  * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
  *	skb_headlen(skb) == 0 (data starts from frag0)
  */
-enum netdev_priv_flags {
-	IFF_802_1Q_VLAN			= 1<<0,
-	IFF_EBRIDGE			= 1<<1,
-	IFF_BONDING			= 1<<2,
-	IFF_ISATAP			= 1<<3,
-	IFF_WAN_HDLC			= 1<<4,
-	IFF_XMIT_DST_RELEASE		= 1<<5,
-	IFF_DONT_BRIDGE			= 1<<6,
-	IFF_DISABLE_NETPOLL		= 1<<7,
-	IFF_MACVLAN_PORT		= 1<<8,
-	IFF_BRIDGE_PORT			= 1<<9,
-	IFF_OVS_DATAPATH		= 1<<10,
-	IFF_TX_SKB_SHARING		= 1<<11,
-	IFF_UNICAST_FLT			= 1<<12,
-	IFF_TEAM_PORT			= 1<<13,
-	IFF_SUPP_NOFCS			= 1<<14,
-	IFF_LIVE_ADDR_CHANGE		= 1<<15,
-	IFF_MACVLAN			= 1<<16,
-	IFF_XMIT_DST_RELEASE_PERM	= 1<<17,
-	IFF_L3MDEV_MASTER		= 1<<18,
-	IFF_NO_QUEUE			= 1<<19,
-	IFF_OPENVSWITCH			= 1<<20,
-	IFF_L3MDEV_SLAVE		= 1<<21,
-	IFF_TEAM			= 1<<22,
-	IFF_RXFH_CONFIGURED		= 1<<23,
-	IFF_PHONY_HEADROOM		= 1<<24,
-	IFF_MACSEC			= 1<<25,
-	IFF_NO_RX_HANDLER		= 1<<26,
-	IFF_FAILOVER			= 1<<27,
-	IFF_FAILOVER_SLAVE		= 1<<28,
-	IFF_L3MDEV_RX_HANDLER		= 1<<29,
-	IFF_LIVE_RENAME_OK		= 1<<30,
-	IFF_TX_SKB_NO_LINEAR		= 1<<31,
+enum {
+	IFF_802_1Q_VLAN_BIT,
+	IFF_EBRIDGE_BIT,
+	IFF_BONDING_BIT,
+	IFF_ISATAP_BIT,
+	IFF_WAN_HDLC_BIT,
+	IFF_XMIT_DST_RELEASE_BIT,
+	IFF_DONT_BRIDGE_BIT,
+	IFF_DISABLE_NETPOLL_BIT,
+	IFF_MACVLAN_PORT_BIT,
+	IFF_BRIDGE_PORT_BIT,
+	IFF_OVS_DATAPATH_BIT,
+	IFF_TX_SKB_SHARING_BIT,
+	IFF_UNICAST_FLT_BIT,
+	IFF_TEAM_PORT_BIT,
+	IFF_SUPP_NOFCS_BIT,
+	IFF_LIVE_ADDR_CHANGE_BIT,
+	IFF_MACVLAN_BIT,
+	IFF_XMIT_DST_RELEASE_PERM_BIT,
+	IFF_L3MDEV_MASTER_BIT,
+	IFF_NO_QUEUE_BIT,
+	IFF_OPENVSWITCH_BIT,
+	IFF_L3MDEV_SLAVE_BIT,
+	IFF_TEAM_BIT,
+	IFF_RXFH_CONFIGURED_BIT,
+	IFF_PHONY_HEADROOM_BIT,
+	IFF_MACSEC_BIT,
+	IFF_NO_RX_HANDLER_BIT,
+	IFF_FAILOVER_BIT,
+	IFF_FAILOVER_SLAVE_BIT,
+	IFF_L3MDEV_RX_HANDLER_BIT,
+	IFF_LIVE_RENAME_OK_BIT,
+	IFF_TX_SKB_NO_LINEAR_BIT,
 };
 
-#define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
-#define IFF_EBRIDGE			IFF_EBRIDGE
-#define IFF_BONDING			IFF_BONDING
-#define IFF_ISATAP			IFF_ISATAP
-#define IFF_WAN_HDLC			IFF_WAN_HDLC
-#define IFF_XMIT_DST_RELEASE		IFF_XMIT_DST_RELEASE
-#define IFF_DONT_BRIDGE			IFF_DONT_BRIDGE
-#define IFF_DISABLE_NETPOLL		IFF_DISABLE_NETPOLL
-#define IFF_MACVLAN_PORT		IFF_MACVLAN_PORT
-#define IFF_BRIDGE_PORT			IFF_BRIDGE_PORT
-#define IFF_OVS_DATAPATH		IFF_OVS_DATAPATH
-#define IFF_TX_SKB_SHARING		IFF_TX_SKB_SHARING
-#define IFF_UNICAST_FLT			IFF_UNICAST_FLT
-#define IFF_TEAM_PORT			IFF_TEAM_PORT
-#define IFF_SUPP_NOFCS			IFF_SUPP_NOFCS
-#define IFF_LIVE_ADDR_CHANGE		IFF_LIVE_ADDR_CHANGE
-#define IFF_MACVLAN			IFF_MACVLAN
-#define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
-#define IFF_L3MDEV_MASTER		IFF_L3MDEV_MASTER
-#define IFF_NO_QUEUE			IFF_NO_QUEUE
-#define IFF_OPENVSWITCH			IFF_OPENVSWITCH
-#define IFF_L3MDEV_SLAVE		IFF_L3MDEV_SLAVE
-#define IFF_TEAM			IFF_TEAM
-#define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
-#define IFF_PHONY_HEADROOM		IFF_PHONY_HEADROOM
-#define IFF_MACSEC			IFF_MACSEC
-#define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
-#define IFF_FAILOVER			IFF_FAILOVER
-#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
-#define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
-#define IFF_LIVE_RENAME_OK		IFF_LIVE_RENAME_OK
-#define IFF_TX_SKB_NO_LINEAR		IFF_TX_SKB_NO_LINEAR
+#define __IFF_BIT(bit)		((netdev_priv_flags_t)1 << (bit))
+#define __IFF(name)		__IFF_BIT(IFF_##name##_BIT)
+
+#define IFF_802_1Q_VLAN			__IFF(802_1Q_VLAN)
+#define IFF_EBRIDGE			__IFF(EBRIDGE)
+#define IFF_BONDING			__IFF(BONDING)
+#define IFF_ISATAP			__IFF(ISATAP)
+#define IFF_WAN_HDLC			__IFF(WAN_HDLC)
+#define IFF_XMIT_DST_RELEASE		__IFF(XMIT_DST_RELEASE)
+#define IFF_DONT_BRIDGE			__IFF(DONT_BRIDGE)
+#define IFF_DISABLE_NETPOLL		__IFF(DISABLE_NETPOLL)
+#define IFF_MACVLAN_PORT		__IFF(MACVLAN_PORT)
+#define IFF_BRIDGE_PORT			__IFF(BRIDGE_PORT)
+#define IFF_OVS_DATAPATH		__IFF(OVS_DATAPATH)
+#define IFF_TX_SKB_SHARING		__IFF(TX_SKB_SHARING)
+#define IFF_UNICAST_FLT			__IFF(UNICAST_FLT)
+#define IFF_TEAM_PORT			__IFF(TEAM_PORT)
+#define IFF_SUPP_NOFCS			__IFF(SUPP_NOFCS)
+#define IFF_LIVE_ADDR_CHANGE		__IFF(LIVE_ADDR_CHANGE)
+#define IFF_MACVLAN			__IFF(MACVLAN)
+#define IFF_XMIT_DST_RELEASE_PERM	__IFF(XMIT_DST_RELEASE_PERM)
+#define IFF_L3MDEV_MASTER		__IFF(L3MDEV_MASTER)
+#define IFF_NO_QUEUE			__IFF(NO_QUEUE)
+#define IFF_OPENVSWITCH			__IFF(OPENVSWITCH)
+#define IFF_L3MDEV_SLAVE		__IFF(L3MDEV_SLAVE)
+#define IFF_TEAM			__IFF(TEAM)
+#define IFF_RXFH_CONFIGURED		__IFF(RXFH_CONFIGURED)
+#define IFF_PHONY_HEADROOM		__IFF(PHONY_HEADROOM)
+#define IFF_MACSEC			__IFF(MACSEC)
+#define IFF_NO_RX_HANDLER		__IFF(NO_RX_HANDLER)
+#define IFF_FAILOVER			__IFF(FAILOVER)
+#define IFF_FAILOVER_SLAVE		__IFF(FAILOVER_SLAVE)
+#define IFF_L3MDEV_RX_HANDLER		__IFF(L3MDEV_RX_HANDLER)
+#define IFF_LIVE_RENAME_OK		__IFF(LIVE_RENAME_OK)
+#define IFF_TX_SKB_NO_LINEAR		__IFF(TX_SKB_NO_LINEAR)
 
 /* Specifies the type of the struct net_device::ml_priv pointer */
 enum netdev_ml_priv_type {
@@ -1963,7 +1966,8 @@ struct net_device {
 
 	/* Read-mostly cache-line for fast-path access */
 	unsigned int		flags;
-	unsigned int		priv_flags;
+	/* 4 byte hole */
+	netdev_priv_flags_t	priv_flags;
 	const struct net_device_ops *netdev_ops;
 	int			ifindex;
 	unsigned short		gflags;
diff --git a/net/8021q/vlanproc.c b/net/8021q/vlanproc.c
index ec87dea23719..08bf6c839e25 100644
--- a/net/8021q/vlanproc.c
+++ b/net/8021q/vlanproc.c
@@ -252,7 +252,7 @@ static int vlandev_seq_show(struct seq_file *seq, void *offset)
 
 	stats = dev_get_stats(vlandev, &temp);
 	seq_printf(seq,
-		   "%s  VID: %d	 REORDER_HDR: %i  dev->priv_flags: %hx\n",
+		   "%s  VID: %d	 REORDER_HDR: %i  dev->priv_flags: %llx\n",
 		   vlandev->name, vlan->vlan_id,
 		   (int)(vlan->flags & 1), vlandev->priv_flags);
 
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 02/15] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:21   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:21 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Some devices, such as virtio-net, do not use DMA addresses directly.
Upper-level frameworks such as XDP socket need to be aware of this, so
add a new priv_flag, IFF_NOT_USE_DMA_ADDR.
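
Roughly, a framework can then skip DMA-specific handling for such devices
(illustrative preview; the actual xsk check is added in a later patch of
this series):

    if (!(netdev->priv_flags & IFF_NOT_USE_DMA_ADDR) && !pool->dma_pages) {
            WARN(1, "Driver did not DMA map zero-copy buffers");
            return -EINVAL;
    }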

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3202e055b305..2de0a0c455e5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1596,6 +1596,7 @@ typedef u64 netdev_priv_flags_t;
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
  * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
  *	skb_headlen(skb) == 0 (data starts from frag0)
+ * @IFF_NOT_USE_DMA_ADDR: driver does not use dma addr directly, such as virtio-net
  */
 enum {
 	IFF_802_1Q_VLAN_BIT,
@@ -1630,6 +1631,7 @@ enum {
 	IFF_L3MDEV_RX_HANDLER_BIT,
 	IFF_LIVE_RENAME_OK_BIT,
 	IFF_TX_SKB_NO_LINEAR_BIT,
+	IFF_NOT_USE_DMA_ADDR_BIT,
 };
 
 #define __IFF_BIT(bit)		((netdev_priv_flags_t)1 << (bit))
@@ -1667,6 +1669,7 @@ enum {
 #define IFF_L3MDEV_RX_HANDLER		__IFF(L3MDEV_RX_HANDLER)
 #define IFF_LIVE_RENAME_OK		__IFF(LIVE_RENAME_OK)
 #define IFF_TX_SKB_NO_LINEAR		__IFF(TX_SKB_NO_LINEAR)
+#define IFF_NOT_USE_DMA_ADDR		__IFF(NOT_USE_DMA_ADDR)
 
 /* Specifies the type of the struct net_device::ml_priv pointer */
 enum netdev_ml_priv_type {
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 03/15] virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:21   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:21 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

virtio-net does not use DMA addresses directly, so add the priv_flag
IFF_NOT_USE_DMA_ADDR to it.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0416a7e00914..6c1233f0ab3e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3064,7 +3064,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
-			   IFF_TX_SKB_NO_LINEAR;
+			   IFF_TX_SKB_NO_LINEAR | IFF_NOT_USE_DMA_ADDR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
 
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 04/15] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:21   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:21 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Some devices, such as virtio-net, do not use DMA addresses directly.
These devices do not set up DMA when completing the xsk setup, so skip
the DMA check for them here.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk_buff_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 8de01aaac4a0..a7e434de0308 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -171,7 +171,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		goto err_unreg_pool;
 
-	if (!pool->dma_pages) {
+	if (!(netdev->priv_flags & IFF_NOT_USE_DMA_ADDR) && !pool->dma_pages) {
 		WARN(1, "Driver did not DMA map zero-copy buffers");
 		err = -EINVAL;
 		goto err_unreg_xsk;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 05/15] virtio: support virtqueue_detach_unused_buf_ctx
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:21   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:21 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Support returning the ctx while detaching unused bufs, which makes it
possible to release different kinds of bufs in different ways.
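
A rough usage sketch (illustrative only; is_xsk_ctx() and
virtnet_xsk_ctx_put() are made-up names) of how a driver can now free
different kinds of unused bufs differently:

    void *buf, *ctx;

    while ((buf = virtqueue_detach_unused_buf_ctx(vq, &ctx)) != NULL) {
            if (is_xsk_ctx(ctx))                    /* hypothetical */
                    virtnet_xsk_ctx_put(ctx);       /* hypothetical */
            else
                    put_page(virt_to_head_page(buf));
    }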

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/virtio/virtio_ring.c | 22 +++++++++++++++-------
 include/linux/virtio.h       |  2 ++
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..a3d7ec1c9ea7 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -815,7 +815,8 @@ static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
 	return true;
 }
 
-static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
+static void *virtqueue_detach_unused_buf_ctx_split(struct virtqueue *_vq,
+						   void **ctx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	unsigned int i;
@@ -828,7 +829,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
 			continue;
 		/* detach_buf_split clears data, so grab it now. */
 		buf = vq->split.desc_state[i].data;
-		detach_buf_split(vq, i, NULL);
+		detach_buf_split(vq, i, ctx);
 		vq->split.avail_idx_shadow--;
 		vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
 				vq->split.avail_idx_shadow);
@@ -1526,7 +1527,8 @@ static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 	return true;
 }
 
-static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
+static void *virtqueue_detach_unused_buf_ctx_packed(struct virtqueue *_vq,
+						    void **ctx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	unsigned int i;
@@ -1539,7 +1541,7 @@ static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
 			continue;
 		/* detach_buf clears data, so grab it now. */
 		buf = vq->packed.desc_state[i].data;
-		detach_buf_packed(vq, i, NULL);
+		detach_buf_packed(vq, i, ctx);
 		END_USE(vq);
 		return buf;
 	}
@@ -2018,12 +2020,18 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
  * This is not valid on an active queue; it is useful only for device
  * shutdown.
  */
-void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
+void *virtqueue_detach_unused_buf_ctx(struct virtqueue *_vq, void **ctx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	return vq->packed_ring ? virtqueue_detach_unused_buf_packed(_vq) :
-				 virtqueue_detach_unused_buf_split(_vq);
+	return vq->packed_ring ?
+		virtqueue_detach_unused_buf_ctx_packed(_vq, ctx) :
+		virtqueue_detach_unused_buf_ctx_split(_vq, ctx);
+}
+
+void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
+{
+	return virtqueue_detach_unused_buf_ctx(_vq, NULL);
 }
 EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b1894e0323fa..8aada4d29e04 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -78,6 +78,8 @@ bool virtqueue_poll(struct virtqueue *vq, unsigned);
 
 bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
 
+void *virtqueue_detach_unused_buf_ctx(struct virtqueue *vq, void **ctx);
+
 void *virtqueue_detach_unused_buf(struct virtqueue *vq);
 
 unsigned int virtqueue_get_vring_size(struct virtqueue *vq);
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 06/15] virtio-net: unify the code for recycling the xmit ptr
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

When recycling old xmit buffers there are currently two ptr types,
"skb" and "xdp frame".

They are handled by two almost identical, independent implementations,
which is inconvenient when new types are added later. So extract that
code into a single function and call it uniformly to recycle the old
xmit ptrs.

Rename free_old_xmit_skbs() to free_old_xmit().

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 86 ++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 48 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6c1233f0ab3e..d791543a8dd8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -264,6 +264,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
 	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
 }
 
+static void __free_old_xmit(struct send_queue *sq, bool in_napi,
+			    struct virtnet_sq_stats *stats)
+{
+	unsigned int len;
+	void *ptr;
+
+	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		if (!is_xdp_frame(ptr)) {
+			struct sk_buff *skb = ptr;
+
+			pr_debug("Sent skb %p\n", skb);
+
+			stats->bytes += skb->len;
+			napi_consume_skb(skb, in_napi);
+		} else {
+			struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+			stats->bytes += frame->len;
+			xdp_return_frame(frame);
+		}
+		stats->packets++;
+	}
+}
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -572,15 +596,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 			    int n, struct xdp_frame **frames, u32 flags)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_sq_stats stats = {};
 	struct receive_queue *rq = vi->rq;
 	struct bpf_prog *xdp_prog;
 	struct send_queue *sq;
-	unsigned int len;
-	int packets = 0;
-	int bytes = 0;
 	int nxmit = 0;
 	int kicks = 0;
-	void *ptr;
 	int ret;
 	int i;
 
@@ -599,20 +620,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	}
 
 	/* Free up any pending old buffers before queueing new ones. */
-	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (likely(is_xdp_frame(ptr))) {
-			struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-			bytes += frame->len;
-			xdp_return_frame(frame);
-		} else {
-			struct sk_buff *skb = ptr;
-
-			bytes += skb->len;
-			napi_consume_skb(skb, false);
-		}
-		packets++;
-	}
+	__free_old_xmit(sq, false, &stats);
 
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
@@ -629,8 +637,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	}
 out:
 	u64_stats_update_begin(&sq->stats.syncp);
-	sq->stats.bytes += bytes;
-	sq->stats.packets += packets;
+	sq->stats.bytes += stats.bytes;
+	sq->stats.packets += stats.packets;
 	sq->stats.xdp_tx += n;
 	sq->stats.xdp_tx_drops += n - nxmit;
 	sq->stats.kicks += kicks;
@@ -1459,39 +1467,21 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
 	return stats.packets;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
+static void free_old_xmit(struct send_queue *sq, bool in_napi)
 {
-	unsigned int len;
-	unsigned int packets = 0;
-	unsigned int bytes = 0;
-	void *ptr;
+	struct virtnet_sq_stats stats = {};
 
-	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (likely(!is_xdp_frame(ptr))) {
-			struct sk_buff *skb = ptr;
-
-			pr_debug("Sent skb %p\n", skb);
-
-			bytes += skb->len;
-			napi_consume_skb(skb, in_napi);
-		} else {
-			struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-			bytes += frame->len;
-			xdp_return_frame(frame);
-		}
-		packets++;
-	}
+	__free_old_xmit(sq, in_napi, &stats);
 
 	/* Avoid overhead when no packets have been processed
 	 * happens when called speculatively from start_xmit.
 	 */
-	if (!packets)
+	if (!stats.packets)
 		return;
 
 	u64_stats_update_begin(&sq->stats.syncp);
-	sq->stats.bytes += bytes;
-	sq->stats.packets += packets;
+	sq->stats.bytes += stats.bytes;
+	sq->stats.packets += stats.packets;
 	u64_stats_update_end(&sq->stats.syncp);
 }
 
@@ -1516,7 +1506,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
 		return;
 
 	if (__netif_tx_trylock(txq)) {
-		free_old_xmit_skbs(sq, true);
+		free_old_xmit(sq, true);
 		__netif_tx_unlock(txq);
 	}
 
@@ -1601,7 +1591,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 
 	txq = netdev_get_tx_queue(vi->dev, index);
 	__netif_tx_lock(txq, raw_smp_processor_id());
-	free_old_xmit_skbs(sq, true);
+	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
 	virtqueue_napi_complete(napi, sq->vq, 0);
@@ -1670,7 +1660,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool use_napi = sq->napi.weight;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(sq, false);
+	free_old_xmit(sq, false);
 
 	if (use_napi && kick)
 		virtqueue_enable_cb_delayed(sq->vq);
@@ -1714,7 +1704,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (!use_napi &&
 		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq, false);
+			free_old_xmit(sq, false);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 07/15] virtio-net: standalone virtnet_alloc_frag function
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

This logic is used by both the small and mergeable paths when adding a
buf, and a subsequent patch will also use it, so split it out into a
standalone function.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d791543a8dd8..3fd87bf2b2ad 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -264,6 +264,22 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
 	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
 }
 
+static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
+				int gfp)
+{
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
+
+	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
+		return NULL;
+
+	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	get_page(alloc_frag->page);
+	alloc_frag->offset += len;
+
+	return buf;
+}
+
 static void __free_old_xmit(struct send_queue *sq, bool in_napi,
 			    struct virtnet_sq_stats *stats)
 {
@@ -1190,7 +1206,6 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
 static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
 			     gfp_t gfp)
 {
-	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned int xdp_headroom = virtnet_get_headroom(vi);
 	void *ctx = (void *)(unsigned long)xdp_headroom;
@@ -1199,12 +1214,10 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
 
 	len = SKB_DATA_ALIGN(len) +
 	      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
+	buf = virtnet_alloc_frag(rq, len, gfp);
+	if (unlikely(!buf))
 		return -ENOMEM;
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	get_page(alloc_frag->page);
-	alloc_frag->offset += len;
 	sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom,
 		    vi->hdr_len + GOOD_PACKET_LEN);
 	err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
@@ -1295,13 +1308,11 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
 	 * disabled GSO for XDP, it won't be a big issue.
 	 */
 	len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
-	if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
+	buf = virtnet_alloc_frag(rq, len + room, gfp);
+	if (unlikely(!buf))
 		return -ENOMEM;
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
 	buf += headroom; /* advance address leaving hole at front of pkt */
-	get_page(alloc_frag->page);
-	alloc_frag->offset += len + room;
 	hole = alloc_frag->size - alloc_frag->offset;
 	if (hole < len + room) {
 		/* To avoid internal fragmentation, if there is very likely not
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 08/15] virtio-net: split the receive_mergeable function
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

receive_mergeable() has become too complicated, so split it up here.
One reason is to make the function more readable; the other is that the
two resulting functions will be called separately in subsequent patches.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 181 ++++++++++++++++++++++++---------------
 1 file changed, 111 insertions(+), 70 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3fd87bf2b2ad..989aba600e63 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -733,6 +733,109 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 	return NULL;
 }
 
+static void merge_drop_follow_bufs(struct net_device *dev,
+				   struct receive_queue *rq,
+				   u16 num_buf,
+				   struct virtnet_rq_stats *stats)
+{
+	struct page *page;
+	unsigned int len;
+	void *buf;
+
+	while (num_buf-- > 1) {
+		buf = virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!buf)) {
+			pr_debug("%s: rx error: %d buffers missing\n",
+				 dev->name, num_buf);
+			dev->stats.rx_length_errors++;
+			break;
+		}
+		stats->bytes += len;
+		page = virt_to_head_page(buf);
+		put_page(page);
+	}
+}
+
+static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
+						 struct virtnet_info *vi,
+						 struct receive_queue *rq,
+						 struct sk_buff *head_skb,
+						 u16 num_buf,
+						 struct virtnet_rq_stats *stats)
+{
+	struct sk_buff *curr_skb;
+	unsigned int truesize;
+	unsigned int len, num;
+	struct page *page;
+	void *buf, *ctx;
+	int offset;
+
+	curr_skb = head_skb;
+	num = num_buf;
+
+	while (--num_buf) {
+		int num_skb_frags;
+
+		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
+		if (unlikely(!buf)) {
+			pr_debug("%s: rx error: %d buffers out of %d missing\n",
+				 dev->name, num_buf, num);
+			dev->stats.rx_length_errors++;
+			goto err_buf;
+		}
+
+		stats->bytes += len;
+		page = virt_to_head_page(buf);
+
+		truesize = mergeable_ctx_to_truesize(ctx);
+		if (unlikely(len > truesize)) {
+			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
+				 dev->name, len, (unsigned long)ctx);
+			dev->stats.rx_length_errors++;
+			goto err_skb;
+		}
+
+		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
+		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
+			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
+
+			if (unlikely(!nskb))
+				goto err_skb;
+			if (curr_skb == head_skb)
+				skb_shinfo(curr_skb)->frag_list = nskb;
+			else
+				curr_skb->next = nskb;
+			curr_skb = nskb;
+			head_skb->truesize += nskb->truesize;
+			num_skb_frags = 0;
+		}
+		if (curr_skb != head_skb) {
+			head_skb->data_len += len;
+			head_skb->len += len;
+			head_skb->truesize += truesize;
+		}
+		offset = buf - page_address(page);
+		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
+			put_page(page);
+			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
+					     len, truesize);
+		} else {
+			skb_add_rx_frag(curr_skb, num_skb_frags, page,
+					offset, len, truesize);
+		}
+	}
+
+	return head_skb;
+
+err_skb:
+	put_page(page);
+	merge_drop_follow_bufs(dev, rq, num_buf, stats);
+err_buf:
+	stats->drops++;
+	dev_kfree_skb(head_skb);
+	return NULL;
+}
+
 static struct sk_buff *receive_small(struct net_device *dev,
 				     struct virtnet_info *vi,
 				     struct receive_queue *rq,
@@ -909,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb, *curr_skb;
+	struct sk_buff *head_skb;
 	struct bpf_prog *xdp_prog;
 	unsigned int truesize = mergeable_ctx_to_truesize(ctx);
 	unsigned int headroom = mergeable_ctx_to_headroom(ctx);
@@ -1054,65 +1157,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 
 	head_skb = page_to_skb(vi, rq, page, offset, len, truesize, !xdp_prog,
 			       metasize, !!headroom);
-	curr_skb = head_skb;
-
-	if (unlikely(!curr_skb))
+	if (unlikely(!head_skb))
 		goto err_skb;
-	while (--num_buf) {
-		int num_skb_frags;
 
-		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
-		if (unlikely(!buf)) {
-			pr_debug("%s: rx error: %d buffers out of %d missing\n",
-				 dev->name, num_buf,
-				 virtio16_to_cpu(vi->vdev,
-						 hdr->num_buffers));
-			dev->stats.rx_length_errors++;
-			goto err_buf;
-		}
-
-		stats->bytes += len;
-		page = virt_to_head_page(buf);
-
-		truesize = mergeable_ctx_to_truesize(ctx);
-		if (unlikely(len > truesize)) {
-			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
-				 dev->name, len, (unsigned long)ctx);
-			dev->stats.rx_length_errors++;
-			goto err_skb;
-		}
-
-		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
-		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
-			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
-
-			if (unlikely(!nskb))
-				goto err_skb;
-			if (curr_skb == head_skb)
-				skb_shinfo(curr_skb)->frag_list = nskb;
-			else
-				curr_skb->next = nskb;
-			curr_skb = nskb;
-			head_skb->truesize += nskb->truesize;
-			num_skb_frags = 0;
-		}
-		if (curr_skb != head_skb) {
-			head_skb->data_len += len;
-			head_skb->len += len;
-			head_skb->truesize += truesize;
-		}
-		offset = buf - page_address(page);
-		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
-			put_page(page);
-			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, truesize);
-		} else {
-			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, truesize);
-		}
-	}
+	if (num_buf > 1)
+		head_skb = merge_receive_follow_bufs(dev, vi, rq, head_skb,
+						     num_buf, stats);
+	if (head_skb)
+		ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
 
-	ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
 err_xdp:
@@ -1120,19 +1173,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	stats->xdp_drops++;
 err_skb:
 	put_page(page);
-	while (num_buf-- > 1) {
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
-			pr_debug("%s: rx error: %d buffers missing\n",
-				 dev->name, num_buf);
-			dev->stats.rx_length_errors++;
-			break;
-		}
-		stats->bytes += len;
-		page = virt_to_head_page(buf);
-		put_page(page);
-	}
-err_buf:
+	merge_drop_follow_bufs(dev, rq, num_buf, stats);
 	stats->drops++;
 	dev_kfree_skb(head_skb);
 xdp_xmit:
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 09/15] virtio-net: virtnet_poll_tx support budget check
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Make virtnet_poll_tx() check the work done, like other network drivers.

When work < budget, napi_poll() in dev.c exits directly and
virtqueue_napi_complete() is called to close napi. If closing napi
fails, or there is still data to be processed, virtqueue_napi_complete()
reschedules napi, which does not conflict with the napi_poll() logic.

When work == budget, virtnet_poll_tx() returns the work done, and
napi_poll() in dev.c re-adds napi to the poll list.

The purpose of this patch is to prepare virtnet_poll_tx() for xsk xmit
support in a subsequent patch.
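
For reference, a minimal sketch of the budget contract that a NAPI poll
handler is expected to follow (illustrative only; my_poll() and
my_clean_tx() are made-up names, not part of this patch):

	static int my_poll(struct napi_struct *napi, int budget)
	{
		/* Process at most 'budget' units of work. */
		int work_done = my_clean_tx(napi, budget);

		/* Complete napi only when the budget was not exhausted;
		 * otherwise the core keeps polling and reschedules us.
		 */
		if (work_done < budget)
			napi_complete_done(napi, work_done);

		return work_done;
	}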

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 989aba600e63..953739860563 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1634,6 +1634,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	unsigned int index = vq2txq(sq->vq);
 	struct netdev_queue *txq;
+	int work_done = 0;
 
 	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
 		/* We don't need to enable cb for XDP */
@@ -1646,12 +1647,13 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
-	virtqueue_napi_complete(napi, sq->vq, 0);
+	if (work_done < budget)
+		virtqueue_napi_complete(napi, sq->vq, 0);
 
 	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
 		netif_tx_wake_queue(txq);
 
-	return 0;
+	return work_done;
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 10/15] virtio-net: independent directory
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Create a separate directory for virtio-net. AF_XDP support will be
added later in a separate xsk.c file.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 MAINTAINERS                           |  2 +-
 drivers/net/Kconfig                   |  8 +-------
 drivers/net/Makefile                  |  2 +-
 drivers/net/virtio/Kconfig            | 11 +++++++++++
 drivers/net/virtio/Makefile           |  6 ++++++
 drivers/net/{ => virtio}/virtio_net.c |  0
 6 files changed, 20 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/virtio/Kconfig
 create mode 100644 drivers/net/virtio/Makefile
 rename drivers/net/{ => virtio}/virtio_net.c (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index e69c1991ec3b..2041267f19f1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19344,7 +19344,7 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/virtio/
 F:	drivers/block/virtio_blk.c
 F:	drivers/crypto/virtio/
-F:	drivers/net/virtio_net.c
+F:	drivers/net/virtio/
 F:	drivers/vdpa/
 F:	drivers/virtio/
 F:	include/linux/vdpa.h
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 4da68ba8448f..2297fe4183ae 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -392,13 +392,7 @@ config VETH
 	  When one end receives the packet it appears on its pair and vice
 	  versa.
 
-config VIRTIO_NET
-	tristate "Virtio network driver"
-	depends on VIRTIO
-	select NET_FAILOVER
-	help
-	  This is the virtual network driver for virtio.  It can be used with
-	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
+source "drivers/net/virtio/Kconfig"
 
 config NLMON
 	tristate "Virtual netlink monitoring device"
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 7ffd2d03efaf..c4c7419e0398 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -28,7 +28,7 @@ obj-$(CONFIG_NET_TEAM) += team/
 obj-$(CONFIG_TUN) += tun.o
 obj-$(CONFIG_TAP) += tap.o
 obj-$(CONFIG_VETH) += veth.o
-obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VIRTIO_NET) += virtio/
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_BAREUDP) += bareudp.o
diff --git a/drivers/net/virtio/Kconfig b/drivers/net/virtio/Kconfig
new file mode 100644
index 000000000000..9bc2a2fc6c3e
--- /dev/null
+++ b/drivers/net/virtio/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# virtio-net device configuration
+#
+config VIRTIO_NET
+	tristate "Virtio network driver"
+	depends on VIRTIO
+	select NET_FAILOVER
+	help
+	  This is the virtual network driver for virtio.  It can be used with
+	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
new file mode 100644
index 000000000000..ccc80f40f33a
--- /dev/null
+++ b/drivers/net/virtio/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the virtio network device drivers.
+#
+
+obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio/virtio_net.c
similarity index 100%
rename from drivers/net/virtio_net.c
rename to drivers/net/virtio/virtio_net.c
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 11/15] virtio-net: move to virtio_net.h
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Move some structure definitions and inline functions into a new
virtio_net.h header file.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio/virtio_net.c | 225 +------------------------------
 drivers/net/virtio/virtio_net.h | 230 ++++++++++++++++++++++++++++++++
 2 files changed, 232 insertions(+), 223 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_net.h

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 953739860563..395ec1f18331 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -4,24 +4,8 @@
  * Copyright 2007 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
  */
 //#define DEBUG
-#include <linux/netdevice.h>
-#include <linux/etherdevice.h>
-#include <linux/ethtool.h>
-#include <linux/module.h>
-#include <linux/virtio.h>
-#include <linux/virtio_net.h>
-#include <linux/bpf.h>
-#include <linux/bpf_trace.h>
-#include <linux/scatterlist.h>
-#include <linux/if_vlan.h>
-#include <linux/slab.h>
-#include <linux/cpu.h>
-#include <linux/average.h>
-#include <linux/filter.h>
-#include <linux/kernel.h>
-#include <net/route.h>
-#include <net/xdp.h>
-#include <net/net_failover.h>
+
+#include "virtio_net.h"
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -44,15 +28,6 @@ module_param(napi_tx, bool, 0644);
 #define VIRTIO_XDP_TX		BIT(0)
 #define VIRTIO_XDP_REDIR	BIT(1)
 
-#define VIRTIO_XDP_FLAG	BIT(0)
-
-/* RX packet size EWMA. The average packet size is used to determine the packet
- * buffer size when refilling RX rings. As the entire RX ring may be refilled
- * at once, the weight is chosen so that the EWMA will be insensitive to short-
- * term, transient changes in packet size.
- */
-DECLARE_EWMA(pkt_len, 0, 64)
-
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 static const unsigned long guest_offloads[] = {
@@ -68,35 +43,6 @@ static const unsigned long guest_offloads[] = {
 				(1ULL << VIRTIO_NET_F_GUEST_ECN)  | \
 				(1ULL << VIRTIO_NET_F_GUEST_UFO))
 
-struct virtnet_stat_desc {
-	char desc[ETH_GSTRING_LEN];
-	size_t offset;
-};
-
-struct virtnet_sq_stats {
-	struct u64_stats_sync syncp;
-	u64 packets;
-	u64 bytes;
-	u64 xdp_tx;
-	u64 xdp_tx_drops;
-	u64 kicks;
-};
-
-struct virtnet_rq_stats {
-	struct u64_stats_sync syncp;
-	u64 packets;
-	u64 bytes;
-	u64 drops;
-	u64 xdp_packets;
-	u64 xdp_tx;
-	u64 xdp_redirects;
-	u64 xdp_drops;
-	u64 kicks;
-};
-
-#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
-#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
-
 static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = {
 	{ "packets",		VIRTNET_SQ_STAT(packets) },
 	{ "bytes",		VIRTNET_SQ_STAT(bytes) },
@@ -119,54 +65,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = {
 #define VIRTNET_SQ_STATS_LEN	ARRAY_SIZE(virtnet_sq_stats_desc)
 #define VIRTNET_RQ_STATS_LEN	ARRAY_SIZE(virtnet_rq_stats_desc)
 
-/* Internal representation of a send virtqueue */
-struct send_queue {
-	/* Virtqueue associated with this send _queue */
-	struct virtqueue *vq;
-
-	/* TX: fragments + linear part + virtio header */
-	struct scatterlist sg[MAX_SKB_FRAGS + 2];
-
-	/* Name of the send queue: output.$index */
-	char name[40];
-
-	struct virtnet_sq_stats stats;
-
-	struct napi_struct napi;
-};
-
-/* Internal representation of a receive virtqueue */
-struct receive_queue {
-	/* Virtqueue associated with this receive_queue */
-	struct virtqueue *vq;
-
-	struct napi_struct napi;
-
-	struct bpf_prog __rcu *xdp_prog;
-
-	struct virtnet_rq_stats stats;
-
-	/* Chain pages by the private ptr. */
-	struct page *pages;
-
-	/* Average packet length for mergeable receive buffers. */
-	struct ewma_pkt_len mrg_avg_pkt_len;
-
-	/* Page frag for packet buffer allocation. */
-	struct page_frag alloc_frag;
-
-	/* RX: fragments + linear part + virtio header */
-	struct scatterlist sg[MAX_SKB_FRAGS + 2];
-
-	/* Min single buffer size for mergeable buffers case. */
-	unsigned int min_buf_len;
-
-	/* Name of this receive queue: input.$index */
-	char name[40];
-
-	struct xdp_rxq_info xdp_rxq;
-};
-
 /* Control VQ buffers: protected by the rtnl lock */
 struct control_buf {
 	struct virtio_net_ctrl_hdr hdr;
@@ -178,67 +76,6 @@ struct control_buf {
 	__virtio64 offloads;
 };
 
-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *cvq;
-	struct net_device *dev;
-	struct send_queue *sq;
-	struct receive_queue *rq;
-	unsigned int status;
-
-	/* Max # of queue pairs supported by the device */
-	u16 max_queue_pairs;
-
-	/* # of queue pairs currently used by the driver */
-	u16 curr_queue_pairs;
-
-	/* # of XDP queue pairs currently used by the driver */
-	u16 xdp_queue_pairs;
-
-	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
-	bool xdp_enabled;
-
-	/* I like... big packets and I cannot lie! */
-	bool big_packets;
-
-	/* Host will merge rx buffers for big packets (shake it! shake it!) */
-	bool mergeable_rx_bufs;
-
-	/* Has control virtqueue */
-	bool has_cvq;
-
-	/* Host can handle any s/g split between our header and packet data */
-	bool any_header_sg;
-
-	/* Packet virtio header size */
-	u8 hdr_len;
-
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
-	/* Work struct for config space updates */
-	struct work_struct config_work;
-
-	/* Does the affinity hint is set for virtqueues? */
-	bool affinity_hint_set;
-
-	/* CPU hotplug instances for online & dead */
-	struct hlist_node node;
-	struct hlist_node node_dead;
-
-	struct control_buf *ctrl;
-
-	/* Ethtool settings */
-	u8 duplex;
-	u32 speed;
-
-	unsigned long guest_offloads;
-	unsigned long guest_offloads_capable;
-
-	/* failover when STANDBY feature enabled */
-	struct failover *failover;
-};
-
 struct padded_vnet_hdr {
 	struct virtio_net_hdr_mrg_rxbuf hdr;
 	/*
@@ -249,21 +86,6 @@ struct padded_vnet_hdr {
 	char padding[4];
 };
 
-static bool is_xdp_frame(void *ptr)
-{
-	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
-}
-
-static void *xdp_to_ptr(struct xdp_frame *ptr)
-{
-	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
-}
-
-static struct xdp_frame *ptr_to_xdp(void *ptr)
-{
-	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
-}
-
 static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
 				int gfp)
 {
@@ -280,30 +102,6 @@ static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
 	return buf;
 }
 
-static void __free_old_xmit(struct send_queue *sq, bool in_napi,
-			    struct virtnet_sq_stats *stats)
-{
-	unsigned int len;
-	void *ptr;
-
-	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (!is_xdp_frame(ptr)) {
-			struct sk_buff *skb = ptr;
-
-			pr_debug("Sent skb %p\n", skb);
-
-			stats->bytes += skb->len;
-			napi_consume_skb(skb, in_napi);
-		} else {
-			struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-			stats->bytes += frame->len;
-			xdp_return_frame(frame);
-		}
-		stats->packets++;
-	}
-}
-
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -359,15 +157,6 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
-static void virtqueue_napi_schedule(struct napi_struct *napi,
-				    struct virtqueue *vq)
-{
-	if (napi_schedule_prep(napi)) {
-		virtqueue_disable_cb(vq);
-		__napi_schedule(napi);
-	}
-}
-
 static void virtqueue_napi_complete(struct napi_struct *napi,
 				    struct virtqueue *vq, int processed)
 {
@@ -1537,16 +1326,6 @@ static void free_old_xmit(struct send_queue *sq, bool in_napi)
 	u64_stats_update_end(&sq->stats.syncp);
 }
 
-static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
-{
-	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
-		return false;
-	else if (q < vi->curr_queue_pairs)
-		return true;
-	else
-		return false;
-}
-
 static void virtnet_poll_cleantx(struct receive_queue *rq)
 {
 	struct virtnet_info *vi = rq->vq->vdev->priv;
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
new file mode 100644
index 000000000000..931cc81f92fb
--- /dev/null
+++ b/drivers/net/virtio/virtio_net.h
@@ -0,0 +1,230 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef __VIRTIO_NET_H__
+#define __VIRTIO_NET_H__
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/virtio_net.h>
+#include <linux/bpf.h>
+#include <linux/bpf_trace.h>
+#include <linux/scatterlist.h>
+#include <linux/if_vlan.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/average.h>
+#include <linux/filter.h>
+#include <linux/kernel.h>
+#include <net/route.h>
+#include <net/xdp.h>
+#include <net/net_failover.h>
+#include <net/xdp_sock_drv.h>
+
+#define VIRTIO_XDP_FLAG	BIT(0)
+
+struct virtnet_info {
+	struct virtio_device *vdev;
+	struct virtqueue *cvq;
+	struct net_device *dev;
+	struct send_queue *sq;
+	struct receive_queue *rq;
+	unsigned int status;
+
+	/* Max # of queue pairs supported by the device */
+	u16 max_queue_pairs;
+
+	/* # of queue pairs currently used by the driver */
+	u16 curr_queue_pairs;
+
+	/* # of XDP queue pairs currently used by the driver */
+	u16 xdp_queue_pairs;
+
+	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
+	bool xdp_enabled;
+
+	/* I like... big packets and I cannot lie! */
+	bool big_packets;
+
+	/* Host will merge rx buffers for big packets (shake it! shake it!) */
+	bool mergeable_rx_bufs;
+
+	/* Has control virtqueue */
+	bool has_cvq;
+
+	/* Host can handle any s/g split between our header and packet data */
+	bool any_header_sg;
+
+	/* Packet virtio header size */
+	u8 hdr_len;
+
+	/* Work struct for refilling if we run low on memory. */
+	struct delayed_work refill;
+
+	/* Work struct for config space updates */
+	struct work_struct config_work;
+
+	/* Does the affinity hint is set for virtqueues? */
+	bool affinity_hint_set;
+
+	/* CPU hotplug instances for online & dead */
+	struct hlist_node node;
+	struct hlist_node node_dead;
+
+	struct control_buf *ctrl;
+
+	/* Ethtool settings */
+	u8 duplex;
+	u32 speed;
+
+	unsigned long guest_offloads;
+	unsigned long guest_offloads_capable;
+
+	/* failover when STANDBY feature enabled */
+	struct failover *failover;
+};
+
+/* RX packet size EWMA. The average packet size is used to determine the packet
+ * buffer size when refilling RX rings. As the entire RX ring may be refilled
+ * at once, the weight is chosen so that the EWMA will be insensitive to short-
+ * term, transient changes in packet size.
+ */
+DECLARE_EWMA(pkt_len, 0, 64)
+
+struct virtnet_stat_desc {
+	char desc[ETH_GSTRING_LEN];
+	size_t offset;
+};
+
+struct virtnet_sq_stats {
+	struct u64_stats_sync syncp;
+	u64 packets;
+	u64 bytes;
+	u64 xdp_tx;
+	u64 xdp_tx_drops;
+	u64 kicks;
+};
+
+struct virtnet_rq_stats {
+	struct u64_stats_sync syncp;
+	u64 packets;
+	u64 bytes;
+	u64 drops;
+	u64 xdp_packets;
+	u64 xdp_tx;
+	u64 xdp_redirects;
+	u64 xdp_drops;
+	u64 kicks;
+};
+
+#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
+#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
+
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *vq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	/* Name of the send queue: output.$index */
+	char name[40];
+
+	struct virtnet_sq_stats stats;
+
+	struct napi_struct napi;
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+	/* Virtqueue associated with this receive_queue */
+	struct virtqueue *vq;
+
+	struct napi_struct napi;
+
+	struct bpf_prog __rcu *xdp_prog;
+
+	struct virtnet_rq_stats stats;
+
+	/* Chain pages by the private ptr. */
+	struct page *pages;
+
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma_pkt_len mrg_avg_pkt_len;
+
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
+	/* RX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	/* Min single buffer size for mergeable buffers case. */
+	unsigned int min_buf_len;
+
+	/* Name of this receive queue: input.$index */
+	char name[40];
+
+	struct xdp_rxq_info xdp_rxq;
+};
+
+static inline bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
+{
+	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
+		return false;
+	else if (q < vi->curr_queue_pairs)
+		return true;
+	else
+		return false;
+}
+
+static inline void virtqueue_napi_schedule(struct napi_struct *napi,
+					   struct virtqueue *vq)
+{
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
+	}
+}
+
+static inline bool is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
+}
+
+static inline void *xdp_to_ptr(struct xdp_frame *ptr)
+{
+	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
+}
+
+static inline struct xdp_frame *ptr_to_xdp(void *ptr)
+{
+	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
+}
+
+static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
+				   struct virtnet_sq_stats *stats)
+{
+	unsigned int len;
+	void *ptr;
+
+	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		if (!is_xdp_frame(ptr)) {
+			struct sk_buff *skb = ptr;
+
+			pr_debug("Sent skb %p\n", skb);
+
+			stats->bytes += skb->len;
+			napi_consume_skb(skb, in_napi);
+		} else {
+			struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+			stats->bytes += frame->len;
+			xdp_return_frame(frame);
+		}
+		stats->packets++;
+	}
+}
+
+#endif
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 11/15] virtio-net: move to virtio_net.h
@ 2021-06-10  8:22   ` Xuan Zhuo
  0 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

Move some structure definitions and inline functions into the
virtio_net.h file.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio/virtio_net.c | 225 +------------------------------
 drivers/net/virtio/virtio_net.h | 230 ++++++++++++++++++++++++++++++++
 2 files changed, 232 insertions(+), 223 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_net.h

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 953739860563..395ec1f18331 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -4,24 +4,8 @@
  * Copyright 2007 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
  */
 //#define DEBUG
-#include <linux/netdevice.h>
-#include <linux/etherdevice.h>
-#include <linux/ethtool.h>
-#include <linux/module.h>
-#include <linux/virtio.h>
-#include <linux/virtio_net.h>
-#include <linux/bpf.h>
-#include <linux/bpf_trace.h>
-#include <linux/scatterlist.h>
-#include <linux/if_vlan.h>
-#include <linux/slab.h>
-#include <linux/cpu.h>
-#include <linux/average.h>
-#include <linux/filter.h>
-#include <linux/kernel.h>
-#include <net/route.h>
-#include <net/xdp.h>
-#include <net/net_failover.h>
+
+#include "virtio_net.h"
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -44,15 +28,6 @@ module_param(napi_tx, bool, 0644);
 #define VIRTIO_XDP_TX		BIT(0)
 #define VIRTIO_XDP_REDIR	BIT(1)
 
-#define VIRTIO_XDP_FLAG	BIT(0)
-
-/* RX packet size EWMA. The average packet size is used to determine the packet
- * buffer size when refilling RX rings. As the entire RX ring may be refilled
- * at once, the weight is chosen so that the EWMA will be insensitive to short-
- * term, transient changes in packet size.
- */
-DECLARE_EWMA(pkt_len, 0, 64)
-
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 static const unsigned long guest_offloads[] = {
@@ -68,35 +43,6 @@ static const unsigned long guest_offloads[] = {
 				(1ULL << VIRTIO_NET_F_GUEST_ECN)  | \
 				(1ULL << VIRTIO_NET_F_GUEST_UFO))
 
-struct virtnet_stat_desc {
-	char desc[ETH_GSTRING_LEN];
-	size_t offset;
-};
-
-struct virtnet_sq_stats {
-	struct u64_stats_sync syncp;
-	u64 packets;
-	u64 bytes;
-	u64 xdp_tx;
-	u64 xdp_tx_drops;
-	u64 kicks;
-};
-
-struct virtnet_rq_stats {
-	struct u64_stats_sync syncp;
-	u64 packets;
-	u64 bytes;
-	u64 drops;
-	u64 xdp_packets;
-	u64 xdp_tx;
-	u64 xdp_redirects;
-	u64 xdp_drops;
-	u64 kicks;
-};
-
-#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
-#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
-
 static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = {
 	{ "packets",		VIRTNET_SQ_STAT(packets) },
 	{ "bytes",		VIRTNET_SQ_STAT(bytes) },
@@ -119,54 +65,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = {
 #define VIRTNET_SQ_STATS_LEN	ARRAY_SIZE(virtnet_sq_stats_desc)
 #define VIRTNET_RQ_STATS_LEN	ARRAY_SIZE(virtnet_rq_stats_desc)
 
-/* Internal representation of a send virtqueue */
-struct send_queue {
-	/* Virtqueue associated with this send _queue */
-	struct virtqueue *vq;
-
-	/* TX: fragments + linear part + virtio header */
-	struct scatterlist sg[MAX_SKB_FRAGS + 2];
-
-	/* Name of the send queue: output.$index */
-	char name[40];
-
-	struct virtnet_sq_stats stats;
-
-	struct napi_struct napi;
-};
-
-/* Internal representation of a receive virtqueue */
-struct receive_queue {
-	/* Virtqueue associated with this receive_queue */
-	struct virtqueue *vq;
-
-	struct napi_struct napi;
-
-	struct bpf_prog __rcu *xdp_prog;
-
-	struct virtnet_rq_stats stats;
-
-	/* Chain pages by the private ptr. */
-	struct page *pages;
-
-	/* Average packet length for mergeable receive buffers. */
-	struct ewma_pkt_len mrg_avg_pkt_len;
-
-	/* Page frag for packet buffer allocation. */
-	struct page_frag alloc_frag;
-
-	/* RX: fragments + linear part + virtio header */
-	struct scatterlist sg[MAX_SKB_FRAGS + 2];
-
-	/* Min single buffer size for mergeable buffers case. */
-	unsigned int min_buf_len;
-
-	/* Name of this receive queue: input.$index */
-	char name[40];
-
-	struct xdp_rxq_info xdp_rxq;
-};
-
 /* Control VQ buffers: protected by the rtnl lock */
 struct control_buf {
 	struct virtio_net_ctrl_hdr hdr;
@@ -178,67 +76,6 @@ struct control_buf {
 	__virtio64 offloads;
 };
 
-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *cvq;
-	struct net_device *dev;
-	struct send_queue *sq;
-	struct receive_queue *rq;
-	unsigned int status;
-
-	/* Max # of queue pairs supported by the device */
-	u16 max_queue_pairs;
-
-	/* # of queue pairs currently used by the driver */
-	u16 curr_queue_pairs;
-
-	/* # of XDP queue pairs currently used by the driver */
-	u16 xdp_queue_pairs;
-
-	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
-	bool xdp_enabled;
-
-	/* I like... big packets and I cannot lie! */
-	bool big_packets;
-
-	/* Host will merge rx buffers for big packets (shake it! shake it!) */
-	bool mergeable_rx_bufs;
-
-	/* Has control virtqueue */
-	bool has_cvq;
-
-	/* Host can handle any s/g split between our header and packet data */
-	bool any_header_sg;
-
-	/* Packet virtio header size */
-	u8 hdr_len;
-
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
-	/* Work struct for config space updates */
-	struct work_struct config_work;
-
-	/* Does the affinity hint is set for virtqueues? */
-	bool affinity_hint_set;
-
-	/* CPU hotplug instances for online & dead */
-	struct hlist_node node;
-	struct hlist_node node_dead;
-
-	struct control_buf *ctrl;
-
-	/* Ethtool settings */
-	u8 duplex;
-	u32 speed;
-
-	unsigned long guest_offloads;
-	unsigned long guest_offloads_capable;
-
-	/* failover when STANDBY feature enabled */
-	struct failover *failover;
-};
-
 struct padded_vnet_hdr {
 	struct virtio_net_hdr_mrg_rxbuf hdr;
 	/*
@@ -249,21 +86,6 @@ struct padded_vnet_hdr {
 	char padding[4];
 };
 
-static bool is_xdp_frame(void *ptr)
-{
-	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
-}
-
-static void *xdp_to_ptr(struct xdp_frame *ptr)
-{
-	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
-}
-
-static struct xdp_frame *ptr_to_xdp(void *ptr)
-{
-	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
-}
-
 static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
 				int gfp)
 {
@@ -280,30 +102,6 @@ static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
 	return buf;
 }
 
-static void __free_old_xmit(struct send_queue *sq, bool in_napi,
-			    struct virtnet_sq_stats *stats)
-{
-	unsigned int len;
-	void *ptr;
-
-	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (!is_xdp_frame(ptr)) {
-			struct sk_buff *skb = ptr;
-
-			pr_debug("Sent skb %p\n", skb);
-
-			stats->bytes += skb->len;
-			napi_consume_skb(skb, in_napi);
-		} else {
-			struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-			stats->bytes += frame->len;
-			xdp_return_frame(frame);
-		}
-		stats->packets++;
-	}
-}
-
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -359,15 +157,6 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
-static void virtqueue_napi_schedule(struct napi_struct *napi,
-				    struct virtqueue *vq)
-{
-	if (napi_schedule_prep(napi)) {
-		virtqueue_disable_cb(vq);
-		__napi_schedule(napi);
-	}
-}
-
 static void virtqueue_napi_complete(struct napi_struct *napi,
 				    struct virtqueue *vq, int processed)
 {
@@ -1537,16 +1326,6 @@ static void free_old_xmit(struct send_queue *sq, bool in_napi)
 	u64_stats_update_end(&sq->stats.syncp);
 }
 
-static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
-{
-	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
-		return false;
-	else if (q < vi->curr_queue_pairs)
-		return true;
-	else
-		return false;
-}
-
 static void virtnet_poll_cleantx(struct receive_queue *rq)
 {
 	struct virtnet_info *vi = rq->vq->vdev->priv;
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
new file mode 100644
index 000000000000..931cc81f92fb
--- /dev/null
+++ b/drivers/net/virtio/virtio_net.h
@@ -0,0 +1,230 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef __VIRTIO_NET_H__
+#define __VIRTIO_NET_H__
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/virtio_net.h>
+#include <linux/bpf.h>
+#include <linux/bpf_trace.h>
+#include <linux/scatterlist.h>
+#include <linux/if_vlan.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/average.h>
+#include <linux/filter.h>
+#include <linux/kernel.h>
+#include <net/route.h>
+#include <net/xdp.h>
+#include <net/net_failover.h>
+#include <net/xdp_sock_drv.h>
+
+#define VIRTIO_XDP_FLAG	BIT(0)
+
+struct virtnet_info {
+	struct virtio_device *vdev;
+	struct virtqueue *cvq;
+	struct net_device *dev;
+	struct send_queue *sq;
+	struct receive_queue *rq;
+	unsigned int status;
+
+	/* Max # of queue pairs supported by the device */
+	u16 max_queue_pairs;
+
+	/* # of queue pairs currently used by the driver */
+	u16 curr_queue_pairs;
+
+	/* # of XDP queue pairs currently used by the driver */
+	u16 xdp_queue_pairs;
+
+	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
+	bool xdp_enabled;
+
+	/* I like... big packets and I cannot lie! */
+	bool big_packets;
+
+	/* Host will merge rx buffers for big packets (shake it! shake it!) */
+	bool mergeable_rx_bufs;
+
+	/* Has control virtqueue */
+	bool has_cvq;
+
+	/* Host can handle any s/g split between our header and packet data */
+	bool any_header_sg;
+
+	/* Packet virtio header size */
+	u8 hdr_len;
+
+	/* Work struct for refilling if we run low on memory. */
+	struct delayed_work refill;
+
+	/* Work struct for config space updates */
+	struct work_struct config_work;
+
+	/* Does the affinity hint is set for virtqueues? */
+	bool affinity_hint_set;
+
+	/* CPU hotplug instances for online & dead */
+	struct hlist_node node;
+	struct hlist_node node_dead;
+
+	struct control_buf *ctrl;
+
+	/* Ethtool settings */
+	u8 duplex;
+	u32 speed;
+
+	unsigned long guest_offloads;
+	unsigned long guest_offloads_capable;
+
+	/* failover when STANDBY feature enabled */
+	struct failover *failover;
+};
+
+/* RX packet size EWMA. The average packet size is used to determine the packet
+ * buffer size when refilling RX rings. As the entire RX ring may be refilled
+ * at once, the weight is chosen so that the EWMA will be insensitive to short-
+ * term, transient changes in packet size.
+ */
+DECLARE_EWMA(pkt_len, 0, 64)
+
+struct virtnet_stat_desc {
+	char desc[ETH_GSTRING_LEN];
+	size_t offset;
+};
+
+struct virtnet_sq_stats {
+	struct u64_stats_sync syncp;
+	u64 packets;
+	u64 bytes;
+	u64 xdp_tx;
+	u64 xdp_tx_drops;
+	u64 kicks;
+};
+
+struct virtnet_rq_stats {
+	struct u64_stats_sync syncp;
+	u64 packets;
+	u64 bytes;
+	u64 drops;
+	u64 xdp_packets;
+	u64 xdp_tx;
+	u64 xdp_redirects;
+	u64 xdp_drops;
+	u64 kicks;
+};
+
+#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
+#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
+
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *vq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	/* Name of the send queue: output.$index */
+	char name[40];
+
+	struct virtnet_sq_stats stats;
+
+	struct napi_struct napi;
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+	/* Virtqueue associated with this receive_queue */
+	struct virtqueue *vq;
+
+	struct napi_struct napi;
+
+	struct bpf_prog __rcu *xdp_prog;
+
+	struct virtnet_rq_stats stats;
+
+	/* Chain pages by the private ptr. */
+	struct page *pages;
+
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma_pkt_len mrg_avg_pkt_len;
+
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
+	/* RX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	/* Min single buffer size for mergeable buffers case. */
+	unsigned int min_buf_len;
+
+	/* Name of this receive queue: input.$index */
+	char name[40];
+
+	struct xdp_rxq_info xdp_rxq;
+};
+
+static inline bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
+{
+	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
+		return false;
+	else if (q < vi->curr_queue_pairs)
+		return true;
+	else
+		return false;
+}
+
+static inline void virtqueue_napi_schedule(struct napi_struct *napi,
+					   struct virtqueue *vq)
+{
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
+	}
+}
+
+static inline bool is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
+}
+
+static inline void *xdp_to_ptr(struct xdp_frame *ptr)
+{
+	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
+}
+
+static inline struct xdp_frame *ptr_to_xdp(void *ptr)
+{
+	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
+}
+
+static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
+				   struct virtnet_sq_stats *stats)
+{
+	unsigned int len;
+	void *ptr;
+
+	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		if (!is_xdp_frame(ptr)) {
+			struct sk_buff *skb = ptr;
+
+			pr_debug("Sent skb %p\n", skb);
+
+			stats->bytes += skb->len;
+			napi_consume_skb(skb, in_napi);
+		} else {
+			struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+			stats->bytes += frame->len;
+			xdp_return_frame(frame);
+		}
+		stats->packets++;
+	}
+}
+
+#endif
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

AF_XDP (XDP socket, xsk) is a high-performance packet receiving and
sending technology.

This patch implements the binding and unbinding operations of xsk and
the virtio-net queue for xsk zero copy xmit.

The xsk zero copy xmit depends on tx napi, because the actual sending
of data is done in the tx napi context. If tx napi does not run,
the data in the xsk tx queue will not be sent, so binding an xsk
returns an error when tx napi is not enabled.

If xsk is active, it will prevent ethtool from modifying tx napi.

When reclaiming a ptr, a new ptr type is added; the type is
distinguished by the two low-order bits of the ptr (see the sketch
after this list):
00: skb
01: xdp frame
10: xsk xmit ptr
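
For illustration only, here is a condensed sketch of how the completion
path dispatches on those two tag bits; it mirrors the __free_old_xmit()
helper added by this patch and is not the literal code:

	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
		if (is_skb_ptr(ptr))            /* 00: plain skb */
			napi_consume_skb((struct sk_buff *)ptr, in_napi);
		else if (is_xdp_frame(ptr))     /* 01: xdp frame */
			xdp_return_frame(ptr_to_xdp(ptr));
		else                            /* 10: xsk xmit ctx */
			virtnet_xsk_ctx_tx_put(ptr_to_xsk(ptr));
	}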

All xsk packets sent share the single virtio-net header xsk_hdr. If xsk
needs to support csum and other features later, consider allocating a
separate hdr for each sent packet.

Unlike other physical network cards, which can reinitialize a channel
when an xsk is bound, virtio does not support resetting an individual
channel; only the entire device can be reset, and I think resetting the
whole device here is not appropriate. So the situation becomes a bit
more complicated: we have to consider how to deal with the buffers still
referenced in the vq after the xsk is unbound.

When an xsk is bound, I allocate ring-size struct virtnet_xsk_ctx
entries. Each xsk buffer added to the vq corresponds to one ctx. The ctx
records the page where the xsk buffer is located and takes a reference
on that page. When the buffer is recycled, that page reference is
dropped. Once the xsk has been unbound and all related xsk buffers have
been recycled, all ctx are released (see the lifecycle sketch below).
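
A rough sketch of that ctx lifecycle, condensed from the
virtnet_xsk_ctx_tx_get()/virtnet_xsk_ctx_tx_put() helpers added in this
patch (not the literal code; error handling omitted):

	/* bind: preallocate ring_size ctx entries chained off a head */
	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);  /* head->active = true */

	/* xmit: take a ctx, pin the page backing the desc, and use the
	 * tagged ctx pointer as the vq token
	 */
	ctx = virtnet_xsk_ctx_tx_get(sq->xsk.ctx_head);           /* head->ref++ */
	ctx->ctx.page = vmalloc_to_page(xsk_buff_raw_get_data(pool, desc->addr));
	get_page(ctx->ctx.page);
	virtqueue_add_outbuf(sq->vq, sq->sg, n, xsk_to_ptr(ctx), GFP_ATOMIC);

	/* completion (or free_unused_bufs() after unbind): unpin, recycle */
	virtnet_xsk_ctx_tx_put(ctx);                              /* put_page(), head->ref-- */

	/* unbind: if ref != 0, only mark head inactive; the last put frees it */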

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 drivers/net/virtio/Makefile     |   1 +
 drivers/net/virtio/virtio_net.c |  20 +-
 drivers/net/virtio/virtio_net.h |  37 +++-
 drivers/net/virtio/xsk.c        | 346 ++++++++++++++++++++++++++++++++
 drivers/net/virtio/xsk.h        |  99 +++++++++
 5 files changed, 497 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/virtio/xsk.c
 create mode 100644 drivers/net/virtio/xsk.h

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index ccc80f40f33a..db79d2e7925f 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VIRTIO_NET) += xsk.o
diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 395ec1f18331..40d7751f1c5f 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 
 	txq = netdev_get_tx_queue(vi->dev, index);
 	__netif_tx_lock(txq, raw_smp_processor_id());
+	work_done += virtnet_poll_xsk(sq, budget);
 	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
@@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev,
 	if (napi_weight ^ vi->sq[0].napi.weight) {
 		if (dev->flags & IFF_UP)
 			return -EBUSY;
-		for (i = 0; i < vi->max_queue_pairs; i++)
+
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			/* xsk xmit depends on tx napi. So if xsk is active,
+			 * prevent modifications to tx napi.
+			 */
+			if (rtnl_dereference(vi->sq[i].xsk.pool))
+				continue;
+
 			vi->sq[i].napi.weight = napi_weight;
+		}
 	}
 
 	return 0;
@@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
 		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_SETUP_XSK_POOL:
+		return virtnet_xsk_pool_setup(dev, xdp);
 	default:
 		return -EINVAL;
 	}
@@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
 	.ndo_bpf		= virtnet_xdp,
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
+	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 	.ndo_set_features	= virtnet_set_features,
@@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi)
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		struct virtqueue *vq = vi->sq[i].vq;
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (!is_xdp_frame(buf))
+			if (is_skb_ptr(buf))
 				dev_kfree_skb(buf);
-			else
+			else if (is_xdp_frame(buf))
 				xdp_return_frame(ptr_to_xdp(buf));
+			else
+				virtnet_xsk_ctx_tx_put(ptr_to_xsk(buf));
 		}
 	}
 
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 931cc81f92fb..e3da829887dc 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -135,6 +135,16 @@ struct send_queue {
 	struct virtnet_sq_stats stats;
 
 	struct napi_struct napi;
+
+	struct {
+		struct xsk_buff_pool __rcu *pool;
+
+		/* xsk waits for a tx interrupt or softirq */
+		bool need_wakeup;
+
+		/* ctx used to record the page added to vq */
+		struct virtnet_xsk_ctx_head *ctx_head;
+	} xsk;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -188,6 +198,13 @@ static inline void virtqueue_napi_schedule(struct napi_struct *napi,
 	}
 }
 
+#include "xsk.h"
+
+static inline bool is_skb_ptr(void *ptr)
+{
+	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
+}
+
 static inline bool is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
@@ -206,25 +223,39 @@ static inline struct xdp_frame *ptr_to_xdp(void *ptr)
 static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
 				   struct virtnet_sq_stats *stats)
 {
+	unsigned int xsknum = 0;
 	unsigned int len;
 	void *ptr;
 
 	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (!is_xdp_frame(ptr)) {
+		if (is_skb_ptr(ptr)) {
 			struct sk_buff *skb = ptr;
 
 			pr_debug("Sent skb %p\n", skb);
 
 			stats->bytes += skb->len;
 			napi_consume_skb(skb, in_napi);
-		} else {
+		} else if (is_xdp_frame(ptr)) {
 			struct xdp_frame *frame = ptr_to_xdp(ptr);
 
 			stats->bytes += frame->len;
 			xdp_return_frame(frame);
+		} else {
+			struct virtnet_xsk_ctx_tx *ctx;
+
+			ctx = ptr_to_xsk(ptr);
+
+			/* This ptr may have been queued by a previous, now-unbound xsk. */
+			if (ctx->ctx.head->active)
+				++xsknum;
+
+			stats->bytes += ctx->len;
+			virtnet_xsk_ctx_tx_put(ctx);
 		}
 		stats->packets++;
 	}
-}
 
+	if (xsknum)
+		virtnet_xsk_complete(sq, xsknum);
+}
 #endif
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
new file mode 100644
index 000000000000..f98b68576709
--- /dev/null
+++ b/drivers/net/virtio/xsk.c
@@ -0,0 +1,346 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio-net xsk
+ */
+
+#include "virtio_net.h"
+
+static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
+
+static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
+{
+	struct virtnet_xsk_ctx *ctx;
+
+	ctx = head->ctx;
+	head->ctx = ctx->next;
+
+	++head->ref;
+
+	return ctx;
+}
+
+#define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
+
+static void virtnet_xsk_check_queue(struct send_queue *sq)
+{
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct net_device *dev = vi->dev;
+	int qnum = sq - vi->sq;
+
+	/* An xdp raw buffer queue does not check whether the queue has
+	 * been stopped when sending, so there is no need to stop or check
+	 * such a queue here.
+	 */
+	if (is_xdp_raw_buffer_queue(vi, qnum))
+		return;
+
+	/* If this sq is not exclusively owned by the current cpu, it may
+	 * also be used by start_xmit, so check whether it is running out
+	 * of space.
+	 *
+	 * Stop the queue to avoid getting packets that we are then
+	 * unable to transmit, and wait for the tx interrupt.
+	 */
+	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
+		netif_stop_subqueue(dev, qnum);
+}
+
+void virtnet_xsk_complete(struct send_queue *sq, u32 num)
+{
+	struct xsk_buff_pool *pool;
+
+	rcu_read_lock();
+	pool = rcu_dereference(sq->xsk.pool);
+	if (!pool) {
+		rcu_read_unlock();
+		return;
+	}
+	xsk_tx_completed(pool, num);
+	rcu_read_unlock();
+
+	if (sq->xsk.need_wakeup) {
+		sq->xsk.need_wakeup = false;
+		virtqueue_napi_schedule(&sq->napi, sq->vq);
+	}
+}
+
+static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
+			    struct xdp_desc *desc)
+{
+	struct virtnet_xsk_ctx_tx *ctx;
+	struct virtnet_info *vi;
+	u32 offset, n, len;
+	struct page *page;
+	void *data;
+
+	vi = sq->vq->vdev->priv;
+
+	data = xsk_buff_raw_get_data(pool, desc->addr);
+	offset = offset_in_page(data);
+
+	ctx = virtnet_xsk_ctx_tx_get(sq->xsk.ctx_head);
+
+	/* xsk unaligned mode, desc may use two pages */
+	if (desc->len > PAGE_SIZE - offset)
+		n = 3;
+	else
+		n = 2;
+
+	sg_init_table(sq->sg, n);
+	sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
+
+	/* handle for xsk first page */
+	len = min_t(int, desc->len, PAGE_SIZE - offset);
+	page = vmalloc_to_page(data);
+	sg_set_page(sq->sg + 1, page, len, offset);
+
+	/* ctx records and holds a reference on this page so that the xsk
+	 * memory cannot be released before this xmit is recycled
+	 */
+	ctx->ctx.page = page;
+	get_page(page);
+
+	/* xsk unaligned mode, handle for the second page */
+	if (len < desc->len) {
+		page = vmalloc_to_page(data + len);
+		len = min_t(int, desc->len - len, PAGE_SIZE);
+		sg_set_page(sq->sg + 2, page, len, 0);
+
+		ctx->ctx.page_unaligned = page;
+		get_page(page);
+	} else {
+		ctx->ctx.page_unaligned = NULL;
+	}
+
+	return virtqueue_add_outbuf(sq->vq, sq->sg, n,
+				   xsk_to_ptr(ctx), GFP_ATOMIC);
+}
+
+static int virtnet_xsk_xmit_batch(struct send_queue *sq,
+				  struct xsk_buff_pool *pool,
+				  unsigned int budget,
+				  bool in_napi, int *done,
+				  struct virtnet_sq_stats *stats)
+{
+	struct xdp_desc desc;
+	int err, packet = 0;
+	int ret = -EAGAIN;
+
+	while (budget-- > 0) {
+		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
+			ret = -EBUSY;
+			break;
+		}
+
+		if (!xsk_tx_peek_desc(pool, &desc)) {
+			/* done */
+			ret = 0;
+			break;
+		}
+
+		err = virtnet_xsk_xmit(sq, pool, &desc);
+		if (unlikely(err)) {
+			ret = -EBUSY;
+			break;
+		}
+
+		++packet;
+	}
+
+	if (packet) {
+		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
+			++stats->kicks;
+
+		*done += packet;
+		stats->xdp_tx += packet;
+
+		xsk_tx_release(pool);
+	}
+
+	return ret;
+}
+
+static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
+			   int budget, bool in_napi)
+{
+	struct virtnet_sq_stats stats = {};
+	int done = 0;
+	int err;
+
+	sq->xsk.need_wakeup = false;
+	__free_old_xmit(sq, in_napi, &stats);
+
+	/* return err:
+	 * -EAGAIN: done == budget
+	 * -EBUSY:  done < budget
+	 *  0    :  done < budget
+	 */
+xmit:
+	err = virtnet_xsk_xmit_batch(sq, pool, budget - done, in_napi,
+				     &done, &stats);
+	if (err == -EBUSY) {
+		__free_old_xmit(sq, in_napi, &stats);
+
+		/* If the space is enough, let napi run again. */
+		if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+			goto xmit;
+		else
+			sq->xsk.need_wakeup = true;
+	}
+
+	virtnet_xsk_check_queue(sq);
+
+	u64_stats_update_begin(&sq->stats.syncp);
+	sq->stats.packets += stats.packets;
+	sq->stats.bytes += stats.bytes;
+	sq->stats.kicks += stats.kicks;
+	sq->stats.xdp_tx += stats.xdp_tx;
+	u64_stats_update_end(&sq->stats.syncp);
+
+	return done;
+}
+
+int virtnet_poll_xsk(struct send_queue *sq, int budget)
+{
+	struct xsk_buff_pool *pool;
+	int work_done = 0;
+
+	rcu_read_lock();
+	pool = rcu_dereference(sq->xsk.pool);
+	if (pool)
+		work_done = virtnet_xsk_run(sq, pool, budget, true);
+	rcu_read_unlock();
+	return work_done;
+}
+
+int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct xsk_buff_pool *pool;
+	struct send_queue *sq;
+
+	if (!netif_running(dev))
+		return -ENETDOWN;
+
+	if (qid >= vi->curr_queue_pairs)
+		return -EINVAL;
+
+	sq = &vi->sq[qid];
+
+	rcu_read_lock();
+	pool = rcu_dereference(sq->xsk.pool);
+	if (pool) {
+		local_bh_disable();
+		virtqueue_napi_schedule(&sq->napi, sq->vq);
+		local_bh_enable();
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
+							  struct virtqueue *vq)
+{
+	struct virtnet_xsk_ctx_head *head;
+	u32 size, n, ring_size, ctx_sz;
+	struct virtnet_xsk_ctx *ctx;
+	void *p;
+
+	ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
+
+	ring_size = virtqueue_get_vring_size(vq);
+	size = sizeof(*head) + ctx_sz * ring_size;
+
+	head = kmalloc(size, GFP_ATOMIC);
+	if (!head)
+		return NULL;
+
+	memset(head, 0, sizeof(*head));
+
+	head->active = true;
+	head->frame_size = xsk_pool_get_rx_frame_size(pool);
+
+	p = head + 1;
+	for (n = 0; n < ring_size; ++n) {
+		ctx = p;
+		ctx->head = head;
+		ctx->next = head->ctx;
+		head->ctx = ctx;
+
+		p += ctx_sz;
+	}
+
+	return head;
+}
+
+static int virtnet_xsk_pool_enable(struct net_device *dev,
+				   struct xsk_buff_pool *pool,
+				   u16 qid)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct send_queue *sq;
+
+	if (qid >= vi->curr_queue_pairs)
+		return -EINVAL;
+
+	sq = &vi->sq[qid];
+
+	/* xsk zerocopy depends on tx napi.
+	 *
+	 * All data is actually consumed and sent out from the xsk tx queue
+	 * under the tx napi mechanism.
+	 */
+	if (!sq->napi.weight)
+		return -EPERM;
+
+	memset(&sq->xsk, 0, sizeof(sq->xsk));
+
+	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
+	if (!sq->xsk.ctx_head)
+		return -ENOMEM;
+
+	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
+	 * safe.
+	 */
+	rcu_assign_pointer(sq->xsk.pool, pool);
+
+	return 0;
+}
+
+static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct send_queue *sq;
+
+	if (qid >= vi->curr_queue_pairs)
+		return -EINVAL;
+
+	sq = &vi->sq[qid];
+
+	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
+	 * safe.
+	 */
+	rcu_assign_pointer(sq->xsk.pool, NULL);
+
+	/* Sync with the XSK wakeup and with NAPI. */
+	synchronize_net();
+
+	if (READ_ONCE(sq->xsk.ctx_head->ref))
+		WRITE_ONCE(sq->xsk.ctx_head->active, false);
+	else
+		kfree(sq->xsk.ctx_head);
+
+	sq->xsk.ctx_head = NULL;
+
+	return 0;
+}
+
+int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	if (xdp->xsk.pool)
+		return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
+					       xdp->xsk.queue_id);
+	else
+		return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
+}
+
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
new file mode 100644
index 000000000000..54948e0b07fc
--- /dev/null
+++ b/drivers/net/virtio/xsk.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef __XSK_H__
+#define __XSK_H__
+
+#define VIRTIO_XSK_FLAG	BIT(1)
+
+/* When an xsk is disabled, a network card would normally reclaim all the
+ * memory that has been submitted for transmit and all the memory added to
+ * the rq queue by destroying (resetting) the queue.
+ *
+ * But a virtio queue cannot be disabled individually, and resetting the
+ * whole device is not a good fit here.
+ *
+ * The approach taken instead is that every chunk sent or added to the rq
+ * queue is described by an independent struct virtnet_xsk_ctx.
+ *
+ * get_page() is called on the page where each chunk is located, and that
+ * page is recorded in struct virtnet_xsk_ctx, so the chunks sitting in the
+ * vq stay valid. When a chunk is recycled, the page reference is put again.
+ *
+ * Each ctx points to a struct virtnet_xsk_ctx_head, whose ref counts how
+ * many chunks have not yet been reclaimed. active == 0 means the xsk has
+ * been disabled.
+ *
+ * This way, even if the xsk has been unbound from the rq/sq, or a new xsk
+ * is bound and a new virtnet_xsk_ctx_head is created, recycling of the old
+ * virtnet_xsk_ctx entries is unaffected; the head and all ctx are freed
+ * once ref drops to 0.
+ */
+struct virtnet_xsk_ctx;
+struct virtnet_xsk_ctx_head {
+	struct virtnet_xsk_ctx *ctx;
+
+	/* how many ctx have been added to the vq */
+	u64 ref;
+
+	unsigned int frame_size;
+
+	/* the xsk status */
+	bool active;
+};
+
+struct virtnet_xsk_ctx {
+	struct virtnet_xsk_ctx_head *head;
+	struct virtnet_xsk_ctx *next;
+
+	struct page *page;
+
+	/* xsk unaligned mode may use two pages in one desc */
+	struct page *page_unaligned;
+};
+
+struct virtnet_xsk_ctx_tx {
+	/* this *MUST* be the first */
+	struct virtnet_xsk_ctx ctx;
+
+	/* xsk tx xmit uses this to record the packet length */
+	u32 len;
+};
+
+static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
+{
+	return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
+}
+
+static inline struct virtnet_xsk_ctx_tx *ptr_to_xsk(void *ptr)
+{
+	unsigned long p;
+
+	p = (unsigned long)ptr;
+	return (struct virtnet_xsk_ctx_tx *)(p & ~VIRTIO_XSK_FLAG);
+}
+
+static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
+{
+	put_page(ctx->page);
+	if (ctx->page_unaligned)
+		put_page(ctx->page_unaligned);
+
+	--ctx->head->ref;
+
+	if (ctx->head->active) {
+		ctx->next = ctx->head->ctx;
+		ctx->head->ctx = ctx;
+	} else {
+		if (!ctx->head->ref)
+			kfree(ctx->head);
+	}
+}
+
+#define virtnet_xsk_ctx_tx_put(ctx) \
+	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
+
+int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
+int virtnet_poll_xsk(struct send_queue *sq, int budget);
+void virtnet_xsk_complete(struct send_queue *sq, u32 num);
+int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
+#endif
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Compared to the xsk tx case, xsk zc rx is more complicated.

When we process a buf received from the vq, it may be an ordinary
buffer or an xsk buffer. What makes the situation more complicated is
that in the mergeable case, when num_buffers > 1, xsk buffers and
ordinary buffers may be mixed within one packet.

Another complication is that by the time we get an xsk buffer back from
the vq, the xsk that this buffer was bound to may already have
been unbound (see the dispatch sketch below).
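
A rough dispatch sketch, using the helpers introduced in this patch
(condensed from receive_buf()/receive_xsk(); not the literal code):

	buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);

	if (is_xsk_ctx(ctx)) {
		/* xsk buffer: if its pool has already been unbound, the
		 * original xdp_buff is gone and the data is copied out of
		 * the still-referenced pages via virtnet_xsk_ctx_rx_copy()
		 */
		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
	} else if (vi->mergeable_rx_bufs) {
		/* ordinary buffer: the following num_buffers - 1 bufs may
		 * themselves be a mix of xsk and ordinary buffers
		 */
		skb = receive_mergeable(dev, vi, rq, buf, ctx, len,
					xdp_xmit, stats);
	}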

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio/virtio_net.c | 238 ++++++++++++++-----
 drivers/net/virtio/virtio_net.h |  27 +++
 drivers/net/virtio/xsk.c        | 396 +++++++++++++++++++++++++++++++-
 drivers/net/virtio/xsk.h        |  75 ++++++
 4 files changed, 678 insertions(+), 58 deletions(-)

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 40d7751f1c5f..9503133e71f0 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -125,11 +125,6 @@ static int rxq2vq(int rxq)
 	return rxq * 2;
 }
 
-static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb)
-{
-	return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
-}
-
 /*
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
@@ -458,6 +453,68 @@ static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
 	return vi->xdp_enabled ? VIRTIO_XDP_HEADROOM : 0;
 }
 
+/* return value:
+ *  1: XDP_PASS should handle to build skb
+ * -1: xdp err, should handle to free the buf and return NULL
+ *  0: buf has been consumed by xdp
+ */
+int virtnet_run_xdp(struct net_device *dev,
+		    struct bpf_prog *xdp_prog,
+		    struct xdp_buff *xdp,
+		    unsigned int *xdp_xmit,
+		    struct virtnet_rq_stats *stats)
+{
+	struct xdp_frame *xdpf;
+	int act, err;
+
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	stats->xdp_packets++;
+
+	switch (act) {
+	case XDP_PASS:
+		return 1;
+
+	case XDP_TX:
+		stats->xdp_tx++;
+		xdpf = xdp_convert_buff_to_frame(xdp);
+		if (unlikely(!xdpf))
+			goto err_xdp;
+		err = virtnet_xdp_xmit(dev, 1, &xdpf, 0);
+		if (unlikely(!err)) {
+			xdp_return_frame_rx_napi(xdpf);
+		} else if (unlikely(err < 0)) {
+			trace_xdp_exception(dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit |= VIRTIO_XDP_TX;
+		return 0;
+
+	case XDP_REDIRECT:
+		stats->xdp_redirects++;
+		err = xdp_do_redirect(dev, xdp, xdp_prog);
+		if (err)
+			goto err_xdp;
+
+		*xdp_xmit |= VIRTIO_XDP_REDIR;
+		return 0;
+
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		fallthrough;
+
+	case XDP_ABORTED:
+		trace_xdp_exception(dev, xdp_prog, act);
+		fallthrough;
+
+	case XDP_DROP:
+		break;
+	}
+
+err_xdp:
+	stats->xdp_drops++;
+	return -1;
+}
+
 /* We copy the packet for XDP in the following cases:
  *
  * 1) Packet is scattered across multiple rx buffers.
@@ -491,27 +548,40 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 		int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 		unsigned int buflen;
 		void *buf;
+		void *ctx;
 		int off;
 
-		buf = virtqueue_get_buf(rq->vq, &buflen);
+		buf = virtqueue_get_buf_ctx(rq->vq, &buflen, &ctx);
 		if (unlikely(!buf))
 			goto err_buf;
 
-		p = virt_to_head_page(buf);
-		off = buf - page_address(p);
-
 		/* guard against a misconfigured or uncooperative backend that
 		 * is sending packet larger than the MTU.
 		 */
 		if ((page_off + buflen + tailroom) > PAGE_SIZE) {
-			put_page(p);
+			virtnet_rx_put_buf(buf, ctx);
 			goto err_buf;
 		}
 
-		memcpy(page_address(page) + page_off,
-		       page_address(p) + off, buflen);
+		if (is_xsk_ctx(ctx)) {
+			struct virtnet_xsk_ctx_rx *xsk_ctx;
+
+			xsk_ctx = (struct virtnet_xsk_ctx_rx *)buf;
+
+			virtnet_xsk_ctx_rx_copy(xsk_ctx,
+						page_address(page) + page_off,
+						buflen, true);
+
+			virtnet_xsk_ctx_rx_put(xsk_ctx);
+		} else {
+			p = virt_to_head_page(buf);
+			off = buf - page_address(p);
+
+			memcpy(page_address(page) + page_off,
+			       page_address(p) + off, buflen);
+			put_page(p);
+		}
 		page_off += buflen;
-		put_page(p);
 	}
 
 	/* Headroom does not contribute to packet length */
@@ -522,17 +592,16 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 	return NULL;
 }
 
-static void merge_drop_follow_bufs(struct net_device *dev,
-				   struct receive_queue *rq,
-				   u16 num_buf,
-				   struct virtnet_rq_stats *stats)
+void merge_drop_follow_bufs(struct net_device *dev,
+			    struct receive_queue *rq,
+			    u16 num_buf,
+			    struct virtnet_rq_stats *stats)
 {
-	struct page *page;
 	unsigned int len;
-	void *buf;
+	void *buf, *ctx;
 
 	while (num_buf-- > 1) {
-		buf = virtqueue_get_buf(rq->vq, &len);
+		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
 		if (unlikely(!buf)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
@@ -540,23 +609,80 @@ static void merge_drop_follow_bufs(struct net_device *dev,
 			break;
 		}
 		stats->bytes += len;
-		page = virt_to_head_page(buf);
-		put_page(page);
+		virtnet_rx_put_buf(buf, ctx);
+	}
+}
+
+static char *merge_get_follow_buf(struct net_device *dev,
+				  struct receive_queue *rq,
+				  int *plen, int *ptruesize,
+				  int index, int total)
+{
+	struct virtnet_xsk_ctx_rx *xsk_ctx;
+	unsigned int truesize;
+	char *buf;
+	void *ctx;
+	int len;
+
+	buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
+	if (unlikely(!buf)) {
+		pr_debug("%s: rx error: %d buffers out of %d missing\n",
+			 dev->name, index, total);
+		dev->stats.rx_length_errors++;
+		return NULL;
+	}
+
+	if (is_xsk_ctx(ctx)) {
+		xsk_ctx = (struct virtnet_xsk_ctx_rx *)buf;
+
+		if (unlikely(len > xsk_ctx->ctx.head->truesize)) {
+			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
+				 dev->name, len, (unsigned long)ctx);
+			dev->stats.rx_length_errors++;
+			virtnet_xsk_ctx_rx_put(xsk_ctx);
+			return ERR_PTR(-EDQUOT);
+		}
+
+		truesize = len;
+
+		buf = virtnet_alloc_frag(rq, truesize, GFP_ATOMIC);
+		if (unlikely(!buf)) {
+			virtnet_xsk_ctx_rx_put(xsk_ctx);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		virtnet_xsk_ctx_rx_copy(xsk_ctx, buf, len, true);
+		virtnet_xsk_ctx_rx_put(xsk_ctx);
+	} else {
+		truesize = mergeable_ctx_to_truesize(ctx);
+		if (unlikely(len > truesize)) {
+			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
+				 dev->name, len, (unsigned long)ctx);
+			dev->stats.rx_length_errors++;
+
+			put_page(virt_to_head_page(buf));
+			return ERR_PTR(-EDQUOT);
+		}
 	}
+
+	*plen = len;
+	*ptruesize = truesize;
+
+	return buf;
 }
 
-static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
-						 struct virtnet_info *vi,
-						 struct receive_queue *rq,
-						 struct sk_buff *head_skb,
-						 u16 num_buf,
-						 struct virtnet_rq_stats *stats)
+struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
+					  struct virtnet_info *vi,
+					  struct receive_queue *rq,
+					  struct sk_buff *head_skb,
+					  u16 num_buf,
+					  struct virtnet_rq_stats *stats)
 {
 	struct sk_buff *curr_skb;
 	unsigned int truesize;
 	unsigned int len, num;
 	struct page *page;
-	void *buf, *ctx;
+	void *buf;
 	int offset;
 
 	curr_skb = head_skb;
@@ -565,25 +691,17 @@ static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
-		if (unlikely(!buf)) {
-			pr_debug("%s: rx error: %d buffers out of %d missing\n",
-				 dev->name, num_buf, num);
-			dev->stats.rx_length_errors++;
+		buf = merge_get_follow_buf(dev, rq, &len, &truesize,
+					   num_buf, num);
+		if (unlikely(!buf))
 			goto err_buf;
-		}
+
+		if (IS_ERR(buf))
+			goto err_drop;
 
 		stats->bytes += len;
 		page = virt_to_head_page(buf);
 
-		truesize = mergeable_ctx_to_truesize(ctx);
-		if (unlikely(len > truesize)) {
-			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
-				 dev->name, len, (unsigned long)ctx);
-			dev->stats.rx_length_errors++;
-			goto err_skb;
-		}
-
 		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
 		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
 			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
@@ -618,6 +736,7 @@ static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
 
 err_skb:
 	put_page(page);
+err_drop:
 	merge_drop_follow_bufs(dev, rq, num_buf, stats);
 err_buf:
 	stats->drops++;
@@ -982,16 +1101,18 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
 		if (vi->mergeable_rx_bufs) {
-			put_page(virt_to_head_page(buf));
+			virtnet_rx_put_buf(buf, ctx);
 		} else if (vi->big_packets) {
 			give_pages(rq, buf);
 		} else {
-			put_page(virt_to_head_page(buf));
+			virtnet_rx_put_buf(buf, ctx);
 		}
 		return;
 	}
 
-	if (vi->mergeable_rx_bufs)
+	if (is_xsk_ctx(ctx))
+		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
+	else if (vi->mergeable_rx_bufs)
 		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
 					stats);
 	else if (vi->big_packets)
@@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 	int err;
 	bool oom;
 
+	/* Because virtio-net does not yet support rx flow steering, every
+	 * rx channel must also handle normal non-xsk packets. If no xsk
+	 * buf is available for the moment, fall back to allocating normal
+	 * memory for the channel.
+	 */
+	if (fill_recv_xsk(vi, rq, gfp))
+		goto kick;
+
 	do {
 		if (vi->mergeable_rx_bufs)
 			err = add_recvbuf_mergeable(vi, rq, gfp);
@@ -1187,6 +1316,8 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 		if (err)
 			break;
 	} while (rq->vq->num_free);
+
+kick:
 	if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
 		unsigned long flags;
 
@@ -2575,7 +2706,7 @@ static void free_receive_page_frags(struct virtnet_info *vi)
 
 static void free_unused_bufs(struct virtnet_info *vi)
 {
-	void *buf;
+	void *buf, *ctx;
 	int i;
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
@@ -2593,14 +2724,13 @@ static void free_unused_bufs(struct virtnet_info *vi)
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		struct virtqueue *vq = vi->rq[i].vq;
 
-		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs) {
-				put_page(virt_to_head_page(buf));
-			} else if (vi->big_packets) {
+		while ((buf = virtqueue_detach_unused_buf_ctx(vq, &ctx)) != NULL) {
+			if (vi->mergeable_rx_bufs)
+				virtnet_rx_put_buf(buf, ctx);
+			else if (vi->big_packets)
 				give_pages(&vi->rq[i], buf);
-			} else {
-				put_page(virt_to_head_page(buf));
-			}
+			else
+				virtnet_rx_put_buf(buf, ctx);
 		}
 	}
 }
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index e3da829887dc..70af880f469d 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -177,8 +177,23 @@ struct receive_queue {
 	char name[40];
 
 	struct xdp_rxq_info xdp_rxq;
+
+	struct {
+		struct xsk_buff_pool __rcu *pool;
+
+		/* xdp rxq used by xsk */
+		struct xdp_rxq_info xdp_rxq;
+
+		/* ctx used to record the page added to vq */
+		struct virtnet_xsk_ctx_head *ctx_head;
+	} xsk;
 };
 
+static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb)
+{
+	return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
+}
+
 static inline bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
 {
 	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
@@ -258,4 +273,16 @@ static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
 	if (xsknum)
 		virtnet_xsk_complete(sq, xsknum);
 }
+
+int virtnet_run_xdp(struct net_device *dev, struct bpf_prog *xdp_prog,
+		    struct xdp_buff *xdp, unsigned int *xdp_xmit,
+		    struct virtnet_rq_stats *stats);
+struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
+					  struct virtnet_info *vi,
+					  struct receive_queue *rq,
+					  struct sk_buff *head_skb,
+					  u16 num_buf,
+					  struct virtnet_rq_stats *stats);
+void merge_drop_follow_bufs(struct net_device *dev, struct receive_queue *rq,
+			    u16 num_buf, struct virtnet_rq_stats *stats);
 #endif
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index f98b68576709..36cda2dcf8e7 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -20,6 +20,75 @@ static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *
 }
 
 #define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
+#define virtnet_xsk_ctx_rx_get(head) ((struct virtnet_xsk_ctx_rx *)virtnet_xsk_ctx_get(head))
+
+static unsigned int virtnet_receive_buf_num(struct virtnet_info *vi, char *buf)
+{
+	struct virtio_net_hdr_mrg_rxbuf *hdr;
+
+	if (vi->mergeable_rx_bufs) {
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)buf;
+		return virtio16_to_cpu(vi->vdev, hdr->num_buffers);
+	}
+
+	return 1;
+}
+
+/* when the xsk rx ctx references two pages, copy to dst from both pages */
+static void virtnet_xsk_rx_ctx_merge(struct virtnet_xsk_ctx_rx *ctx,
+				     char *dst, unsigned int len)
+{
+	unsigned int size;
+	int offset;
+	char *src;
+
+	/* data starts in the first page */
+	if (ctx->offset >= 0) {
+		offset = ctx->offset;
+
+		size = min_t(int, PAGE_SIZE - offset, len);
+		src = page_address(ctx->ctx.page) + offset;
+		memcpy(dst, src, size);
+
+		if (len > size) {
+			src = page_address(ctx->ctx.page_unaligned);
+			memcpy(dst + size, src, len - size);
+		}
+
+	} else {
+		offset = -ctx->offset;
+
+		src = page_address(ctx->ctx.page_unaligned) + offset;
+
+		memcpy(dst, src, len);
+	}
+}
+
+/* copy ctx to dst; the caller must make sure that len is safe */
+void virtnet_xsk_ctx_rx_copy(struct virtnet_xsk_ctx_rx *ctx,
+			     char *dst, unsigned int len,
+			     bool hdr)
+{
+	char *src;
+	int size;
+
+	if (hdr) {
+		size = min_t(int, ctx->ctx.head->hdr_len, len);
+		memcpy(dst, &ctx->hdr, size);
+		len -= size;
+		if (!len)
+			return;
+		dst += size;
+	}
+
+	if (!ctx->ctx.page_unaligned) {
+		src = page_address(ctx->ctx.page) + ctx->offset;
+		memcpy(dst, src, len);
+
+	} else {
+		virtnet_xsk_rx_ctx_merge(ctx, dst, len);
+	}
+}
 
 static void virtnet_xsk_check_queue(struct send_queue *sq)
 {
@@ -45,6 +114,267 @@ static void virtnet_xsk_check_queue(struct send_queue *sq)
 		netif_stop_subqueue(dev, qnum);
 }
 
+static struct sk_buff *virtnet_xsk_construct_skb_xdp(struct receive_queue *rq,
+						     struct xdp_buff *xdp)
+{
+	unsigned int metasize = xdp->data - xdp->data_meta;
+	struct sk_buff *skb;
+	unsigned int size;
+
+	size = xdp->data_end - xdp->data_hard_start;
+	skb = napi_alloc_skb(&rq->napi, size);
+	if (unlikely(!skb))
+		return NULL;
+
+	skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
+
+	size = xdp->data_end - xdp->data_meta;
+	memcpy(__skb_put(skb, size), xdp->data_meta, size);
+
+	if (metasize) {
+		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
+	}
+
+	return skb;
+}
+
+static struct sk_buff *virtnet_xsk_construct_skb_ctx(struct net_device *dev,
+						     struct virtnet_info *vi,
+						     struct receive_queue *rq,
+						     struct virtnet_xsk_ctx_rx *ctx,
+						     unsigned int len,
+						     struct virtnet_rq_stats *stats)
+{
+	struct virtio_net_hdr_mrg_rxbuf *hdr;
+	struct sk_buff *skb;
+	int num_buf;
+	char *dst;
+
+	len -= vi->hdr_len;
+
+	skb = napi_alloc_skb(&rq->napi, len);
+	if (unlikely(!skb))
+		return NULL;
+
+	dst = __skb_put(skb, len);
+
+	virtnet_xsk_ctx_rx_copy(ctx, dst, len, false);
+
+	num_buf = virtnet_receive_buf_num(vi, (char *)&ctx->hdr);
+	if (num_buf > 1) {
+		skb = merge_receive_follow_bufs(dev, vi, rq, skb, num_buf,
+						stats);
+		if (!skb)
+			return NULL;
+	}
+
+	hdr = skb_vnet_hdr(skb);
+	memcpy(hdr, &ctx->hdr, vi->hdr_len);
+
+	return skb;
+}
+
+/* len does not include the virtio-net hdr */
+static struct xdp_buff *virtnet_xsk_check_xdp(struct virtnet_info *vi,
+					      struct receive_queue *rq,
+					      struct virtnet_xsk_ctx_rx *ctx,
+					      struct xdp_buff *_xdp,
+					      unsigned int len)
+{
+	struct xdp_buff *xdp;
+	struct page *page;
+	int frame_sz;
+	char *data;
+
+	if (ctx->ctx.head->active) {
+		xdp = ctx->xdp;
+		xdp->data_end = xdp->data + len;
+
+		return xdp;
+	}
+
+	/* ctx->xdp is invalid because it has already been released */
+
+	if (!ctx->ctx.page_unaligned) {
+		data = page_address(ctx->ctx.page) + ctx->offset;
+		page = ctx->ctx.page;
+	} else {
+		page = alloc_page(GFP_ATOMIC);
+		if (!page)
+			return NULL;
+
+		data = page_address(page) + ctx->headroom;
+
+		virtnet_xsk_rx_ctx_merge(ctx, data, len);
+
+		put_page(ctx->ctx.page);
+		put_page(ctx->ctx.page_unaligned);
+
+		/* page will be put when the ctx is put */
+		ctx->ctx.page = page;
+		ctx->ctx.page_unaligned = NULL;
+	}
+
+	/* If xdp consumes the data via XDP_REDIRECT/XDP_TX, the page
+	 * ref will be decremented, so call get_page() here.
+	 *
+	 * If the xdp buff has been consumed, that page ref is dropped
+	 * automatically and virtnet_xsk_ctx_rx_put() drops the other ref.
+	 *
+	 * If the xdp buff has not been consumed, put_page() is called once
+	 * manually before virtnet_xsk_ctx_rx_put().
+	 */
+	get_page(page);
+
+	xdp = _xdp;
+
+	frame_sz = ctx->ctx.head->frame_size + ctx->headroom;
+
+	/* use xdp rxq without MEM_TYPE_XSK_BUFF_POOL */
+	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(xdp, data - ctx->headroom, ctx->headroom, len, true);
+
+	return xdp;
+}
+
+int add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue *rq,
+		    struct xsk_buff_pool *pool, gfp_t gfp)
+{
+	struct page *page, *page_start, *page_end;
+	unsigned long data, data_end, data_start;
+	struct virtnet_xsk_ctx_rx *ctx;
+	struct xdp_buff *xsk_xdp;
+	int err, size, n;
+	u32 offset;
+
+	xsk_xdp = xsk_buff_alloc(pool);
+	if (!xsk_xdp)
+		return -ENOMEM;
+
+	ctx = virtnet_xsk_ctx_rx_get(rq->xsk.ctx_head);
+
+	ctx->xdp = xsk_xdp;
+	ctx->headroom = xsk_xdp->data - xsk_xdp->data_hard_start;
+
+	offset = offset_in_page(xsk_xdp->data);
+
+	data_start = (unsigned long)xsk_xdp->data_hard_start;
+	data       = (unsigned long)xsk_xdp->data;
+	data_end   = data + ctx->ctx.head->frame_size - 1;
+
+	page_start = vmalloc_to_page((void *)data_start);
+
+	ctx->ctx.page = page_start;
+	get_page(page_start);
+
+	if ((data_end & PAGE_MASK) == (data_start & PAGE_MASK)) {
+		page_end = page_start;
+		page = page_start;
+		ctx->offset = offset;
+
+		ctx->ctx.page_unaligned = NULL;
+		n = 2;
+	} else {
+		page_end = vmalloc_to_page((void *)data_end);
+
+		ctx->ctx.page_unaligned = page_end;
+		get_page(page_end);
+
+		if ((data_start & PAGE_MASK) == (data & PAGE_MASK)) {
+			page = page_start;
+			ctx->offset = offset;
+			n = 3;
+		} else {
+			page = page_end;
+			ctx->offset = -offset;
+			n = 2;
+		}
+	}
+
+	size = min_t(int, PAGE_SIZE - offset, ctx->ctx.head->frame_size);
+
+	sg_init_table(rq->sg, n);
+	sg_set_buf(rq->sg, &ctx->hdr, vi->hdr_len);
+	sg_set_page(rq->sg + 1, page, size, offset);
+
+	if (n == 3) {
+		size = ctx->ctx.head->frame_size - size;
+		sg_set_page(rq->sg + 2, page_end, size, 0);
+	}
+
+	err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, n, ctx,
+				      VIRTNET_XSK_BUFF_CTX, gfp);
+	if (err < 0)
+		virtnet_xsk_ctx_rx_put(ctx);
+
+	return err;
+}
+
+struct sk_buff *receive_xsk(struct net_device *dev, struct virtnet_info *vi,
+			    struct receive_queue *rq, void *buf,
+			    unsigned int len, unsigned int *xdp_xmit,
+			    struct virtnet_rq_stats *stats)
+{
+	struct virtnet_xsk_ctx_rx *ctx;
+	struct xsk_buff_pool *pool;
+	struct sk_buff *skb = NULL;
+	struct xdp_buff *xdp, _xdp;
+	struct bpf_prog *xdp_prog;
+	u16 num_buf = 1;
+	int ret;
+
+	ctx = (struct virtnet_xsk_ctx_rx *)buf;
+
+	rcu_read_lock();
+
+	pool     = rcu_dereference(rq->xsk.pool);
+	xdp_prog = rcu_dereference(rq->xdp_prog);
+	if (!pool || !xdp_prog)
+		goto skb;
+
+	/* this may happen when the xsk chunk size is too small. */
+	num_buf = virtnet_receive_buf_num(vi, (char *)&ctx->hdr);
+	if (num_buf > 1)
+		goto drop;
+
+	xdp = virtnet_xsk_check_xdp(vi, rq, ctx, &_xdp, len - vi->hdr_len);
+	if (!xdp)
+		goto drop;
+
+	ret = virtnet_run_xdp(dev, xdp_prog, xdp, xdp_xmit, stats);
+	if (unlikely(ret)) {
+		/* pairs with the get_page() inside virtnet_xsk_check_xdp() */
+		if (!ctx->ctx.head->active)
+			put_page(ctx->ctx.page);
+
+		if (unlikely(ret < 0))
+			goto drop;
+
+		/* XDP_PASS */
+		skb = virtnet_xsk_construct_skb_xdp(rq, xdp);
+	} else {
+		/* ctx->xdp has been consumed */
+		ctx->xdp = NULL;
+	}
+
+end:
+	virtnet_xsk_ctx_rx_put(ctx);
+	rcu_read_unlock();
+	return skb;
+
+skb:
+	skb = virtnet_xsk_construct_skb_ctx(dev, vi, rq, ctx, len, stats);
+	goto end;
+
+drop:
+	stats->drops++;
+
+	if (num_buf > 1)
+		merge_drop_follow_bufs(dev, rq, num_buf, stats);
+	goto end;
+}
+
 void virtnet_xsk_complete(struct send_queue *sq, u32 num)
 {
 	struct xsk_buff_pool *pool;
@@ -238,15 +568,20 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
 	return 0;
 }
 
-static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
-							  struct virtqueue *vq)
+static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct virtnet_info *vi,
+							  struct xsk_buff_pool *pool,
+							  struct virtqueue *vq,
+							  bool rx)
 {
 	struct virtnet_xsk_ctx_head *head;
 	u32 size, n, ring_size, ctx_sz;
 	struct virtnet_xsk_ctx *ctx;
 	void *p;
 
-	ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
+	if (rx)
+		ctx_sz = sizeof(struct virtnet_xsk_ctx_rx);
+	else
+		ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
 
 	ring_size = virtqueue_get_vring_size(vq);
 	size = sizeof(*head) + ctx_sz * ring_size;
@@ -259,6 +594,8 @@ static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *
 
 	head->active = true;
 	head->frame_size = xsk_pool_get_rx_frame_size(pool);
+	head->hdr_len = vi->hdr_len;
+	head->truesize = head->frame_size + vi->hdr_len;
 
 	p = head + 1;
 	for (n = 0; n < ring_size; ++n) {
@@ -278,12 +615,15 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 				   u16 qid)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct receive_queue *rq;
 	struct send_queue *sq;
+	int err;
 
 	if (qid >= vi->curr_queue_pairs)
 		return -EINVAL;
 
 	sq = &vi->sq[qid];
+	rq = &vi->rq[qid];
 
 	/* xsk zerocopy depend on the tx napi.
 	 *
@@ -295,31 +635,68 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 
 	memset(&sq->xsk, 0, sizeof(sq->xsk));
 
-	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
+	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(vi, pool, sq->vq, false);
 	if (!sq->xsk.ctx_head)
 		return -ENOMEM;
 
+	/* In big_packets mode, xdp cannot work, so there is no need to
+	 * initialize the xsk part of the rq.
+	 */
+	if (!vi->big_packets || vi->mergeable_rx_bufs) {
+		err = xdp_rxq_info_reg(&rq->xsk.xdp_rxq, dev, qid,
+				       rq->napi.napi_id);
+		if (err < 0)
+			goto err;
+
+		err = xdp_rxq_info_reg_mem_model(&rq->xsk.xdp_rxq,
+						 MEM_TYPE_XSK_BUFF_POOL, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
+			goto err;
+		}
+
+		rq->xsk.ctx_head = virtnet_xsk_ctx_alloc(vi, pool, rq->vq, true);
+		if (!rq->xsk.ctx_head) {
+			err = -ENOMEM;
+			goto err;
+		}
+
+		xsk_pool_set_rxq_info(pool, &rq->xsk.xdp_rxq);
+
+		/* Here is already protected by rtnl_lock, so rcu_assign_pointer
+		 * is safe.
+		 */
+		rcu_assign_pointer(rq->xsk.pool, pool);
+	}
+
 	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
 	 * safe.
 	 */
 	rcu_assign_pointer(sq->xsk.pool, pool);
 
 	return 0;
+
+err:
+	kfree(sq->xsk.ctx_head);
+	return err;
 }
 
 static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct receive_queue *rq;
 	struct send_queue *sq;
 
 	if (qid >= vi->curr_queue_pairs)
 		return -EINVAL;
 
 	sq = &vi->sq[qid];
+	rq = &vi->rq[qid];
 
 	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
 	 * safe.
 	 */
+	rcu_assign_pointer(rq->xsk.pool, NULL);
 	rcu_assign_pointer(sq->xsk.pool, NULL);
 
 	/* Sync with the XSK wakeup and with NAPI. */
@@ -332,6 +709,17 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
 
 	sq->xsk.ctx_head = NULL;
 
+	if (!vi->big_packets || vi->mergeable_rx_bufs) {
+		if (READ_ONCE(rq->xsk.ctx_head->ref))
+			WRITE_ONCE(rq->xsk.ctx_head->active, false);
+		else
+			kfree(rq->xsk.ctx_head);
+
+		rq->xsk.ctx_head = NULL;
+
+		xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index 54948e0b07fc..fe22cf78d505 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -5,6 +5,8 @@
 
 #define VIRTIO_XSK_FLAG	BIT(1)
 
+#define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
+
 /* When xsk disable, under normal circumstances, the network card must reclaim
  * all the memory that has been sent and the memory added to the rq queue by
  * destroying the queue.
@@ -36,6 +38,8 @@ struct virtnet_xsk_ctx_head {
 	u64 ref;
 
 	unsigned int frame_size;
+	unsigned int truesize;
+	unsigned int hdr_len;
 
 	/* the xsk status */
 	bool active;
@@ -59,6 +63,28 @@ struct virtnet_xsk_ctx_tx {
 	u32 len;
 };
 
+struct virtnet_xsk_ctx_rx {
+	/* this *MUST* be the first member */
+	struct virtnet_xsk_ctx ctx;
+
+	/* xdp buff obtained from the xsk pool */
+	struct xdp_buff *xdp;
+
+	/* offset of xdp.data inside its page */
+	int offset;
+
+	/* xsk xdp headroom */
+	unsigned int headroom;
+
+	/* keep the virtio hdr here so it does not occupy space in the xsk frame */
+	struct virtio_net_hdr_mrg_rxbuf hdr;
+};
+
+static inline bool is_xsk_ctx(void *ctx)
+{
+	return ctx == VIRTNET_XSK_BUFF_CTX;
+}
+
 static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
 {
 	return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
@@ -92,8 +118,57 @@ static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
 #define virtnet_xsk_ctx_tx_put(ctx) \
 	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
 
+static inline void virtnet_xsk_ctx_rx_put(struct virtnet_xsk_ctx_rx *ctx)
+{
+	if (ctx->xdp && ctx->ctx.head->active)
+		xsk_buff_free(ctx->xdp);
+
+	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx);
+}
+
+static inline void virtnet_rx_put_buf(char *buf, void *ctx)
+{
+	if (is_xsk_ctx(ctx))
+		virtnet_xsk_ctx_rx_put((struct virtnet_xsk_ctx_rx *)buf);
+	else
+		put_page(virt_to_head_page(buf));
+}
+
+void virtnet_xsk_ctx_rx_copy(struct virtnet_xsk_ctx_rx *ctx,
+			     char *dst, unsigned int len, bool hdr);
+int add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue *rq,
+		    struct xsk_buff_pool *pool, gfp_t gfp);
+struct sk_buff *receive_xsk(struct net_device *dev, struct virtnet_info *vi,
+			    struct receive_queue *rq, void *buf,
+			    unsigned int len, unsigned int *xdp_xmit,
+			    struct virtnet_rq_stats *stats);
 int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
 int virtnet_poll_xsk(struct send_queue *sq, int budget);
 void virtnet_xsk_complete(struct send_queue *sq, u32 num);
 int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
+
+static inline bool fill_recv_xsk(struct virtnet_info *vi,
+				 struct receive_queue *rq,
+				 gfp_t gfp)
+{
+	struct xsk_buff_pool *pool;
+	int err = 0;
+
+	rcu_read_lock();
+	pool = rcu_dereference(rq->xsk.pool);
+	if (pool) {
+		while (rq->vq->num_free >= 3) {
+			err = add_recvbuf_xsk(vi, rq, pool, gfp);
+			if (err)
+				break;
+		}
+	} else {
+		rcu_read_unlock();
+		return false;
+	}
+	rcu_read_unlock();
+
+	return err != -ENOMEM;
+}
+
 #endif
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
@ 2021-06-10  8:22   ` Xuan Zhuo
  0 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

Compared with xsk zero-copy tx, xsk zero-copy rx is more complicated.

When we process a buf received from the vq, it may be an ordinary
buffer or an xsk buffer. What complicates things further is that in the
mergeable case, when num_buffers > 1, xsk buffers and ordinary buffers
may be mixed within a single packet.

Another complication is that by the time we get an xsk buffer back from
the vq, the xsk socket it was bound to may already have been unbound.
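
For reference, the core of the dispatch is a sentinel ctx pointer:
every xsk buffer is queued with the magic VIRTNET_XSK_BUFF_CTX value,
so a single pointer compare on the receive side tells xsk buffers apart
from ordinary page-backed buffers, even when both kinds end up mixed in
one mergeable packet. A minimal stand-alone sketch of that idea (plain
user-space C, illustrative only; the handler functions are stand-ins,
not part of the patch):

#include <stdbool.h>
#include <stdio.h>

#define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)

static bool is_xsk_ctx(void *ctx)
{
	return ctx == VIRTNET_XSK_BUFF_CTX;
}

/* stand-ins for receive_xsk() / receive_mergeable() in the patch */
static void handle_xsk_buf(void *buf)      { printf("xsk buf %p\n", buf); }
static void handle_ordinary_buf(void *buf) { printf("normal buf %p\n", buf); }

static void dispatch_rx_buf(void *buf, void *ctx)
{
	if (is_xsk_ctx(ctx))
		handle_xsk_buf(buf);      /* buf is really a virtnet_xsk_ctx_rx */
	else
		handle_ordinary_buf(buf); /* buf points into a normal page */
}

int main(void)
{
	int a, b;

	/* one xsk buffer and one ordinary buffer coming back from the vq */
	dispatch_rx_buf(&a, VIRTNET_XSK_BUFF_CTX);
	dispatch_rx_buf(&b, NULL);
	return 0;
}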

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio/virtio_net.c | 238 ++++++++++++++-----
 drivers/net/virtio/virtio_net.h |  27 +++
 drivers/net/virtio/xsk.c        | 396 +++++++++++++++++++++++++++++++-
 drivers/net/virtio/xsk.h        |  75 ++++++
 4 files changed, 678 insertions(+), 58 deletions(-)

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 40d7751f1c5f..9503133e71f0 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -125,11 +125,6 @@ static int rxq2vq(int rxq)
 	return rxq * 2;
 }
 
-static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb)
-{
-	return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
-}
-
 /*
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
@@ -458,6 +453,68 @@ static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
 	return vi->xdp_enabled ? VIRTIO_XDP_HEADROOM : 0;
 }
 
+/* return value:
+ *  1: XDP_PASS, the caller should build an skb
+ * -1: xdp error, the caller should free the buf and return NULL
+ *  0: the buf has been consumed by xdp
+ */
+int virtnet_run_xdp(struct net_device *dev,
+		    struct bpf_prog *xdp_prog,
+		    struct xdp_buff *xdp,
+		    unsigned int *xdp_xmit,
+		    struct virtnet_rq_stats *stats)
+{
+	struct xdp_frame *xdpf;
+	int act, err;
+
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	stats->xdp_packets++;
+
+	switch (act) {
+	case XDP_PASS:
+		return 1;
+
+	case XDP_TX:
+		stats->xdp_tx++;
+		xdpf = xdp_convert_buff_to_frame(xdp);
+		if (unlikely(!xdpf))
+			goto err_xdp;
+		err = virtnet_xdp_xmit(dev, 1, &xdpf, 0);
+		if (unlikely(!err)) {
+			xdp_return_frame_rx_napi(xdpf);
+		} else if (unlikely(err < 0)) {
+			trace_xdp_exception(dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit |= VIRTIO_XDP_TX;
+		return 0;
+
+	case XDP_REDIRECT:
+		stats->xdp_redirects++;
+		err = xdp_do_redirect(dev, xdp, xdp_prog);
+		if (err)
+			goto err_xdp;
+
+		*xdp_xmit |= VIRTIO_XDP_REDIR;
+		return 0;
+
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		fallthrough;
+
+	case XDP_ABORTED:
+		trace_xdp_exception(dev, xdp_prog, act);
+		fallthrough;
+
+	case XDP_DROP:
+		break;
+	}
+
+err_xdp:
+	stats->xdp_drops++;
+	return -1;
+}
+
 /* We copy the packet for XDP in the following cases:
  *
  * 1) Packet is scattered across multiple rx buffers.
@@ -491,27 +548,40 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 		int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 		unsigned int buflen;
 		void *buf;
+		void *ctx;
 		int off;
 
-		buf = virtqueue_get_buf(rq->vq, &buflen);
+		buf = virtqueue_get_buf_ctx(rq->vq, &buflen, &ctx);
 		if (unlikely(!buf))
 			goto err_buf;
 
-		p = virt_to_head_page(buf);
-		off = buf - page_address(p);
-
 		/* guard against a misconfigured or uncooperative backend that
 		 * is sending packet larger than the MTU.
 		 */
 		if ((page_off + buflen + tailroom) > PAGE_SIZE) {
-			put_page(p);
+			virtnet_rx_put_buf(buf, ctx);
 			goto err_buf;
 		}
 
-		memcpy(page_address(page) + page_off,
-		       page_address(p) + off, buflen);
+		if (is_xsk_ctx(ctx)) {
+			struct virtnet_xsk_ctx_rx *xsk_ctx;
+
+			xsk_ctx = (struct virtnet_xsk_ctx_rx *)buf;
+
+			virtnet_xsk_ctx_rx_copy(xsk_ctx,
+						page_address(page) + page_off,
+						buflen, true);
+
+			virtnet_xsk_ctx_rx_put(xsk_ctx);
+		} else {
+			p = virt_to_head_page(buf);
+			off = buf - page_address(p);
+
+			memcpy(page_address(page) + page_off,
+			       page_address(p) + off, buflen);
+			put_page(p);
+		}
 		page_off += buflen;
-		put_page(p);
 	}
 
 	/* Headroom does not contribute to packet length */
@@ -522,17 +592,16 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
 	return NULL;
 }
 
-static void merge_drop_follow_bufs(struct net_device *dev,
-				   struct receive_queue *rq,
-				   u16 num_buf,
-				   struct virtnet_rq_stats *stats)
+void merge_drop_follow_bufs(struct net_device *dev,
+			    struct receive_queue *rq,
+			    u16 num_buf,
+			    struct virtnet_rq_stats *stats)
 {
-	struct page *page;
 	unsigned int len;
-	void *buf;
+	void *buf, *ctx;
 
 	while (num_buf-- > 1) {
-		buf = virtqueue_get_buf(rq->vq, &len);
+		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
 		if (unlikely(!buf)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
@@ -540,23 +609,80 @@ static void merge_drop_follow_bufs(struct net_device *dev,
 			break;
 		}
 		stats->bytes += len;
-		page = virt_to_head_page(buf);
-		put_page(page);
+		virtnet_rx_put_buf(buf, ctx);
+	}
+}
+
+static char *merge_get_follow_buf(struct net_device *dev,
+				  struct receive_queue *rq,
+				  int *plen, int *ptruesize,
+				  int index, int total)
+{
+	struct virtnet_xsk_ctx_rx *xsk_ctx;
+	unsigned int truesize;
+	char *buf;
+	void *ctx;
+	int len;
+
+	buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
+	if (unlikely(!buf)) {
+		pr_debug("%s: rx error: %d buffers out of %d missing\n",
+			 dev->name, index, total);
+		dev->stats.rx_length_errors++;
+		return NULL;
+	}
+
+	if (is_xsk_ctx(ctx)) {
+		xsk_ctx = (struct virtnet_xsk_ctx_rx *)buf;
+
+		if (unlikely(len > xsk_ctx->ctx.head->truesize)) {
+			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
+				 dev->name, len, (unsigned long)ctx);
+			dev->stats.rx_length_errors++;
+			virtnet_xsk_ctx_rx_put(xsk_ctx);
+			return ERR_PTR(-EDQUOT);
+		}
+
+		truesize = len;
+
+		buf = virtnet_alloc_frag(rq, truesize, GFP_ATOMIC);
+		if (unlikely(!buf)) {
+			virtnet_xsk_ctx_rx_put(xsk_ctx);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		virtnet_xsk_ctx_rx_copy(xsk_ctx, buf, len, true);
+		virtnet_xsk_ctx_rx_put(xsk_ctx);
+	} else {
+		truesize = mergeable_ctx_to_truesize(ctx);
+		if (unlikely(len > truesize)) {
+			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
+				 dev->name, len, (unsigned long)ctx);
+			dev->stats.rx_length_errors++;
+
+			put_page(virt_to_head_page(buf));
+			return ERR_PTR(-EDQUOT);
+		}
 	}
+
+	*plen = len;
+	*ptruesize = truesize;
+
+	return buf;
 }
 
-static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
-						 struct virtnet_info *vi,
-						 struct receive_queue *rq,
-						 struct sk_buff *head_skb,
-						 u16 num_buf,
-						 struct virtnet_rq_stats *stats)
+struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
+					  struct virtnet_info *vi,
+					  struct receive_queue *rq,
+					  struct sk_buff *head_skb,
+					  u16 num_buf,
+					  struct virtnet_rq_stats *stats)
 {
 	struct sk_buff *curr_skb;
 	unsigned int truesize;
 	unsigned int len, num;
 	struct page *page;
-	void *buf, *ctx;
+	void *buf;
 	int offset;
 
 	curr_skb = head_skb;
@@ -565,25 +691,17 @@ static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
-		if (unlikely(!buf)) {
-			pr_debug("%s: rx error: %d buffers out of %d missing\n",
-				 dev->name, num_buf, num);
-			dev->stats.rx_length_errors++;
+		buf = merge_get_follow_buf(dev, rq, &len, &truesize,
+					   num_buf, num);
+		if (unlikely(!buf))
 			goto err_buf;
-		}
+
+		if (IS_ERR(buf))
+			goto err_drop;
 
 		stats->bytes += len;
 		page = virt_to_head_page(buf);
 
-		truesize = mergeable_ctx_to_truesize(ctx);
-		if (unlikely(len > truesize)) {
-			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
-				 dev->name, len, (unsigned long)ctx);
-			dev->stats.rx_length_errors++;
-			goto err_skb;
-		}
-
 		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
 		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
 			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
@@ -618,6 +736,7 @@ static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
 
 err_skb:
 	put_page(page);
+err_drop:
 	merge_drop_follow_bufs(dev, rq, num_buf, stats);
 err_buf:
 	stats->drops++;
@@ -982,16 +1101,18 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
 		if (vi->mergeable_rx_bufs) {
-			put_page(virt_to_head_page(buf));
+			virtnet_rx_put_buf(buf, ctx);
 		} else if (vi->big_packets) {
 			give_pages(rq, buf);
 		} else {
-			put_page(virt_to_head_page(buf));
+			virtnet_rx_put_buf(buf, ctx);
 		}
 		return;
 	}
 
-	if (vi->mergeable_rx_bufs)
+	if (is_xsk_ctx(ctx))
+		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
+	else if (vi->mergeable_rx_bufs)
 		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
 					stats);
 	else if (vi->big_packets)
@@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 	int err;
 	bool oom;
 
+	/* Because virtio-net does not yet support flow steering, every rx
+	 * queue must also handle non-xsk packets. If no buf is available
+	 * from the xsk pool for the moment, fall back to allocating normal
+	 * memory for the queue.
+	 */
+	if (fill_recv_xsk(vi, rq, gfp))
+		goto kick;
+
 	do {
 		if (vi->mergeable_rx_bufs)
 			err = add_recvbuf_mergeable(vi, rq, gfp);
@@ -1187,6 +1316,8 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 		if (err)
 			break;
 	} while (rq->vq->num_free);
+
+kick:
 	if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
 		unsigned long flags;
 
@@ -2575,7 +2706,7 @@ static void free_receive_page_frags(struct virtnet_info *vi)
 
 static void free_unused_bufs(struct virtnet_info *vi)
 {
-	void *buf;
+	void *buf, *ctx;
 	int i;
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
@@ -2593,14 +2724,13 @@ static void free_unused_bufs(struct virtnet_info *vi)
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		struct virtqueue *vq = vi->rq[i].vq;
 
-		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs) {
-				put_page(virt_to_head_page(buf));
-			} else if (vi->big_packets) {
+		while ((buf = virtqueue_detach_unused_buf_ctx(vq, &ctx)) != NULL) {
+			if (vi->mergeable_rx_bufs)
+				virtnet_rx_put_buf(buf, ctx);
+			else if (vi->big_packets)
 				give_pages(&vi->rq[i], buf);
-			} else {
-				put_page(virt_to_head_page(buf));
-			}
+			else
+				virtnet_rx_put_buf(buf, ctx);
 		}
 	}
 }
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index e3da829887dc..70af880f469d 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -177,8 +177,23 @@ struct receive_queue {
 	char name[40];
 
 	struct xdp_rxq_info xdp_rxq;
+
+	struct {
+		struct xsk_buff_pool __rcu *pool;
+
+		/* xdp rxq used by xsk */
+		struct xdp_rxq_info xdp_rxq;
+
+		/* ctx used to record the page added to vq */
+		struct virtnet_xsk_ctx_head *ctx_head;
+	} xsk;
 };
 
+static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb)
+{
+	return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
+}
+
 static inline bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
 {
 	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
@@ -258,4 +273,16 @@ static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
 	if (xsknum)
 		virtnet_xsk_complete(sq, xsknum);
 }
+
+int virtnet_run_xdp(struct net_device *dev, struct bpf_prog *xdp_prog,
+		    struct xdp_buff *xdp, unsigned int *xdp_xmit,
+		    struct virtnet_rq_stats *stats);
+struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
+					  struct virtnet_info *vi,
+					  struct receive_queue *rq,
+					  struct sk_buff *head_skb,
+					  u16 num_buf,
+					  struct virtnet_rq_stats *stats);
+void merge_drop_follow_bufs(struct net_device *dev, struct receive_queue *rq,
+			    u16 num_buf, struct virtnet_rq_stats *stats);
 #endif
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index f98b68576709..36cda2dcf8e7 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -20,6 +20,75 @@ static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *
 }
 
 #define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
+#define virtnet_xsk_ctx_rx_get(head) ((struct virtnet_xsk_ctx_rx *)virtnet_xsk_ctx_get(head))
+
+static unsigned int virtnet_receive_buf_num(struct virtnet_info *vi, char *buf)
+{
+	struct virtio_net_hdr_mrg_rxbuf *hdr;
+
+	if (vi->mergeable_rx_bufs) {
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)buf;
+		return virtio16_to_cpu(vi->vdev, hdr->num_buffers);
+	}
+
+	return 1;
+}
+
+/* when the xsk rx ctx references two pages, copy to dst from both pages */
+static void virtnet_xsk_rx_ctx_merge(struct virtnet_xsk_ctx_rx *ctx,
+				     char *dst, unsigned int len)
+{
+	unsigned int size;
+	int offset;
+	char *src;
+
+	/* data starts in the first page */
+	if (ctx->offset >= 0) {
+		offset = ctx->offset;
+
+		size = min_t(int, PAGE_SIZE - offset, len);
+		src = page_address(ctx->ctx.page) + offset;
+		memcpy(dst, src, size);
+
+		if (len > size) {
+			src = page_address(ctx->ctx.page_unaligned);
+			memcpy(dst + size, src, len - size);
+		}
+
+	} else {
+		offset = -ctx->offset;
+
+		src = page_address(ctx->ctx.page_unaligned) + offset;
+
+		memcpy(dst, src, len);
+	}
+}
+
+/* copy ctx to dst; the caller must make sure that len is safe */
+void virtnet_xsk_ctx_rx_copy(struct virtnet_xsk_ctx_rx *ctx,
+			     char *dst, unsigned int len,
+			     bool hdr)
+{
+	char *src;
+	int size;
+
+	if (hdr) {
+		size = min_t(int, ctx->ctx.head->hdr_len, len);
+		memcpy(dst, &ctx->hdr, size);
+		len -= size;
+		if (!len)
+			return;
+		dst += size;
+	}
+
+	if (!ctx->ctx.page_unaligned) {
+		src = page_address(ctx->ctx.page) + ctx->offset;
+		memcpy(dst, src, len);
+
+	} else {
+		virtnet_xsk_rx_ctx_merge(ctx, dst, len);
+	}
+}
 
 static void virtnet_xsk_check_queue(struct send_queue *sq)
 {
@@ -45,6 +114,267 @@ static void virtnet_xsk_check_queue(struct send_queue *sq)
 		netif_stop_subqueue(dev, qnum);
 }
 
+static struct sk_buff *virtnet_xsk_construct_skb_xdp(struct receive_queue *rq,
+						     struct xdp_buff *xdp)
+{
+	unsigned int metasize = xdp->data - xdp->data_meta;
+	struct sk_buff *skb;
+	unsigned int size;
+
+	size = xdp->data_end - xdp->data_hard_start;
+	skb = napi_alloc_skb(&rq->napi, size);
+	if (unlikely(!skb))
+		return NULL;
+
+	skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
+
+	size = xdp->data_end - xdp->data_meta;
+	memcpy(__skb_put(skb, size), xdp->data_meta, size);
+
+	if (metasize) {
+		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
+	}
+
+	return skb;
+}
+
+static struct sk_buff *virtnet_xsk_construct_skb_ctx(struct net_device *dev,
+						     struct virtnet_info *vi,
+						     struct receive_queue *rq,
+						     struct virtnet_xsk_ctx_rx *ctx,
+						     unsigned int len,
+						     struct virtnet_rq_stats *stats)
+{
+	struct virtio_net_hdr_mrg_rxbuf *hdr;
+	struct sk_buff *skb;
+	int num_buf;
+	char *dst;
+
+	len -= vi->hdr_len;
+
+	skb = napi_alloc_skb(&rq->napi, len);
+	if (unlikely(!skb))
+		return NULL;
+
+	dst = __skb_put(skb, len);
+
+	virtnet_xsk_ctx_rx_copy(ctx, dst, len, false);
+
+	num_buf = virtnet_receive_buf_num(vi, (char *)&ctx->hdr);
+	if (num_buf > 1) {
+		skb = merge_receive_follow_bufs(dev, vi, rq, skb, num_buf,
+						stats);
+		if (!skb)
+			return NULL;
+	}
+
+	hdr = skb_vnet_hdr(skb);
+	memcpy(hdr, &ctx->hdr, vi->hdr_len);
+
+	return skb;
+}
+
+/* len does not include the virtio-net hdr */
+static struct xdp_buff *virtnet_xsk_check_xdp(struct virtnet_info *vi,
+					      struct receive_queue *rq,
+					      struct virtnet_xsk_ctx_rx *ctx,
+					      struct xdp_buff *_xdp,
+					      unsigned int len)
+{
+	struct xdp_buff *xdp;
+	struct page *page;
+	int frame_sz;
+	char *data;
+
+	if (ctx->ctx.head->active) {
+		xdp = ctx->xdp;
+		xdp->data_end = xdp->data + len;
+
+		return xdp;
+	}
+
+	/* ctx->xdp is invalid because it has already been released */
+
+	if (!ctx->ctx.page_unaligned) {
+		data = page_address(ctx->ctx.page) + ctx->offset;
+		page = ctx->ctx.page;
+	} else {
+		page = alloc_page(GFP_ATOMIC);
+		if (!page)
+			return NULL;
+
+		data = page_address(page) + ctx->headroom;
+
+		virtnet_xsk_rx_ctx_merge(ctx, data, len);
+
+		put_page(ctx->ctx.page);
+		put_page(ctx->ctx.page_unaligned);
+
+		/* the new page will be put when the ctx is put */
+		ctx->ctx.page = page;
+		ctx->ctx.page_unaligned = NULL;
+	}
+
+	/* If xdp consumes the data via XDP_REDIRECT/XDP_TX, the page ref
+	 * will be dropped along that path, so take an extra reference here.
+	 *
+	 * If the xdp buff has been consumed, that path drops the page ref
+	 * automatically and virtnet_xsk_ctx_rx_put() drops the other one.
+	 *
+	 * If the xdp buff has not been consumed, put_page() is called once
+	 * manually before virtnet_xsk_ctx_rx_put().
+	 */
+	get_page(page);
+
+	xdp = _xdp;
+
+	frame_sz = ctx->ctx.head->frame_size + ctx->headroom;
+
+	/* use xdp rxq without MEM_TYPE_XSK_BUFF_POOL */
+	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(xdp, data - ctx->headroom, ctx->headroom, len, true);
+
+	return xdp;
+}
+
+int add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue *rq,
+		    struct xsk_buff_pool *pool, gfp_t gfp)
+{
+	struct page *page, *page_start, *page_end;
+	unsigned long data, data_end, data_start;
+	struct virtnet_xsk_ctx_rx *ctx;
+	struct xdp_buff *xsk_xdp;
+	int err, size, n;
+	u32 offset;
+
+	xsk_xdp = xsk_buff_alloc(pool);
+	if (!xsk_xdp)
+		return -ENOMEM;
+
+	ctx = virtnet_xsk_ctx_rx_get(rq->xsk.ctx_head);
+
+	ctx->xdp = xsk_xdp;
+	ctx->headroom = xsk_xdp->data - xsk_xdp->data_hard_start;
+
+	offset = offset_in_page(xsk_xdp->data);
+
+	data_start = (unsigned long)xsk_xdp->data_hard_start;
+	data       = (unsigned long)xsk_xdp->data;
+	data_end   = data + ctx->ctx.head->frame_size - 1;
+
+	page_start = vmalloc_to_page((void *)data_start);
+
+	ctx->ctx.page = page_start;
+	get_page(page_start);
+
+	if ((data_end & PAGE_MASK) == (data_start & PAGE_MASK)) {
+		page_end = page_start;
+		page = page_start;
+		ctx->offset = offset;
+
+		ctx->ctx.page_unaligned = NULL;
+		n = 2;
+	} else {
+		page_end = vmalloc_to_page((void *)data_end);
+
+		ctx->ctx.page_unaligned = page_end;
+		get_page(page_end);
+
+		if ((data_start & PAGE_MASK) == (data & PAGE_MASK)) {
+			page = page_start;
+			ctx->offset = offset;
+			n = 3;
+		} else {
+			page = page_end;
+			ctx->offset = -offset;
+			n = 2;
+		}
+	}
+
+	size = min_t(int, PAGE_SIZE - offset, ctx->ctx.head->frame_size);
+
+	sg_init_table(rq->sg, n);
+	sg_set_buf(rq->sg, &ctx->hdr, vi->hdr_len);
+	sg_set_page(rq->sg + 1, page, size, offset);
+
+	if (n == 3) {
+		size = ctx->ctx.head->frame_size - size;
+		sg_set_page(rq->sg + 2, page_end, size, 0);
+	}
+
+	err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, n, ctx,
+				      VIRTNET_XSK_BUFF_CTX, gfp);
+	if (err < 0)
+		virtnet_xsk_ctx_rx_put(ctx);
+
+	return err;
+}
+
+struct sk_buff *receive_xsk(struct net_device *dev, struct virtnet_info *vi,
+			    struct receive_queue *rq, void *buf,
+			    unsigned int len, unsigned int *xdp_xmit,
+			    struct virtnet_rq_stats *stats)
+{
+	struct virtnet_xsk_ctx_rx *ctx;
+	struct xsk_buff_pool *pool;
+	struct sk_buff *skb = NULL;
+	struct xdp_buff *xdp, _xdp;
+	struct bpf_prog *xdp_prog;
+	u16 num_buf = 1;
+	int ret;
+
+	ctx = (struct virtnet_xsk_ctx_rx *)buf;
+
+	rcu_read_lock();
+
+	pool     = rcu_dereference(rq->xsk.pool);
+	xdp_prog = rcu_dereference(rq->xdp_prog);
+	if (!pool || !xdp_prog)
+		goto skb;
+
+	/* this may happen when the xsk chunk size is too small. */
+	num_buf = virtnet_receive_buf_num(vi, (char *)&ctx->hdr);
+	if (num_buf > 1)
+		goto drop;
+
+	xdp = virtnet_xsk_check_xdp(vi, rq, ctx, &_xdp, len - vi->hdr_len);
+	if (!xdp)
+		goto drop;
+
+	ret = virtnet_run_xdp(dev, xdp_prog, xdp, xdp_xmit, stats);
+	if (unlikely(ret)) {
+		/* pairs with the get_page() inside virtnet_xsk_check_xdp() */
+		if (!ctx->ctx.head->active)
+			put_page(ctx->ctx.page);
+
+		if (unlikely(ret < 0))
+			goto drop;
+
+		/* XDP_PASS */
+		skb = virtnet_xsk_construct_skb_xdp(rq, xdp);
+	} else {
+		/* ctx->xdp has been consumed */
+		ctx->xdp = NULL;
+	}
+
+end:
+	virtnet_xsk_ctx_rx_put(ctx);
+	rcu_read_unlock();
+	return skb;
+
+skb:
+	skb = virtnet_xsk_construct_skb_ctx(dev, vi, rq, ctx, len, stats);
+	goto end;
+
+drop:
+	stats->drops++;
+
+	if (num_buf > 1)
+		merge_drop_follow_bufs(dev, rq, num_buf, stats);
+	goto end;
+}
+
 void virtnet_xsk_complete(struct send_queue *sq, u32 num)
 {
 	struct xsk_buff_pool *pool;
@@ -238,15 +568,20 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
 	return 0;
 }
 
-static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
-							  struct virtqueue *vq)
+static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct virtnet_info *vi,
+							  struct xsk_buff_pool *pool,
+							  struct virtqueue *vq,
+							  bool rx)
 {
 	struct virtnet_xsk_ctx_head *head;
 	u32 size, n, ring_size, ctx_sz;
 	struct virtnet_xsk_ctx *ctx;
 	void *p;
 
-	ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
+	if (rx)
+		ctx_sz = sizeof(struct virtnet_xsk_ctx_rx);
+	else
+		ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
 
 	ring_size = virtqueue_get_vring_size(vq);
 	size = sizeof(*head) + ctx_sz * ring_size;
@@ -259,6 +594,8 @@ static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *
 
 	head->active = true;
 	head->frame_size = xsk_pool_get_rx_frame_size(pool);
+	head->hdr_len = vi->hdr_len;
+	head->truesize = head->frame_size + vi->hdr_len;
 
 	p = head + 1;
 	for (n = 0; n < ring_size; ++n) {
@@ -278,12 +615,15 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 				   u16 qid)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct receive_queue *rq;
 	struct send_queue *sq;
+	int err;
 
 	if (qid >= vi->curr_queue_pairs)
 		return -EINVAL;
 
 	sq = &vi->sq[qid];
+	rq = &vi->rq[qid];
 
 	/* xsk zerocopy depend on the tx napi.
 	 *
@@ -295,31 +635,68 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 
 	memset(&sq->xsk, 0, sizeof(sq->xsk));
 
-	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
+	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(vi, pool, sq->vq, false);
 	if (!sq->xsk.ctx_head)
 		return -ENOMEM;
 
+	/* In big_packets mode, xdp cannot work, so there is no need to
+	 * initialize the xsk part of the rq.
+	 */
+	if (!vi->big_packets || vi->mergeable_rx_bufs) {
+		err = xdp_rxq_info_reg(&rq->xsk.xdp_rxq, dev, qid,
+				       rq->napi.napi_id);
+		if (err < 0)
+			goto err;
+
+		err = xdp_rxq_info_reg_mem_model(&rq->xsk.xdp_rxq,
+						 MEM_TYPE_XSK_BUFF_POOL, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
+			goto err;
+		}
+
+		rq->xsk.ctx_head = virtnet_xsk_ctx_alloc(vi, pool, rq->vq, true);
+		if (!rq->xsk.ctx_head) {
+			err = -ENOMEM;
+			goto err;
+		}
+
+		xsk_pool_set_rxq_info(pool, &rq->xsk.xdp_rxq);
+
+		/* Here is already protected by rtnl_lock, so rcu_assign_pointer
+		 * is safe.
+		 */
+		rcu_assign_pointer(rq->xsk.pool, pool);
+	}
+
 	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
 	 * safe.
 	 */
 	rcu_assign_pointer(sq->xsk.pool, pool);
 
 	return 0;
+
+err:
+	kfree(sq->xsk.ctx_head);
+	return err;
 }
 
 static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct receive_queue *rq;
 	struct send_queue *sq;
 
 	if (qid >= vi->curr_queue_pairs)
 		return -EINVAL;
 
 	sq = &vi->sq[qid];
+	rq = &vi->rq[qid];
 
 	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
 	 * safe.
 	 */
+	rcu_assign_pointer(rq->xsk.pool, NULL);
 	rcu_assign_pointer(sq->xsk.pool, NULL);
 
 	/* Sync with the XSK wakeup and with NAPI. */
@@ -332,6 +709,17 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
 
 	sq->xsk.ctx_head = NULL;
 
+	if (!vi->big_packets || vi->mergeable_rx_bufs) {
+		if (READ_ONCE(rq->xsk.ctx_head->ref))
+			WRITE_ONCE(rq->xsk.ctx_head->active, false);
+		else
+			kfree(rq->xsk.ctx_head);
+
+		rq->xsk.ctx_head = NULL;
+
+		xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index 54948e0b07fc..fe22cf78d505 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -5,6 +5,8 @@
 
 #define VIRTIO_XSK_FLAG	BIT(1)
 
+#define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
+
 /* When xsk disable, under normal circumstances, the network card must reclaim
  * all the memory that has been sent and the memory added to the rq queue by
  * destroying the queue.
@@ -36,6 +38,8 @@ struct virtnet_xsk_ctx_head {
 	u64 ref;
 
 	unsigned int frame_size;
+	unsigned int truesize;
+	unsigned int hdr_len;
 
 	/* the xsk status */
 	bool active;
@@ -59,6 +63,28 @@ struct virtnet_xsk_ctx_tx {
 	u32 len;
 };
 
+struct virtnet_xsk_ctx_rx {
+	/* this *MUST* be the first member */
+	struct virtnet_xsk_ctx ctx;
+
+	/* xdp buff obtained from the xsk pool */
+	struct xdp_buff *xdp;
+
+	/* offset of xdp.data inside its page */
+	int offset;
+
+	/* xsk xdp headroom */
+	unsigned int headroom;
+
+	/* keep the virtio hdr here so it does not occupy space in the xsk frame */
+	struct virtio_net_hdr_mrg_rxbuf hdr;
+};
+
+static inline bool is_xsk_ctx(void *ctx)
+{
+	return ctx == VIRTNET_XSK_BUFF_CTX;
+}
+
 static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
 {
 	return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
@@ -92,8 +118,57 @@ static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
 #define virtnet_xsk_ctx_tx_put(ctx) \
 	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
 
+static inline void virtnet_xsk_ctx_rx_put(struct virtnet_xsk_ctx_rx *ctx)
+{
+	if (ctx->xdp && ctx->ctx.head->active)
+		xsk_buff_free(ctx->xdp);
+
+	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx);
+}
+
+static inline void virtnet_rx_put_buf(char *buf, void *ctx)
+{
+	if (is_xsk_ctx(ctx))
+		virtnet_xsk_ctx_rx_put((struct virtnet_xsk_ctx_rx *)buf);
+	else
+		put_page(virt_to_head_page(buf));
+}
+
+void virtnet_xsk_ctx_rx_copy(struct virtnet_xsk_ctx_rx *ctx,
+			     char *dst, unsigned int len, bool hdr);
+int add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue *rq,
+		    struct xsk_buff_pool *pool, gfp_t gfp);
+struct sk_buff *receive_xsk(struct net_device *dev, struct virtnet_info *vi,
+			    struct receive_queue *rq, void *buf,
+			    unsigned int len, unsigned int *xdp_xmit,
+			    struct virtnet_rq_stats *stats);
 int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
 int virtnet_poll_xsk(struct send_queue *sq, int budget);
 void virtnet_xsk_complete(struct send_queue *sq, u32 num);
 int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
+
+static inline bool fill_recv_xsk(struct virtnet_info *vi,
+				 struct receive_queue *rq,
+				 gfp_t gfp)
+{
+	struct xsk_buff_pool *pool;
+	int err = 0;
+
+	rcu_read_lock();
+	pool = rcu_dereference(rq->xsk.pool);
+	if (pool) {
+		while (rq->vq->num_free >= 3) {
+			err = add_recvbuf_xsk(vi, rq, pool, gfp);
+			if (err)
+				break;
+		}
+	} else {
+		rcu_read_unlock();
+		return false;
+	}
+	rcu_read_unlock();
+
+	return err != -ENOMEM;
+}
+
 #endif
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Calling virtqueue_napi_schedule() in wakeup makes napi run on the
current cpu. If the application is not busy, that is not a problem. But
if the application itself is busy, it causes a lot of scheduling.

If the application keeps sending packets, the constant scheduling
between the application and napi makes transmission uneven, with a
clearly visible delay (you can see it with tcpdump). When one channel
is pushed to 100% (vhost reaches 100%), the cpu where the application
runs also reaches 100%.

This patch sends a small amount of data directly in wakeup. The purpose
is to trigger the tx interrupt: the interrupt fires on the cpu of its
affinity and schedules napi there, so napi keeps consuming the xsk tx
queue. With this change two cpus share the work: cpu0 runs the
application and cpu1 runs napi. Pushing one channel to 100% as before,
cpu0 utilization is 12.7% and cpu1 utilization is 2.9%.
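
Condensed, the reworked wakeup path in the diff below amounts to the
following sketch (kernel-style pseudocode distilled from the hunk, not
meant to build on its own; the netif_running/qid checks are trimmed):

	rcu_read_lock();
	pool = rcu_dereference(sq->xsk.pool);

	/* napi already running or scheduled: just note the missed work */
	if (!pool || napi_if_scheduled_mark_missed(&sq->napi))
		goto out;

	/* otherwise transmit a little inline under the tx lock; the tx
	 * interrupt this triggers re-schedules napi on the cpu of the
	 * irq's affinity instead of on the application's cpu
	 */
	txq = netdev_get_tx_queue(dev, qid);
	__netif_tx_lock_bh(txq);
	virtnet_xsk_run(sq, pool, sq->napi.weight, false);
	__netif_tx_unlock_bh(txq);
out:
	rcu_read_unlock();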

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio/xsk.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 36cda2dcf8e7..3973c82d1ad2 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct xsk_buff_pool *pool;
+	struct netdev_queue *txq;
 	struct send_queue *sq;
 
 	if (!netif_running(dev))
@@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
 
 	rcu_read_lock();
 	pool = rcu_dereference(sq->xsk.pool);
-	if (pool) {
-		local_bh_disable();
-		virtqueue_napi_schedule(&sq->napi, sq->vq);
-		local_bh_enable();
-	}
+	if (!pool)
+		goto end;
+
+	if (napi_if_scheduled_mark_missed(&sq->napi))
+		goto end;
+
+	txq = netdev_get_tx_queue(dev, qid);
+
+	__netif_tx_lock_bh(txq);
+
+	/* Send some packets directly to reduce the transmit delay; this
+	 * also actively triggers the tx interrupt.
+	 *
+	 * If no packet can be sent out, the device ring is full. In that
+	 * case we will still get a tx interrupt and will handle the
+	 * remaining transmit work from there.
+	 */
+	virtnet_xsk_run(sq, pool, sq->napi.weight, false);
+
+	__netif_tx_unlock_bh(txq);
+
+end:
 	rcu_read_unlock();
 	return 0;
 }
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
@ 2021-06-10  8:22   ` Xuan Zhuo
  0 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

Calling virtqueue_napi_schedule() in wakeup makes napi run on the
current cpu. If the application is not busy, that is not a problem. But
if the application itself is busy, it causes a lot of scheduling.

If the application keeps sending packets, the constant scheduling
between the application and napi makes transmission uneven, with a
clearly visible delay (you can see it with tcpdump). When one channel
is pushed to 100% (vhost reaches 100%), the cpu where the application
runs also reaches 100%.

This patch sends a small amount of data directly in wakeup. The purpose
is to trigger the tx interrupt: the interrupt fires on the cpu of its
affinity and schedules napi there, so napi keeps consuming the xsk tx
queue. With this change two cpus share the work: cpu0 runs the
application and cpu1 runs napi. Pushing one channel to 100% as before,
cpu0 utilization is 12.7% and cpu1 utilization is 2.9%.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio/xsk.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 36cda2dcf8e7..3973c82d1ad2 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct xsk_buff_pool *pool;
+	struct netdev_queue *txq;
 	struct send_queue *sq;
 
 	if (!netif_running(dev))
@@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
 
 	rcu_read_lock();
 	pool = rcu_dereference(sq->xsk.pool);
-	if (pool) {
-		local_bh_disable();
-		virtqueue_napi_schedule(&sq->napi, sq->vq);
-		local_bh_enable();
-	}
+	if (!pool)
+		goto end;
+
+	if (napi_if_scheduled_mark_missed(&sq->napi))
+		goto end;
+
+	txq = netdev_get_tx_queue(dev, qid);
+
+	__netif_tx_lock_bh(txq);
+
+	/* Send some packets directly to reduce the transmit delay; this
+	 * also actively triggers the tx interrupt.
+	 *
+	 * If no packet can be sent out, the device ring is full. In that
+	 * case we will still get a tx interrupt and will handle the
+	 * remaining transmit work from there.
+	 */
+	virtnet_xsk_run(sq, pool, sq->napi.weight, false);
+
+	__netif_tx_unlock_bh(txq);
+
+end:
 	rcu_read_unlock();
 	return 0;
 }
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
  2021-06-10  8:21 ` Xuan Zhuo
@ 2021-06-10  8:22   ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Xuan Zhuo, virtualization, bpf,
	dust . li

Testing shows that kicking the device after every packet gives unstable
performance, and kicking only once after all packets have been queued
is not good either. So add a module parameter that specifies after how
many queued packets a kick is issued.

8 is a relatively stable value with the best performance.

Here is the pps measured for xsk_kick_thr values from 1 to 64.

thr  PPS             thr PPS             thr PPS
1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
21   2845655.55789 | 43  2752951.93264 |
22   3450389.84365 | 44  2753107.27164 |

When xsk_kick_thr is small the performance is poor, and for values
greater than 13 it becomes more irregular and unstable. The results
look similar from 3 to 13, so 8 was chosen as the default value.

The test environment is qemu + vhost-net. I modified vhost-net to drop
the packets sent by the vm directly, so that the vm's cpu can be pushed
higher. Otherwise the cpu usage of the application in the vm and of
softirqd stays too low and the test data show no obvious difference.

During the test, the softirq cpu reached 100%. Each xsk_kick_thr value
was run for 300s, the pps of every second was recorded, and the average
pps was taken. The vhost process cpu on the host also reached 100%.
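
In other words, the xmit loop keeps a running count of queued packets
and kicks the device each time the count passes xsk_kick_thr, plus one
final kick for whatever is left over. Roughly (condensed from the diff
below, illustrative only):

	int need_kick = 0;

	while (budget-- > 0) {
		/* ... reserve descriptors and queue one xsk packet ... */

		if (++need_kick > xsk_kick_thr) {
			if (virtqueue_kick_prepare(sq->vq) &&
			    virtqueue_notify(sq->vq))
				++stats->kicks;
			need_kick = 0;
		}
	}

	/* flush the remainder so the device never waits for a threshold
	 * that will not be reached
	 */
	if (need_kick && virtqueue_kick_prepare(sq->vq) &&
	    virtqueue_notify(sq->vq))
		++stats->kicks;

Since the parameter is registered with mode 0644, it should also be
tunable at runtime via /sys/module/<module name>/parameters/xsk_kick_thr
(assuming the usual module_param sysfs behaviour); the value of 8 is
only the default.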

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 drivers/net/virtio/virtio_net.c |  1 +
 drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
 drivers/net/virtio/xsk.h        |  2 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 9503133e71f0..dfe509939b45 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(xsk_kick_thr, int, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 3973c82d1ad2..2f3ba6ab4798 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -5,6 +5,8 @@
 
 #include "virtio_net.h"
 
+int xsk_kick_thr = 8;
+
 static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
 
 static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
@@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 	struct xdp_desc desc;
 	int err, packet = 0;
 	int ret = -EAGAIN;
+	int need_kick = 0;
 
 	while (budget-- > 0) {
 		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
@@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 		}
 
 		++packet;
+		++need_kick;
+		if (need_kick > xsk_kick_thr) {
+			if (virtqueue_kick_prepare(sq->vq) &&
+			    virtqueue_notify(sq->vq))
+				++stats->kicks;
+
+			need_kick = 0;
+		}
 	}
 
 	if (packet) {
-		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
-			++stats->kicks;
+		if (need_kick) {
+			if (virtqueue_kick_prepare(sq->vq) &&
+			    virtqueue_notify(sq->vq))
+				++stats->kicks;
+		}
 
 		*done += packet;
 		stats->xdp_tx += packet;
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index fe22cf78d505..4f0f4f9cf23b 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -7,6 +7,8 @@
 
 #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
 
+extern int xsk_kick_thr;
+
 /* When xsk disable, under normal circumstances, the network card must reclaim
  * all the memory that has been sent and the memory added to the rq queue by
  * destroying the queue.
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
@ 2021-06-10  8:22   ` Xuan Zhuo
  0 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-10  8:22 UTC (permalink / raw)
  To: netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

Testing shows that kicking the device after every packet gives unstable
performance, and kicking only once after all packets have been queued
is not good either. So add a module parameter that specifies after how
many queued packets a kick is issued.

8 is a relatively stable value with the best performance.

Here is the pps measured for xsk_kick_thr values from 1 to 64.

thr  PPS             thr PPS             thr PPS
1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
21   2845655.55789 | 43  2752951.93264 |
22   3450389.84365 | 44  2753107.27164 |

When xsk_kick_thr is small the performance is poor, and for values
greater than 13 it becomes more irregular and unstable. The results
look similar from 3 to 13, so 8 was chosen as the default value.

The test environment is qemu + vhost-net. I modified vhost-net to drop
the packets sent by the vm directly, so that the vm's cpu can be pushed
higher. Otherwise the cpu usage of the application in the vm and of
softirqd stays too low and the test data show no obvious difference.

During the test, the softirq cpu reached 100%. Each xsk_kick_thr value
was run for 300s, the pps of every second was recorded, and the average
pps was taken. The vhost process cpu on the host also reached 100%.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 drivers/net/virtio/virtio_net.c |  1 +
 drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
 drivers/net/virtio/xsk.h        |  2 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
index 9503133e71f0..dfe509939b45 100644
--- a/drivers/net/virtio/virtio_net.c
+++ b/drivers/net/virtio/virtio_net.c
@@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(xsk_kick_thr, int, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 3973c82d1ad2..2f3ba6ab4798 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -5,6 +5,8 @@
 
 #include "virtio_net.h"
 
+int xsk_kick_thr = 8;
+
 static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
 
 static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
@@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 	struct xdp_desc desc;
 	int err, packet = 0;
 	int ret = -EAGAIN;
+	int need_kick = 0;
 
 	while (budget-- > 0) {
 		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
@@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 		}
 
 		++packet;
+		++need_kick;
+		if (need_kick > xsk_kick_thr) {
+			if (virtqueue_kick_prepare(sq->vq) &&
+			    virtqueue_notify(sq->vq))
+				++stats->kicks;
+
+			need_kick = 0;
+		}
 	}
 
 	if (packet) {
-		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
-			++stats->kicks;
+		if (need_kick) {
+			if (virtqueue_kick_prepare(sq->vq) &&
+			    virtqueue_notify(sq->vq))
+				++stats->kicks;
+		}
 
 		*done += packet;
 		stats->xdp_tx += packet;
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index fe22cf78d505..4f0f4f9cf23b 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -7,6 +7,8 @@
 
 #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
 
+extern int xsk_kick_thr;
+
 /* When xsk disable, under normal circumstances, the network card must reclaim
  * all the memory that has been sent and the memory added to the rq queue by
  * destroying the queue.
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 06/15] virtio-net: unify the code for recycling the xmit ptr
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-16  2:42     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  2:42 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> When recycling old xmit pointers there are currently two types,
> "skb" and "xdp frame".
>
> Each of them has its own, almost identical handling code, which is
> inconvenient when new types are added later. So extract that code into
> a common function and call it wherever old xmit pointers are recovered.
>
> Rename free_old_xmit_skbs() to free_old_xmit().
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/virtio_net.c | 86 ++++++++++++++++++----------------------
>   1 file changed, 38 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6c1233f0ab3e..d791543a8dd8 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -264,6 +264,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>   }
>   
> +static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> +			    struct virtnet_sq_stats *stats)
> +{
> +	unsigned int len;
> +	void *ptr;
> +
> +	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		if (!is_xdp_frame(ptr)) {
> +			struct sk_buff *skb = ptr;
> +
> +			pr_debug("Sent skb %p\n", skb);
> +
> +			stats->bytes += skb->len;
> +			napi_consume_skb(skb, in_napi);
> +		} else {
> +			struct xdp_frame *frame = ptr_to_xdp(ptr);
> +
> +			stats->bytes += frame->len;
> +			xdp_return_frame(frame);
> +		}
> +		stats->packets++;
> +	}
> +}
> +
>   /* Converting between virtqueue no. and kernel tx/rx queue no.
>    * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>    */
> @@ -572,15 +596,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   			    int n, struct xdp_frame **frames, u32 flags)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
> +	struct virtnet_sq_stats stats = {};
>   	struct receive_queue *rq = vi->rq;
>   	struct bpf_prog *xdp_prog;
>   	struct send_queue *sq;
> -	unsigned int len;
> -	int packets = 0;
> -	int bytes = 0;
>   	int nxmit = 0;
>   	int kicks = 0;
> -	void *ptr;
>   	int ret;
>   	int i;
>   
> @@ -599,20 +620,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   	}
>   
>   	/* Free up any pending old buffers before queueing new ones. */
> -	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (likely(is_xdp_frame(ptr))) {
> -			struct xdp_frame *frame = ptr_to_xdp(ptr);
> -
> -			bytes += frame->len;
> -			xdp_return_frame(frame);
> -		} else {
> -			struct sk_buff *skb = ptr;
> -
> -			bytes += skb->len;
> -			napi_consume_skb(skb, false);
> -		}
> -		packets++;
> -	}
> +	__free_old_xmit(sq, false, &stats);
>   
>   	for (i = 0; i < n; i++) {
>   		struct xdp_frame *xdpf = frames[i];
> @@ -629,8 +637,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   	}
>   out:
>   	u64_stats_update_begin(&sq->stats.syncp);
> -	sq->stats.bytes += bytes;
> -	sq->stats.packets += packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.packets += stats.packets;
>   	sq->stats.xdp_tx += n;
>   	sq->stats.xdp_tx_drops += n - nxmit;
>   	sq->stats.kicks += kicks;
> @@ -1459,39 +1467,21 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>   	return stats.packets;
>   }
>   
> -static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
> +static void free_old_xmit(struct send_queue *sq, bool in_napi)
>   {
> -	unsigned int len;
> -	unsigned int packets = 0;
> -	unsigned int bytes = 0;
> -	void *ptr;
> +	struct virtnet_sq_stats stats = {};
>   
> -	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (likely(!is_xdp_frame(ptr))) {
> -			struct sk_buff *skb = ptr;
> -
> -			pr_debug("Sent skb %p\n", skb);
> -
> -			bytes += skb->len;
> -			napi_consume_skb(skb, in_napi);
> -		} else {
> -			struct xdp_frame *frame = ptr_to_xdp(ptr);
> -
> -			bytes += frame->len;
> -			xdp_return_frame(frame);
> -		}
> -		packets++;
> -	}
> +	__free_old_xmit(sq, in_napi, &stats);
>   
>   	/* Avoid overhead when no packets have been processed
>   	 * happens when called speculatively from start_xmit.
>   	 */
> -	if (!packets)
> +	if (!stats.packets)
>   		return;
>   
>   	u64_stats_update_begin(&sq->stats.syncp);
> -	sq->stats.bytes += bytes;
> -	sq->stats.packets += packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.packets += stats.packets;
>   	u64_stats_update_end(&sq->stats.syncp);
>   }
>   
> @@ -1516,7 +1506,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>   		return;
>   
>   	if (__netif_tx_trylock(txq)) {
> -		free_old_xmit_skbs(sq, true);
> +		free_old_xmit(sq, true);
>   		__netif_tx_unlock(txq);
>   	}
>   
> @@ -1601,7 +1591,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   
>   	txq = netdev_get_tx_queue(vi->dev, index);
>   	__netif_tx_lock(txq, raw_smp_processor_id());
> -	free_old_xmit_skbs(sq, true);
> +	free_old_xmit(sq, true);
>   	__netif_tx_unlock(txq);
>   
>   	virtqueue_napi_complete(napi, sq->vq, 0);
> @@ -1670,7 +1660,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>   	bool use_napi = sq->napi.weight;
>   
>   	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(sq, false);
> +	free_old_xmit(sq, false);
>   
>   	if (use_napi && kick)
>   		virtqueue_enable_cb_delayed(sq->vq);
> @@ -1714,7 +1704,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>   		if (!use_napi &&
>   		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>   			/* More just got used, free them then recheck. */
> -			free_old_xmit_skbs(sq, false);
> +			free_old_xmit(sq, false);
>   			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>   				netif_start_subqueue(dev, qnum);
>   				virtqueue_disable_cb(sq->vq);


^ permalink raw reply	[flat|nested] 80+ messages in thread
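
The unified __free_old_xmit() in the patch above works because transmitted pointers are tagged: xdp_frame pointers carry VIRTIO_XDP_FLAG in their low bit, so one completion loop can tell them apart from skbs. Here is a small self-contained user-space sketch of that low-bit tagging trick; the fake_* structs and the XDP_FLAG name are stand-ins, not the kernel's types.

/* User-space illustration of the low-bit pointer tagging that lets a
 * single completion loop distinguish skbs from XDP frames.  malloc()
 * returns aligned pointers, so bit 0 is free to use as a flag.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define XDP_FLAG 0x1UL

struct fake_skb  { int len; };
struct fake_xdpf { int len; };

static int is_xdp_frame(void *ptr)
{
	return (uintptr_t)ptr & XDP_FLAG;
}

static void *xdp_to_ptr(struct fake_xdpf *f)
{
	return (void *)((uintptr_t)f | XDP_FLAG);
}

static struct fake_xdpf *ptr_to_xdp(void *ptr)
{
	return (struct fake_xdpf *)((uintptr_t)ptr & ~XDP_FLAG);
}

int main(void)
{
	struct fake_skb *skb = malloc(sizeof(*skb));
	struct fake_xdpf *xdpf = malloc(sizeof(*xdpf));
	void *ring[2] = { skb, xdp_to_ptr(xdpf) };
	int bytes = 0;

	skb->len = 100;
	xdpf->len = 60;

	for (int i = 0; i < 2; i++) {
		void *ptr = ring[i];

		if (!is_xdp_frame(ptr))
			bytes += ((struct fake_skb *)ptr)->len;
		else
			bytes += ptr_to_xdp(ptr)->len;
	}
	printf("freed %d bytes\n", bytes);

	free(skb);
	free(xdpf);
	return 0;
}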

* Re: [PATCH net-next v5 07/15] virtio-net: standalone virtnet_aloc_frag function
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-16  2:45     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  2:45 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> This logic is shared by the small and mergeable modes when adding a
> buf, and a subsequent patch will use it as well, so separate it into an
> independent function.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/virtio_net.c | 29 ++++++++++++++++++++---------
>   1 file changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d791543a8dd8..3fd87bf2b2ad 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -264,6 +264,22 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>   }
>   
> +static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
> +				int gfp)
> +{
> +	struct page_frag *alloc_frag = &rq->alloc_frag;
> +	char *buf;
> +
> +	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
> +		return NULL;
> +
> +	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +	get_page(alloc_frag->page);
> +	alloc_frag->offset += len;
> +
> +	return buf;
> +}
> +
>   static void __free_old_xmit(struct send_queue *sq, bool in_napi,
>   			    struct virtnet_sq_stats *stats)
>   {
> @@ -1190,7 +1206,6 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
>   static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>   			     gfp_t gfp)
>   {
> -	struct page_frag *alloc_frag = &rq->alloc_frag;
>   	char *buf;
>   	unsigned int xdp_headroom = virtnet_get_headroom(vi);
>   	void *ctx = (void *)(unsigned long)xdp_headroom;
> @@ -1199,12 +1214,10 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>   
>   	len = SKB_DATA_ALIGN(len) +
>   	      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
> +	buf = virtnet_alloc_frag(rq, len, gfp);
> +	if (unlikely(!buf))
>   		return -ENOMEM;
>   
> -	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> -	get_page(alloc_frag->page);
> -	alloc_frag->offset += len;
>   	sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom,
>   		    vi->hdr_len + GOOD_PACKET_LEN);
>   	err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> @@ -1295,13 +1308,11 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
>   	 * disabled GSO for XDP, it won't be a big issue.
>   	 */
>   	len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> -	if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> +	buf = virtnet_alloc_frag(rq, len + room, gfp);
> +	if (unlikely(!buf))
>   		return -ENOMEM;
>   
> -	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>   	buf += headroom; /* advance address leaving hole at front of pkt */
> -	get_page(alloc_frag->page);
> -	alloc_frag->offset += len + room;
>   	hole = alloc_frag->size - alloc_frag->offset;
>   	if (hole < len + room) {
>   		/* To avoid internal fragmentation, if there is very likely not


^ permalink raw reply	[flat|nested] 80+ messages in thread
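
The helper factored out in the patch above follows the usual page-frag pattern: refill a backing page when needed, hand out the next chunk at the current offset, and advance the offset; the kernel version additionally takes a page reference (get_page()) per chunk. Below is a rough user-space analogue under that assumption, using a made-up frag_cache type rather than the kernel's page_frag.

/* Rough user-space analogue of virtnet_alloc_frag(): carve buffers out
 * of a larger "page" by bumping an offset, and start a new page when
 * the current one cannot hold the request.  For brevity the old page
 * is simply leaked here; the kernel frees it when its refcount drops.
 */
#include <stdio.h>
#include <stdlib.h>

#define FRAG_SIZE 4096

struct frag_cache {
	char *page;
	size_t offset;
};

static char *frag_alloc(struct frag_cache *c, size_t len)
{
	if (len > FRAG_SIZE)
		return NULL;

	/* "refill": switch to a fresh page when the current one is used up */
	if (!c->page || c->offset + len > FRAG_SIZE) {
		c->page = malloc(FRAG_SIZE);
		if (!c->page)
			return NULL;
		c->offset = 0;
	}

	char *buf = c->page + c->offset;

	c->offset += len;
	return buf;
}

int main(void)
{
	struct frag_cache cache = { 0 };

	for (int i = 0; i < 5; i++) {
		char *buf = frag_alloc(&cache, 1500);

		if (!buf)
			return 1;
		printf("buf %d at page offset %zu\n", i, (size_t)(buf - cache.page));
	}
	return 0;
}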

* Re: [PATCH net-next v5 08/15] virtio-net: split the receive_mergeable function
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-16  7:33     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  7:33 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> receive_mergeable() is too complicated, so split it here. One reason is
> to make the function more readable; the other is that the two resulting
> functions will be called separately in subsequent patches.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   drivers/net/virtio_net.c | 181 ++++++++++++++++++++++++---------------
>   1 file changed, 111 insertions(+), 70 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3fd87bf2b2ad..989aba600e63 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -733,6 +733,109 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
>   	return NULL;
>   }
>   
> +static void merge_drop_follow_bufs(struct net_device *dev,
> +				   struct receive_queue *rq,
> +				   u16 num_buf,
> +				   struct virtnet_rq_stats *stats)


Patch looks good. Nit here: I guess we need a better name. How about
"merge_buffers()" for this and "drop_buffers()" for the next function?

Thanks


> +{
> +	struct page *page;
> +	unsigned int len;
> +	void *buf;
> +
> +	while (num_buf-- > 1) {
> +		buf = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!buf)) {
> +			pr_debug("%s: rx error: %d buffers missing\n",
> +				 dev->name, num_buf);
> +			dev->stats.rx_length_errors++;
> +			break;
> +		}
> +		stats->bytes += len;
> +		page = virt_to_head_page(buf);
> +		put_page(page);
> +	}
> +}
> +
> +static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
> +						 struct virtnet_info *vi,
> +						 struct receive_queue *rq,
> +						 struct sk_buff *head_skb,
> +						 u16 num_buf,
> +						 struct virtnet_rq_stats *stats)
> +{
> +	struct sk_buff *curr_skb;
> +	unsigned int truesize;
> +	unsigned int len, num;
> +	struct page *page;
> +	void *buf, *ctx;
> +	int offset;
> +
> +	curr_skb = head_skb;
> +	num = num_buf;
> +
> +	while (--num_buf) {
> +		int num_skb_frags;
> +
> +		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
> +		if (unlikely(!buf)) {
> +			pr_debug("%s: rx error: %d buffers out of %d missing\n",
> +				 dev->name, num_buf, num);
> +			dev->stats.rx_length_errors++;
> +			goto err_buf;
> +		}
> +
> +		stats->bytes += len;
> +		page = virt_to_head_page(buf);
> +
> +		truesize = mergeable_ctx_to_truesize(ctx);
> +		if (unlikely(len > truesize)) {
> +			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
> +				 dev->name, len, (unsigned long)ctx);
> +			dev->stats.rx_length_errors++;
> +			goto err_skb;
> +		}
> +
> +		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> +		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
> +			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
> +
> +			if (unlikely(!nskb))
> +				goto err_skb;
> +			if (curr_skb == head_skb)
> +				skb_shinfo(curr_skb)->frag_list = nskb;
> +			else
> +				curr_skb->next = nskb;
> +			curr_skb = nskb;
> +			head_skb->truesize += nskb->truesize;
> +			num_skb_frags = 0;
> +		}
> +		if (curr_skb != head_skb) {
> +			head_skb->data_len += len;
> +			head_skb->len += len;
> +			head_skb->truesize += truesize;
> +		}
> +		offset = buf - page_address(page);
> +		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
> +			put_page(page);
> +			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> +					     len, truesize);
> +		} else {
> +			skb_add_rx_frag(curr_skb, num_skb_frags, page,
> +					offset, len, truesize);
> +		}
> +	}
> +
> +	return head_skb;
> +
> +err_skb:
> +	put_page(page);
> +	merge_drop_follow_bufs(dev, rq, num_buf, stats);
> +err_buf:
> +	stats->drops++;
> +	dev_kfree_skb(head_skb);
> +	return NULL;
> +}
> +
>   static struct sk_buff *receive_small(struct net_device *dev,
>   				     struct virtnet_info *vi,
>   				     struct receive_queue *rq,
> @@ -909,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>   	u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
>   	struct page *page = virt_to_head_page(buf);
>   	int offset = buf - page_address(page);
> -	struct sk_buff *head_skb, *curr_skb;
> +	struct sk_buff *head_skb;
>   	struct bpf_prog *xdp_prog;
>   	unsigned int truesize = mergeable_ctx_to_truesize(ctx);
>   	unsigned int headroom = mergeable_ctx_to_headroom(ctx);
> @@ -1054,65 +1157,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>   
>   	head_skb = page_to_skb(vi, rq, page, offset, len, truesize, !xdp_prog,
>   			       metasize, !!headroom);
> -	curr_skb = head_skb;
> -
> -	if (unlikely(!curr_skb))
> +	if (unlikely(!head_skb))
>   		goto err_skb;
> -	while (--num_buf) {
> -		int num_skb_frags;
>   
> -		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
> -		if (unlikely(!buf)) {
> -			pr_debug("%s: rx error: %d buffers out of %d missing\n",
> -				 dev->name, num_buf,
> -				 virtio16_to_cpu(vi->vdev,
> -						 hdr->num_buffers));
> -			dev->stats.rx_length_errors++;
> -			goto err_buf;
> -		}
> -
> -		stats->bytes += len;
> -		page = virt_to_head_page(buf);
> -
> -		truesize = mergeable_ctx_to_truesize(ctx);
> -		if (unlikely(len > truesize)) {
> -			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
> -				 dev->name, len, (unsigned long)ctx);
> -			dev->stats.rx_length_errors++;
> -			goto err_skb;
> -		}
> -
> -		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> -		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
> -			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
> -
> -			if (unlikely(!nskb))
> -				goto err_skb;
> -			if (curr_skb == head_skb)
> -				skb_shinfo(curr_skb)->frag_list = nskb;
> -			else
> -				curr_skb->next = nskb;
> -			curr_skb = nskb;
> -			head_skb->truesize += nskb->truesize;
> -			num_skb_frags = 0;
> -		}
> -		if (curr_skb != head_skb) {
> -			head_skb->data_len += len;
> -			head_skb->len += len;
> -			head_skb->truesize += truesize;
> -		}
> -		offset = buf - page_address(page);
> -		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
> -			put_page(page);
> -			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> -					     len, truesize);
> -		} else {
> -			skb_add_rx_frag(curr_skb, num_skb_frags, page,
> -					offset, len, truesize);
> -		}
> -	}
> +	if (num_buf > 1)
> +		head_skb = merge_receive_follow_bufs(dev, vi, rq, head_skb,
> +						     num_buf, stats);
> +	if (head_skb)
> +		ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
>   
> -	ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
>   	return head_skb;
>   
>   err_xdp:
> @@ -1120,19 +1173,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>   	stats->xdp_drops++;
>   err_skb:
>   	put_page(page);
> -	while (num_buf-- > 1) {
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> -			pr_debug("%s: rx error: %d buffers missing\n",
> -				 dev->name, num_buf);
> -			dev->stats.rx_length_errors++;
> -			break;
> -		}
> -		stats->bytes += len;
> -		page = virt_to_head_page(buf);
> -		put_page(page);
> -	}
> -err_buf:
> +	merge_drop_follow_bufs(dev, rq, num_buf, stats);
>   	stats->drops++;
>   	dev_kfree_skb(head_skb);
>   xdp_xmit:


^ permalink raw reply	[flat|nested] 80+ messages in thread
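
For context on the two helpers split out in the patch above: the header of a mergeable packet announces num_buf buffers, the receive helper pulls the follow buffers and accumulates them, and on an error the remaining buffers still have to be popped and released so the ring does not leak. Here is a hedged user-space sketch of that accounting; the queue contents and the truesize limit are made up for illustration.

/* User-space sketch of the num_buf accounting: pop follow buffers until
 * num_buf is exhausted, and on a bad buffer drop whatever is left
 * (roughly what merge_drop_follow_bufs() does for the error paths).
 */
#include <stdio.h>

#define TRUESIZE 4096
#define QLEN 8

static int queue[QLEN] = { 400, 500, 9000, 300, 200 };	/* buffer lengths */
static int head;

static int get_buf(void)			/* stand-in for virtqueue_get_buf() */
{
	return head < QLEN ? queue[head++] : 0;
}

static void drop_follow_bufs(int num_buf)	/* like merge_drop_follow_bufs() */
{
	while (num_buf-- > 1) {
		int len = get_buf();

		if (!len)
			break;
		printf("dropped follow buf of %d bytes\n", len);
	}
}

static int receive_follow_bufs(int num_buf)	/* like merge_receive_follow_bufs() */
{
	int bytes = 0;

	while (--num_buf) {
		int len = get_buf();

		if (!len || len > TRUESIZE) {
			drop_follow_bufs(num_buf);
			return -1;
		}
		bytes += len;	/* the driver appends the page as an skb frag here */
	}
	return bytes;
}

int main(void)
{
	int head_len = get_buf();		/* head buffer, 400 bytes */
	int rest = receive_follow_bufs(5);	/* header claimed 5 buffers */

	if (rest < 0)
		printf("packet with %d-byte head dropped\n", head_len);
	else
		printf("merged packet of %d bytes\n", head_len + rest);
	return 0;
}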

* Re: [PATCH net-next v5 10/15] virtio-net: independent directory
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-16  7:34     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  7:34 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> Create a separate directory for virtio-net. AF_XDP support will be added
> later, and a separate xsk.c file will be added.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   MAINTAINERS                           |  2 +-
>   drivers/net/Kconfig                   |  8 +-------
>   drivers/net/Makefile                  |  2 +-
>   drivers/net/virtio/Kconfig            | 11 +++++++++++
>   drivers/net/virtio/Makefile           |  6 ++++++
>   drivers/net/{ => virtio}/virtio_net.c |  0
>   6 files changed, 20 insertions(+), 9 deletions(-)
>   create mode 100644 drivers/net/virtio/Kconfig
>   create mode 100644 drivers/net/virtio/Makefile
>   rename drivers/net/{ => virtio}/virtio_net.c (100%)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e69c1991ec3b..2041267f19f1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19344,7 +19344,7 @@ S:	Maintained
>   F:	Documentation/devicetree/bindings/virtio/
>   F:	drivers/block/virtio_blk.c
>   F:	drivers/crypto/virtio/
> -F:	drivers/net/virtio_net.c
> +F:	drivers/net/virtio/
>   F:	drivers/vdpa/
>   F:	drivers/virtio/
>   F:	include/linux/vdpa.h
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 4da68ba8448f..2297fe4183ae 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -392,13 +392,7 @@ config VETH
>   	  When one end receives the packet it appears on its pair and vice
>   	  versa.
>   
> -config VIRTIO_NET
> -	tristate "Virtio network driver"
> -	depends on VIRTIO
> -	select NET_FAILOVER
> -	help
> -	  This is the virtual network driver for virtio.  It can be used with
> -	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
> +source "drivers/net/virtio/Kconfig"
>   
>   config NLMON
>   	tristate "Virtual netlink monitoring device"
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 7ffd2d03efaf..c4c7419e0398 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -28,7 +28,7 @@ obj-$(CONFIG_NET_TEAM) += team/
>   obj-$(CONFIG_TUN) += tun.o
>   obj-$(CONFIG_TAP) += tap.o
>   obj-$(CONFIG_VETH) += veth.o
> -obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> +obj-$(CONFIG_VIRTIO_NET) += virtio/
>   obj-$(CONFIG_VXLAN) += vxlan.o
>   obj-$(CONFIG_GENEVE) += geneve.o
>   obj-$(CONFIG_BAREUDP) += bareudp.o
> diff --git a/drivers/net/virtio/Kconfig b/drivers/net/virtio/Kconfig
> new file mode 100644
> index 000000000000..9bc2a2fc6c3e
> --- /dev/null
> +++ b/drivers/net/virtio/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# virtio-net device configuration
> +#
> +config VIRTIO_NET
> +	tristate "Virtio network driver"
> +	depends on VIRTIO
> +	select NET_FAILOVER
> +	help
> +	  This is the virtual network driver for virtio.  It can be used with
> +	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> new file mode 100644
> index 000000000000..ccc80f40f33a
> --- /dev/null
> +++ b/drivers/net/virtio/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for the virtio network device drivers.
> +#
> +
> +obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio/virtio_net.c
> similarity index 100%
> rename from drivers/net/virtio_net.c
> rename to drivers/net/virtio/virtio_net.c


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 11/15] virtio-net: move to virtio_net.h
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-16  7:35     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  7:35 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> Move some structure definitions and inline functions into the
> virtio_net.h file.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/virtio/virtio_net.c | 225 +------------------------------
>   drivers/net/virtio/virtio_net.h | 230 ++++++++++++++++++++++++++++++++
>   2 files changed, 232 insertions(+), 223 deletions(-)
>   create mode 100644 drivers/net/virtio/virtio_net.h
>
> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> index 953739860563..395ec1f18331 100644
> --- a/drivers/net/virtio/virtio_net.c
> +++ b/drivers/net/virtio/virtio_net.c
> @@ -4,24 +4,8 @@
>    * Copyright 2007 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
>    */
>   //#define DEBUG
> -#include <linux/netdevice.h>
> -#include <linux/etherdevice.h>
> -#include <linux/ethtool.h>
> -#include <linux/module.h>
> -#include <linux/virtio.h>
> -#include <linux/virtio_net.h>
> -#include <linux/bpf.h>
> -#include <linux/bpf_trace.h>
> -#include <linux/scatterlist.h>
> -#include <linux/if_vlan.h>
> -#include <linux/slab.h>
> -#include <linux/cpu.h>
> -#include <linux/average.h>
> -#include <linux/filter.h>
> -#include <linux/kernel.h>
> -#include <net/route.h>
> -#include <net/xdp.h>
> -#include <net/net_failover.h>
> +
> +#include "virtio_net.h"
>   
>   static int napi_weight = NAPI_POLL_WEIGHT;
>   module_param(napi_weight, int, 0444);
> @@ -44,15 +28,6 @@ module_param(napi_tx, bool, 0644);
>   #define VIRTIO_XDP_TX		BIT(0)
>   #define VIRTIO_XDP_REDIR	BIT(1)
>   
> -#define VIRTIO_XDP_FLAG	BIT(0)
> -
> -/* RX packet size EWMA. The average packet size is used to determine the packet
> - * buffer size when refilling RX rings. As the entire RX ring may be refilled
> - * at once, the weight is chosen so that the EWMA will be insensitive to short-
> - * term, transient changes in packet size.
> - */
> -DECLARE_EWMA(pkt_len, 0, 64)
> -
>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>   
>   static const unsigned long guest_offloads[] = {
> @@ -68,35 +43,6 @@ static const unsigned long guest_offloads[] = {
>   				(1ULL << VIRTIO_NET_F_GUEST_ECN)  | \
>   				(1ULL << VIRTIO_NET_F_GUEST_UFO))
>   
> -struct virtnet_stat_desc {
> -	char desc[ETH_GSTRING_LEN];
> -	size_t offset;
> -};
> -
> -struct virtnet_sq_stats {
> -	struct u64_stats_sync syncp;
> -	u64 packets;
> -	u64 bytes;
> -	u64 xdp_tx;
> -	u64 xdp_tx_drops;
> -	u64 kicks;
> -};
> -
> -struct virtnet_rq_stats {
> -	struct u64_stats_sync syncp;
> -	u64 packets;
> -	u64 bytes;
> -	u64 drops;
> -	u64 xdp_packets;
> -	u64 xdp_tx;
> -	u64 xdp_redirects;
> -	u64 xdp_drops;
> -	u64 kicks;
> -};
> -
> -#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
> -#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
> -
>   static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = {
>   	{ "packets",		VIRTNET_SQ_STAT(packets) },
>   	{ "bytes",		VIRTNET_SQ_STAT(bytes) },
> @@ -119,54 +65,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = {
>   #define VIRTNET_SQ_STATS_LEN	ARRAY_SIZE(virtnet_sq_stats_desc)
>   #define VIRTNET_RQ_STATS_LEN	ARRAY_SIZE(virtnet_rq_stats_desc)
>   
> -/* Internal representation of a send virtqueue */
> -struct send_queue {
> -	/* Virtqueue associated with this send _queue */
> -	struct virtqueue *vq;
> -
> -	/* TX: fragments + linear part + virtio header */
> -	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> -
> -	/* Name of the send queue: output.$index */
> -	char name[40];
> -
> -	struct virtnet_sq_stats stats;
> -
> -	struct napi_struct napi;
> -};
> -
> -/* Internal representation of a receive virtqueue */
> -struct receive_queue {
> -	/* Virtqueue associated with this receive_queue */
> -	struct virtqueue *vq;
> -
> -	struct napi_struct napi;
> -
> -	struct bpf_prog __rcu *xdp_prog;
> -
> -	struct virtnet_rq_stats stats;
> -
> -	/* Chain pages by the private ptr. */
> -	struct page *pages;
> -
> -	/* Average packet length for mergeable receive buffers. */
> -	struct ewma_pkt_len mrg_avg_pkt_len;
> -
> -	/* Page frag for packet buffer allocation. */
> -	struct page_frag alloc_frag;
> -
> -	/* RX: fragments + linear part + virtio header */
> -	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> -
> -	/* Min single buffer size for mergeable buffers case. */
> -	unsigned int min_buf_len;
> -
> -	/* Name of this receive queue: input.$index */
> -	char name[40];
> -
> -	struct xdp_rxq_info xdp_rxq;
> -};
> -
>   /* Control VQ buffers: protected by the rtnl lock */
>   struct control_buf {
>   	struct virtio_net_ctrl_hdr hdr;
> @@ -178,67 +76,6 @@ struct control_buf {
>   	__virtio64 offloads;
>   };
>   
> -struct virtnet_info {
> -	struct virtio_device *vdev;
> -	struct virtqueue *cvq;
> -	struct net_device *dev;
> -	struct send_queue *sq;
> -	struct receive_queue *rq;
> -	unsigned int status;
> -
> -	/* Max # of queue pairs supported by the device */
> -	u16 max_queue_pairs;
> -
> -	/* # of queue pairs currently used by the driver */
> -	u16 curr_queue_pairs;
> -
> -	/* # of XDP queue pairs currently used by the driver */
> -	u16 xdp_queue_pairs;
> -
> -	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
> -	bool xdp_enabled;
> -
> -	/* I like... big packets and I cannot lie! */
> -	bool big_packets;
> -
> -	/* Host will merge rx buffers for big packets (shake it! shake it!) */
> -	bool mergeable_rx_bufs;
> -
> -	/* Has control virtqueue */
> -	bool has_cvq;
> -
> -	/* Host can handle any s/g split between our header and packet data */
> -	bool any_header_sg;
> -
> -	/* Packet virtio header size */
> -	u8 hdr_len;
> -
> -	/* Work struct for refilling if we run low on memory. */
> -	struct delayed_work refill;
> -
> -	/* Work struct for config space updates */
> -	struct work_struct config_work;
> -
> -	/* Does the affinity hint is set for virtqueues? */
> -	bool affinity_hint_set;
> -
> -	/* CPU hotplug instances for online & dead */
> -	struct hlist_node node;
> -	struct hlist_node node_dead;
> -
> -	struct control_buf *ctrl;
> -
> -	/* Ethtool settings */
> -	u8 duplex;
> -	u32 speed;
> -
> -	unsigned long guest_offloads;
> -	unsigned long guest_offloads_capable;
> -
> -	/* failover when STANDBY feature enabled */
> -	struct failover *failover;
> -};
> -
>   struct padded_vnet_hdr {
>   	struct virtio_net_hdr_mrg_rxbuf hdr;
>   	/*
> @@ -249,21 +86,6 @@ struct padded_vnet_hdr {
>   	char padding[4];
>   };
>   
> -static bool is_xdp_frame(void *ptr)
> -{
> -	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> -}
> -
> -static void *xdp_to_ptr(struct xdp_frame *ptr)
> -{
> -	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> -}
> -
> -static struct xdp_frame *ptr_to_xdp(void *ptr)
> -{
> -	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> -}
> -
>   static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
>   				int gfp)
>   {
> @@ -280,30 +102,6 @@ static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
>   	return buf;
>   }
>   
> -static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> -			    struct virtnet_sq_stats *stats)
> -{
> -	unsigned int len;
> -	void *ptr;
> -
> -	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (!is_xdp_frame(ptr)) {
> -			struct sk_buff *skb = ptr;
> -
> -			pr_debug("Sent skb %p\n", skb);
> -
> -			stats->bytes += skb->len;
> -			napi_consume_skb(skb, in_napi);
> -		} else {
> -			struct xdp_frame *frame = ptr_to_xdp(ptr);
> -
> -			stats->bytes += frame->len;
> -			xdp_return_frame(frame);
> -		}
> -		stats->packets++;
> -	}
> -}
> -
>   /* Converting between virtqueue no. and kernel tx/rx queue no.
>    * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>    */
> @@ -359,15 +157,6 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>   	return p;
>   }
>   
> -static void virtqueue_napi_schedule(struct napi_struct *napi,
> -				    struct virtqueue *vq)
> -{
> -	if (napi_schedule_prep(napi)) {
> -		virtqueue_disable_cb(vq);
> -		__napi_schedule(napi);
> -	}
> -}
> -
>   static void virtqueue_napi_complete(struct napi_struct *napi,
>   				    struct virtqueue *vq, int processed)
>   {
> @@ -1537,16 +1326,6 @@ static void free_old_xmit(struct send_queue *sq, bool in_napi)
>   	u64_stats_update_end(&sq->stats.syncp);
>   }
>   
> -static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> -{
> -	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
> -		return false;
> -	else if (q < vi->curr_queue_pairs)
> -		return true;
> -	else
> -		return false;
> -}
> -
>   static void virtnet_poll_cleantx(struct receive_queue *rq)
>   {
>   	struct virtnet_info *vi = rq->vq->vdev->priv;
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> new file mode 100644
> index 000000000000..931cc81f92fb
> --- /dev/null
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -0,0 +1,230 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +#ifndef __VIRTIO_NET_H__
> +#define __VIRTIO_NET_H__
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/module.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_net.h>
> +#include <linux/bpf.h>
> +#include <linux/bpf_trace.h>
> +#include <linux/scatterlist.h>
> +#include <linux/if_vlan.h>
> +#include <linux/slab.h>
> +#include <linux/cpu.h>
> +#include <linux/average.h>
> +#include <linux/filter.h>
> +#include <linux/kernel.h>
> +#include <net/route.h>
> +#include <net/xdp.h>
> +#include <net/net_failover.h>
> +#include <net/xdp_sock_drv.h>
> +
> +#define VIRTIO_XDP_FLAG	BIT(0)
> +
> +struct virtnet_info {
> +	struct virtio_device *vdev;
> +	struct virtqueue *cvq;
> +	struct net_device *dev;
> +	struct send_queue *sq;
> +	struct receive_queue *rq;
> +	unsigned int status;
> +
> +	/* Max # of queue pairs supported by the device */
> +	u16 max_queue_pairs;
> +
> +	/* # of queue pairs currently used by the driver */
> +	u16 curr_queue_pairs;
> +
> +	/* # of XDP queue pairs currently used by the driver */
> +	u16 xdp_queue_pairs;
> +
> +	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
> +	bool xdp_enabled;
> +
> +	/* I like... big packets and I cannot lie! */
> +	bool big_packets;
> +
> +	/* Host will merge rx buffers for big packets (shake it! shake it!) */
> +	bool mergeable_rx_bufs;
> +
> +	/* Has control virtqueue */
> +	bool has_cvq;
> +
> +	/* Host can handle any s/g split between our header and packet data */
> +	bool any_header_sg;
> +
> +	/* Packet virtio header size */
> +	u8 hdr_len;
> +
> +	/* Work struct for refilling if we run low on memory. */
> +	struct delayed_work refill;
> +
> +	/* Work struct for config space updates */
> +	struct work_struct config_work;
> +
> +	/* Does the affinity hint is set for virtqueues? */
> +	bool affinity_hint_set;
> +
> +	/* CPU hotplug instances for online & dead */
> +	struct hlist_node node;
> +	struct hlist_node node_dead;
> +
> +	struct control_buf *ctrl;
> +
> +	/* Ethtool settings */
> +	u8 duplex;
> +	u32 speed;
> +
> +	unsigned long guest_offloads;
> +	unsigned long guest_offloads_capable;
> +
> +	/* failover when STANDBY feature enabled */
> +	struct failover *failover;
> +};
> +
> +/* RX packet size EWMA. The average packet size is used to determine the packet
> + * buffer size when refilling RX rings. As the entire RX ring may be refilled
> + * at once, the weight is chosen so that the EWMA will be insensitive to short-
> + * term, transient changes in packet size.
> + */
> +DECLARE_EWMA(pkt_len, 0, 64)
> +
> +struct virtnet_stat_desc {
> +	char desc[ETH_GSTRING_LEN];
> +	size_t offset;
> +};
> +
> +struct virtnet_sq_stats {
> +	struct u64_stats_sync syncp;
> +	u64 packets;
> +	u64 bytes;
> +	u64 xdp_tx;
> +	u64 xdp_tx_drops;
> +	u64 kicks;
> +};
> +
> +struct virtnet_rq_stats {
> +	struct u64_stats_sync syncp;
> +	u64 packets;
> +	u64 bytes;
> +	u64 drops;
> +	u64 xdp_packets;
> +	u64 xdp_tx;
> +	u64 xdp_redirects;
> +	u64 xdp_drops;
> +	u64 kicks;
> +};
> +
> +#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
> +#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
> +
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> +	/* Virtqueue associated with this send _queue */
> +	struct virtqueue *vq;
> +
> +	/* TX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +
> +	/* Name of the send queue: output.$index */
> +	char name[40];
> +
> +	struct virtnet_sq_stats stats;
> +
> +	struct napi_struct napi;
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> +	/* Virtqueue associated with this receive_queue */
> +	struct virtqueue *vq;
> +
> +	struct napi_struct napi;
> +
> +	struct bpf_prog __rcu *xdp_prog;
> +
> +	struct virtnet_rq_stats stats;
> +
> +	/* Chain pages by the private ptr. */
> +	struct page *pages;
> +
> +	/* Average packet length for mergeable receive buffers. */
> +	struct ewma_pkt_len mrg_avg_pkt_len;
> +
> +	/* Page frag for packet buffer allocation. */
> +	struct page_frag alloc_frag;
> +
> +	/* RX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +
> +	/* Min single buffer size for mergeable buffers case. */
> +	unsigned int min_buf_len;
> +
> +	/* Name of this receive queue: input.$index */
> +	char name[40];
> +
> +	struct xdp_rxq_info xdp_rxq;
> +};
> +
> +static inline bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> +{
> +	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
> +		return false;
> +	else if (q < vi->curr_queue_pairs)
> +		return true;
> +	else
> +		return false;
> +}
> +
> +static inline void virtqueue_napi_schedule(struct napi_struct *napi,
> +					   struct virtqueue *vq)
> +{
> +	if (napi_schedule_prep(napi)) {
> +		virtqueue_disable_cb(vq);
> +		__napi_schedule(napi);
> +	}
> +}
> +
> +static inline bool is_xdp_frame(void *ptr)
> +{
> +	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> +}
> +
> +static inline void *xdp_to_ptr(struct xdp_frame *ptr)
> +{
> +	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> +}
> +
> +static inline struct xdp_frame *ptr_to_xdp(void *ptr)
> +{
> +	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> +}
> +
> +static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
> +				   struct virtnet_sq_stats *stats)
> +{
> +	unsigned int len;
> +	void *ptr;
> +
> +	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		if (!is_xdp_frame(ptr)) {
> +			struct sk_buff *skb = ptr;
> +
> +			pr_debug("Sent skb %p\n", skb);
> +
> +			stats->bytes += skb->len;
> +			napi_consume_skb(skb, in_napi);
> +		} else {
> +			struct xdp_frame *frame = ptr_to_xdp(ptr);
> +
> +			stats->bytes += frame->len;
> +			xdp_return_frame(frame);
> +		}
> +		stats->packets++;
> +	}
> +}
> +
> +#endif


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 11/15] virtio-net: move to virtio_net.h
@ 2021-06-16  7:35     ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  7:35 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> Move some structure definitions and inline functions into the
> virtio_net.h file.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/virtio/virtio_net.c | 225 +------------------------------
>   drivers/net/virtio/virtio_net.h | 230 ++++++++++++++++++++++++++++++++
>   2 files changed, 232 insertions(+), 223 deletions(-)
>   create mode 100644 drivers/net/virtio/virtio_net.h
>
> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> index 953739860563..395ec1f18331 100644
> --- a/drivers/net/virtio/virtio_net.c
> +++ b/drivers/net/virtio/virtio_net.c
> @@ -4,24 +4,8 @@
>    * Copyright 2007 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
>    */
>   //#define DEBUG
> -#include <linux/netdevice.h>
> -#include <linux/etherdevice.h>
> -#include <linux/ethtool.h>
> -#include <linux/module.h>
> -#include <linux/virtio.h>
> -#include <linux/virtio_net.h>
> -#include <linux/bpf.h>
> -#include <linux/bpf_trace.h>
> -#include <linux/scatterlist.h>
> -#include <linux/if_vlan.h>
> -#include <linux/slab.h>
> -#include <linux/cpu.h>
> -#include <linux/average.h>
> -#include <linux/filter.h>
> -#include <linux/kernel.h>
> -#include <net/route.h>
> -#include <net/xdp.h>
> -#include <net/net_failover.h>
> +
> +#include "virtio_net.h"
>   
>   static int napi_weight = NAPI_POLL_WEIGHT;
>   module_param(napi_weight, int, 0444);
> @@ -44,15 +28,6 @@ module_param(napi_tx, bool, 0644);
>   #define VIRTIO_XDP_TX		BIT(0)
>   #define VIRTIO_XDP_REDIR	BIT(1)
>   
> -#define VIRTIO_XDP_FLAG	BIT(0)
> -
> -/* RX packet size EWMA. The average packet size is used to determine the packet
> - * buffer size when refilling RX rings. As the entire RX ring may be refilled
> - * at once, the weight is chosen so that the EWMA will be insensitive to short-
> - * term, transient changes in packet size.
> - */
> -DECLARE_EWMA(pkt_len, 0, 64)
> -
>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>   
>   static const unsigned long guest_offloads[] = {
> @@ -68,35 +43,6 @@ static const unsigned long guest_offloads[] = {
>   				(1ULL << VIRTIO_NET_F_GUEST_ECN)  | \
>   				(1ULL << VIRTIO_NET_F_GUEST_UFO))
>   
> -struct virtnet_stat_desc {
> -	char desc[ETH_GSTRING_LEN];
> -	size_t offset;
> -};
> -
> -struct virtnet_sq_stats {
> -	struct u64_stats_sync syncp;
> -	u64 packets;
> -	u64 bytes;
> -	u64 xdp_tx;
> -	u64 xdp_tx_drops;
> -	u64 kicks;
> -};
> -
> -struct virtnet_rq_stats {
> -	struct u64_stats_sync syncp;
> -	u64 packets;
> -	u64 bytes;
> -	u64 drops;
> -	u64 xdp_packets;
> -	u64 xdp_tx;
> -	u64 xdp_redirects;
> -	u64 xdp_drops;
> -	u64 kicks;
> -};
> -
> -#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
> -#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
> -
>   static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = {
>   	{ "packets",		VIRTNET_SQ_STAT(packets) },
>   	{ "bytes",		VIRTNET_SQ_STAT(bytes) },
> @@ -119,54 +65,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = {
>   #define VIRTNET_SQ_STATS_LEN	ARRAY_SIZE(virtnet_sq_stats_desc)
>   #define VIRTNET_RQ_STATS_LEN	ARRAY_SIZE(virtnet_rq_stats_desc)
>   
> -/* Internal representation of a send virtqueue */
> -struct send_queue {
> -	/* Virtqueue associated with this send _queue */
> -	struct virtqueue *vq;
> -
> -	/* TX: fragments + linear part + virtio header */
> -	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> -
> -	/* Name of the send queue: output.$index */
> -	char name[40];
> -
> -	struct virtnet_sq_stats stats;
> -
> -	struct napi_struct napi;
> -};
> -
> -/* Internal representation of a receive virtqueue */
> -struct receive_queue {
> -	/* Virtqueue associated with this receive_queue */
> -	struct virtqueue *vq;
> -
> -	struct napi_struct napi;
> -
> -	struct bpf_prog __rcu *xdp_prog;
> -
> -	struct virtnet_rq_stats stats;
> -
> -	/* Chain pages by the private ptr. */
> -	struct page *pages;
> -
> -	/* Average packet length for mergeable receive buffers. */
> -	struct ewma_pkt_len mrg_avg_pkt_len;
> -
> -	/* Page frag for packet buffer allocation. */
> -	struct page_frag alloc_frag;
> -
> -	/* RX: fragments + linear part + virtio header */
> -	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> -
> -	/* Min single buffer size for mergeable buffers case. */
> -	unsigned int min_buf_len;
> -
> -	/* Name of this receive queue: input.$index */
> -	char name[40];
> -
> -	struct xdp_rxq_info xdp_rxq;
> -};
> -
>   /* Control VQ buffers: protected by the rtnl lock */
>   struct control_buf {
>   	struct virtio_net_ctrl_hdr hdr;
> @@ -178,67 +76,6 @@ struct control_buf {
>   	__virtio64 offloads;
>   };
>   
> -struct virtnet_info {
> -	struct virtio_device *vdev;
> -	struct virtqueue *cvq;
> -	struct net_device *dev;
> -	struct send_queue *sq;
> -	struct receive_queue *rq;
> -	unsigned int status;
> -
> -	/* Max # of queue pairs supported by the device */
> -	u16 max_queue_pairs;
> -
> -	/* # of queue pairs currently used by the driver */
> -	u16 curr_queue_pairs;
> -
> -	/* # of XDP queue pairs currently used by the driver */
> -	u16 xdp_queue_pairs;
> -
> -	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
> -	bool xdp_enabled;
> -
> -	/* I like... big packets and I cannot lie! */
> -	bool big_packets;
> -
> -	/* Host will merge rx buffers for big packets (shake it! shake it!) */
> -	bool mergeable_rx_bufs;
> -
> -	/* Has control virtqueue */
> -	bool has_cvq;
> -
> -	/* Host can handle any s/g split between our header and packet data */
> -	bool any_header_sg;
> -
> -	/* Packet virtio header size */
> -	u8 hdr_len;
> -
> -	/* Work struct for refilling if we run low on memory. */
> -	struct delayed_work refill;
> -
> -	/* Work struct for config space updates */
> -	struct work_struct config_work;
> -
> -	/* Does the affinity hint is set for virtqueues? */
> -	bool affinity_hint_set;
> -
> -	/* CPU hotplug instances for online & dead */
> -	struct hlist_node node;
> -	struct hlist_node node_dead;
> -
> -	struct control_buf *ctrl;
> -
> -	/* Ethtool settings */
> -	u8 duplex;
> -	u32 speed;
> -
> -	unsigned long guest_offloads;
> -	unsigned long guest_offloads_capable;
> -
> -	/* failover when STANDBY feature enabled */
> -	struct failover *failover;
> -};
> -
>   struct padded_vnet_hdr {
>   	struct virtio_net_hdr_mrg_rxbuf hdr;
>   	/*
> @@ -249,21 +86,6 @@ struct padded_vnet_hdr {
>   	char padding[4];
>   };
>   
> -static bool is_xdp_frame(void *ptr)
> -{
> -	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> -}
> -
> -static void *xdp_to_ptr(struct xdp_frame *ptr)
> -{
> -	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> -}
> -
> -static struct xdp_frame *ptr_to_xdp(void *ptr)
> -{
> -	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> -}
> -
>   static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
>   				int gfp)
>   {
> @@ -280,30 +102,6 @@ static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len,
>   	return buf;
>   }
>   
> -static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> -			    struct virtnet_sq_stats *stats)
> -{
> -	unsigned int len;
> -	void *ptr;
> -
> -	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (!is_xdp_frame(ptr)) {
> -			struct sk_buff *skb = ptr;
> -
> -			pr_debug("Sent skb %p\n", skb);
> -
> -			stats->bytes += skb->len;
> -			napi_consume_skb(skb, in_napi);
> -		} else {
> -			struct xdp_frame *frame = ptr_to_xdp(ptr);
> -
> -			stats->bytes += frame->len;
> -			xdp_return_frame(frame);
> -		}
> -		stats->packets++;
> -	}
> -}
> -
>   /* Converting between virtqueue no. and kernel tx/rx queue no.
>    * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>    */
> @@ -359,15 +157,6 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>   	return p;
>   }
>   
> -static void virtqueue_napi_schedule(struct napi_struct *napi,
> -				    struct virtqueue *vq)
> -{
> -	if (napi_schedule_prep(napi)) {
> -		virtqueue_disable_cb(vq);
> -		__napi_schedule(napi);
> -	}
> -}
> -
>   static void virtqueue_napi_complete(struct napi_struct *napi,
>   				    struct virtqueue *vq, int processed)
>   {
> @@ -1537,16 +1326,6 @@ static void free_old_xmit(struct send_queue *sq, bool in_napi)
>   	u64_stats_update_end(&sq->stats.syncp);
>   }
>   
> -static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> -{
> -	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
> -		return false;
> -	else if (q < vi->curr_queue_pairs)
> -		return true;
> -	else
> -		return false;
> -}
> -
>   static void virtnet_poll_cleantx(struct receive_queue *rq)
>   {
>   	struct virtnet_info *vi = rq->vq->vdev->priv;
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> new file mode 100644
> index 000000000000..931cc81f92fb
> --- /dev/null
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -0,0 +1,230 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +#ifndef __VIRTIO_NET_H__
> +#define __VIRTIO_NET_H__
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/module.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_net.h>
> +#include <linux/bpf.h>
> +#include <linux/bpf_trace.h>
> +#include <linux/scatterlist.h>
> +#include <linux/if_vlan.h>
> +#include <linux/slab.h>
> +#include <linux/cpu.h>
> +#include <linux/average.h>
> +#include <linux/filter.h>
> +#include <linux/kernel.h>
> +#include <net/route.h>
> +#include <net/xdp.h>
> +#include <net/net_failover.h>
> +#include <net/xdp_sock_drv.h>
> +
> +#define VIRTIO_XDP_FLAG	BIT(0)
> +
> +struct virtnet_info {
> +	struct virtio_device *vdev;
> +	struct virtqueue *cvq;
> +	struct net_device *dev;
> +	struct send_queue *sq;
> +	struct receive_queue *rq;
> +	unsigned int status;
> +
> +	/* Max # of queue pairs supported by the device */
> +	u16 max_queue_pairs;
> +
> +	/* # of queue pairs currently used by the driver */
> +	u16 curr_queue_pairs;
> +
> +	/* # of XDP queue pairs currently used by the driver */
> +	u16 xdp_queue_pairs;
> +
> +	/* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */
> +	bool xdp_enabled;
> +
> +	/* I like... big packets and I cannot lie! */
> +	bool big_packets;
> +
> +	/* Host will merge rx buffers for big packets (shake it! shake it!) */
> +	bool mergeable_rx_bufs;
> +
> +	/* Has control virtqueue */
> +	bool has_cvq;
> +
> +	/* Host can handle any s/g split between our header and packet data */
> +	bool any_header_sg;
> +
> +	/* Packet virtio header size */
> +	u8 hdr_len;
> +
> +	/* Work struct for refilling if we run low on memory. */
> +	struct delayed_work refill;
> +
> +	/* Work struct for config space updates */
> +	struct work_struct config_work;
> +
> +	/* Does the affinity hint is set for virtqueues? */
> +	bool affinity_hint_set;
> +
> +	/* CPU hotplug instances for online & dead */
> +	struct hlist_node node;
> +	struct hlist_node node_dead;
> +
> +	struct control_buf *ctrl;
> +
> +	/* Ethtool settings */
> +	u8 duplex;
> +	u32 speed;
> +
> +	unsigned long guest_offloads;
> +	unsigned long guest_offloads_capable;
> +
> +	/* failover when STANDBY feature enabled */
> +	struct failover *failover;
> +};
> +
> +/* RX packet size EWMA. The average packet size is used to determine the packet
> + * buffer size when refilling RX rings. As the entire RX ring may be refilled
> + * at once, the weight is chosen so that the EWMA will be insensitive to short-
> + * term, transient changes in packet size.
> + */
> +DECLARE_EWMA(pkt_len, 0, 64)
> +
> +struct virtnet_stat_desc {
> +	char desc[ETH_GSTRING_LEN];
> +	size_t offset;
> +};
> +
> +struct virtnet_sq_stats {
> +	struct u64_stats_sync syncp;
> +	u64 packets;
> +	u64 bytes;
> +	u64 xdp_tx;
> +	u64 xdp_tx_drops;
> +	u64 kicks;
> +};
> +
> +struct virtnet_rq_stats {
> +	struct u64_stats_sync syncp;
> +	u64 packets;
> +	u64 bytes;
> +	u64 drops;
> +	u64 xdp_packets;
> +	u64 xdp_tx;
> +	u64 xdp_redirects;
> +	u64 xdp_drops;
> +	u64 kicks;
> +};
> +
> +#define VIRTNET_SQ_STAT(m)	offsetof(struct virtnet_sq_stats, m)
> +#define VIRTNET_RQ_STAT(m)	offsetof(struct virtnet_rq_stats, m)
> +
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> +	/* Virtqueue associated with this send _queue */
> +	struct virtqueue *vq;
> +
> +	/* TX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +
> +	/* Name of the send queue: output.$index */
> +	char name[40];
> +
> +	struct virtnet_sq_stats stats;
> +
> +	struct napi_struct napi;
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> +	/* Virtqueue associated with this receive_queue */
> +	struct virtqueue *vq;
> +
> +	struct napi_struct napi;
> +
> +	struct bpf_prog __rcu *xdp_prog;
> +
> +	struct virtnet_rq_stats stats;
> +
> +	/* Chain pages by the private ptr. */
> +	struct page *pages;
> +
> +	/* Average packet length for mergeable receive buffers. */
> +	struct ewma_pkt_len mrg_avg_pkt_len;
> +
> +	/* Page frag for packet buffer allocation. */
> +	struct page_frag alloc_frag;
> +
> +	/* RX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +
> +	/* Min single buffer size for mergeable buffers case. */
> +	unsigned int min_buf_len;
> +
> +	/* Name of this receive queue: input.$index */
> +	char name[40];
> +
> +	struct xdp_rxq_info xdp_rxq;
> +};
> +
> +static inline bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> +{
> +	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
> +		return false;
> +	else if (q < vi->curr_queue_pairs)
> +		return true;
> +	else
> +		return false;
> +}
> +
> +static inline void virtqueue_napi_schedule(struct napi_struct *napi,
> +					   struct virtqueue *vq)
> +{
> +	if (napi_schedule_prep(napi)) {
> +		virtqueue_disable_cb(vq);
> +		__napi_schedule(napi);
> +	}
> +}
> +
> +static inline bool is_xdp_frame(void *ptr)
> +{
> +	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> +}
> +
> +static inline void *xdp_to_ptr(struct xdp_frame *ptr)
> +{
> +	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> +}
> +
> +static inline struct xdp_frame *ptr_to_xdp(void *ptr)
> +{
> +	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> +}
> +
> +static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
> +				   struct virtnet_sq_stats *stats)
> +{
> +	unsigned int len;
> +	void *ptr;
> +
> +	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		if (!is_xdp_frame(ptr)) {
> +			struct sk_buff *skb = ptr;
> +
> +			pr_debug("Sent skb %p\n", skb);
> +
> +			stats->bytes += skb->len;
> +			napi_consume_skb(skb, in_napi);
> +		} else {
> +			struct xdp_frame *frame = ptr_to_xdp(ptr);
> +
> +			stats->bytes += frame->len;
> +			xdp_return_frame(frame);
> +		}
> +		stats->packets++;
> +	}
> +}
> +
> +#endif


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 08/15] virtio-net: split the receive_mergeable function
  2021-06-16  7:33     ` Jason Wang
  (?)
@ 2021-06-16  7:52     ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-16  7:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Wed, 16 Jun 2021 15:33:05 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> > receive_mergeable() is too complicated, so this function is split here.
> > One is to make the function more readable. On the other hand, the two
> > independent functions will be called separately in subsequent patches.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >   drivers/net/virtio_net.c | 181 ++++++++++++++++++++++++---------------
> >   1 file changed, 111 insertions(+), 70 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 3fd87bf2b2ad..989aba600e63 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -733,6 +733,109 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
> >   	return NULL;
> >   }
> >
> > +static void merge_drop_follow_bufs(struct net_device *dev,
> > +				   struct receive_queue *rq,
> > +				   u16 num_buf,
> > +				   struct virtnet_rq_stats *stats)
>
>
> Patch looks good. Nit here, I guess we need a better name, how about
> "merge_buffers()" for this and "drop_buffers()" for the next function?

The name sounds good.

Thanks.

>
> Thanks
>
>
> > +{
> > +	struct page *page;
> > +	unsigned int len;
> > +	void *buf;
> > +
> > +	while (num_buf-- > 1) {
> > +		buf = virtqueue_get_buf(rq->vq, &len);
> > +		if (unlikely(!buf)) {
> > +			pr_debug("%s: rx error: %d buffers missing\n",
> > +				 dev->name, num_buf);
> > +			dev->stats.rx_length_errors++;
> > +			break;
> > +		}
> > +		stats->bytes += len;
> > +		page = virt_to_head_page(buf);
> > +		put_page(page);
> > +	}
> > +}
> > +
> > +static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev,
> > +						 struct virtnet_info *vi,
> > +						 struct receive_queue *rq,
> > +						 struct sk_buff *head_skb,
> > +						 u16 num_buf,
> > +						 struct virtnet_rq_stats *stats)
> > +{
> > +	struct sk_buff *curr_skb;
> > +	unsigned int truesize;
> > +	unsigned int len, num;
> > +	struct page *page;
> > +	void *buf, *ctx;
> > +	int offset;
> > +
> > +	curr_skb = head_skb;
> > +	num = num_buf;
> > +
> > +	while (--num_buf) {
> > +		int num_skb_frags;
> > +
> > +		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
> > +		if (unlikely(!buf)) {
> > +			pr_debug("%s: rx error: %d buffers out of %d missing\n",
> > +				 dev->name, num_buf, num);
> > +			dev->stats.rx_length_errors++;
> > +			goto err_buf;
> > +		}
> > +
> > +		stats->bytes += len;
> > +		page = virt_to_head_page(buf);
> > +
> > +		truesize = mergeable_ctx_to_truesize(ctx);
> > +		if (unlikely(len > truesize)) {
> > +			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
> > +				 dev->name, len, (unsigned long)ctx);
> > +			dev->stats.rx_length_errors++;
> > +			goto err_skb;
> > +		}
> > +
> > +		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> > +		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
> > +			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
> > +
> > +			if (unlikely(!nskb))
> > +				goto err_skb;
> > +			if (curr_skb == head_skb)
> > +				skb_shinfo(curr_skb)->frag_list = nskb;
> > +			else
> > +				curr_skb->next = nskb;
> > +			curr_skb = nskb;
> > +			head_skb->truesize += nskb->truesize;
> > +			num_skb_frags = 0;
> > +		}
> > +		if (curr_skb != head_skb) {
> > +			head_skb->data_len += len;
> > +			head_skb->len += len;
> > +			head_skb->truesize += truesize;
> > +		}
> > +		offset = buf - page_address(page);
> > +		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
> > +			put_page(page);
> > +			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> > +					     len, truesize);
> > +		} else {
> > +			skb_add_rx_frag(curr_skb, num_skb_frags, page,
> > +					offset, len, truesize);
> > +		}
> > +	}
> > +
> > +	return head_skb;
> > +
> > +err_skb:
> > +	put_page(page);
> > +	merge_drop_follow_bufs(dev, rq, num_buf, stats);
> > +err_buf:
> > +	stats->drops++;
> > +	dev_kfree_skb(head_skb);
> > +	return NULL;
> > +}
> > +
> >   static struct sk_buff *receive_small(struct net_device *dev,
> >   				     struct virtnet_info *vi,
> >   				     struct receive_queue *rq,
> > @@ -909,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >   	u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
> >   	struct page *page = virt_to_head_page(buf);
> >   	int offset = buf - page_address(page);
> > -	struct sk_buff *head_skb, *curr_skb;
> > +	struct sk_buff *head_skb;
> >   	struct bpf_prog *xdp_prog;
> >   	unsigned int truesize = mergeable_ctx_to_truesize(ctx);
> >   	unsigned int headroom = mergeable_ctx_to_headroom(ctx);
> > @@ -1054,65 +1157,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >
> >   	head_skb = page_to_skb(vi, rq, page, offset, len, truesize, !xdp_prog,
> >   			       metasize, !!headroom);
> > -	curr_skb = head_skb;
> > -
> > -	if (unlikely(!curr_skb))
> > +	if (unlikely(!head_skb))
> >   		goto err_skb;
> > -	while (--num_buf) {
> > -		int num_skb_frags;
> >
> > -		buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
> > -		if (unlikely(!buf)) {
> > -			pr_debug("%s: rx error: %d buffers out of %d missing\n",
> > -				 dev->name, num_buf,
> > -				 virtio16_to_cpu(vi->vdev,
> > -						 hdr->num_buffers));
> > -			dev->stats.rx_length_errors++;
> > -			goto err_buf;
> > -		}
> > -
> > -		stats->bytes += len;
> > -		page = virt_to_head_page(buf);
> > -
> > -		truesize = mergeable_ctx_to_truesize(ctx);
> > -		if (unlikely(len > truesize)) {
> > -			pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
> > -				 dev->name, len, (unsigned long)ctx);
> > -			dev->stats.rx_length_errors++;
> > -			goto err_skb;
> > -		}
> > -
> > -		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> > -		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
> > -			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
> > -
> > -			if (unlikely(!nskb))
> > -				goto err_skb;
> > -			if (curr_skb == head_skb)
> > -				skb_shinfo(curr_skb)->frag_list = nskb;
> > -			else
> > -				curr_skb->next = nskb;
> > -			curr_skb = nskb;
> > -			head_skb->truesize += nskb->truesize;
> > -			num_skb_frags = 0;
> > -		}
> > -		if (curr_skb != head_skb) {
> > -			head_skb->data_len += len;
> > -			head_skb->len += len;
> > -			head_skb->truesize += truesize;
> > -		}
> > -		offset = buf - page_address(page);
> > -		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
> > -			put_page(page);
> > -			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> > -					     len, truesize);
> > -		} else {
> > -			skb_add_rx_frag(curr_skb, num_skb_frags, page,
> > -					offset, len, truesize);
> > -		}
> > -	}
> > +	if (num_buf > 1)
> > +		head_skb = merge_receive_follow_bufs(dev, vi, rq, head_skb,
> > +						     num_buf, stats);
> > +	if (head_skb)
> > +		ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
> >
> > -	ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
> >   	return head_skb;
> >
> >   err_xdp:
> > @@ -1120,19 +1173,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >   	stats->xdp_drops++;
> >   err_skb:
> >   	put_page(page);
> > -	while (num_buf-- > 1) {
> > -		buf = virtqueue_get_buf(rq->vq, &len);
> > -		if (unlikely(!buf)) {
> > -			pr_debug("%s: rx error: %d buffers missing\n",
> > -				 dev->name, num_buf);
> > -			dev->stats.rx_length_errors++;
> > -			break;
> > -		}
> > -		stats->bytes += len;
> > -		page = virt_to_head_page(buf);
> > -		put_page(page);
> > -	}
> > -err_buf:
> > +	merge_drop_follow_bufs(dev, rq, num_buf, stats);
> >   	stats->drops++;
> >   	dev_kfree_skb(head_skb);
> >   xdp_xmit:
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-16  9:26     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  9:26 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> AF_XDP(xdp socket, xsk) is a high-performance packet receiving and
> sending technology.
>
> This patch implements the binding and unbinding operations of xsk and
> the virtio-net queue for xsk zero copy xmit.
>
> The xsk zero copy xmit depends on tx napi. Because the actual sending
> of data is done in the process of tx napi. If tx napi does not
> work, then the data of the xsk tx queue will not be sent.
> So if tx napi is not true, an error will be reported when bind xsk.
>
> If xsk is active, it will prevent ethtool from modifying tx napi.
>
> When reclaiming ptr, a new type of ptr is added, which is distinguished
> based on the last two digits of ptr:
> 00: skb
> 01: xdp frame
> 10: xsk xmit ptr
>
> All sent xsk packets share the virtio-net header of xsk_hdr. If xsk
> needs to support csum and other functions later, consider assigning xsk
> hdr separately for each sent packet.
>
> Different from other physical network cards, you can reinitialize the
> channel when you bind xsk. And vrtio does not support independent reset
> channel, you can only reset the entire device. I think it is not
> appropriate for us to directly reset the entire setting. So the
> situation becomes a bit more complicated. We have to consider how
> to deal with the buffer referenced in vq after xsk is unbind.
>
> I added the ring size struct virtnet_xsk_ctx when xsk been bind. Each xsk
> buffer added to vq corresponds to a ctx. This ctx is used to record the
> page where the xsk buffer is located, and add a page reference. When the
> buffer is recycling, reduce the reference to page. When xsk has been
> unbind, and all related xsk buffers have been recycled, release all ctx.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio/Makefile     |   1 +
>   drivers/net/virtio/virtio_net.c |  20 +-
>   drivers/net/virtio/virtio_net.h |  37 +++-
>   drivers/net/virtio/xsk.c        | 346 ++++++++++++++++++++++++++++++++
>   drivers/net/virtio/xsk.h        |  99 +++++++++
>   5 files changed, 497 insertions(+), 6 deletions(-)
>   create mode 100644 drivers/net/virtio/xsk.c
>   create mode 100644 drivers/net/virtio/xsk.h
>
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index ccc80f40f33a..db79d2e7925f 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -4,3 +4,4 @@
>   #
>   
>   obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> +obj-$(CONFIG_VIRTIO_NET) += xsk.o
> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> index 395ec1f18331..40d7751f1c5f 100644
> --- a/drivers/net/virtio/virtio_net.c
> +++ b/drivers/net/virtio/virtio_net.c
> @@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   
>   	txq = netdev_get_tx_queue(vi->dev, index);
>   	__netif_tx_lock(txq, raw_smp_processor_id());
> +	work_done += virtnet_poll_xsk(sq, budget);
>   	free_old_xmit(sq, true);
>   	__netif_tx_unlock(txq);
>   
> @@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev,
>   	if (napi_weight ^ vi->sq[0].napi.weight) {
>   		if (dev->flags & IFF_UP)
>   			return -EBUSY;
> -		for (i = 0; i < vi->max_queue_pairs; i++)
> +
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			/* xsk xmit depend on the tx napi. So if xsk is active,
> +			 * prevent modifications to tx napi.
> +			 */
> +			if (rtnl_dereference(vi->sq[i].xsk.pool))
> +				continue;


So this can result in tx NAPI being used by some queues but not the
others. I think such inconsistency breaks the semantics of
set_coalesce(), which assumes the operation is done at the device level,
not on specific queues.

How about just failing here?
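
Something like this (untested sketch, reusing only names that already
exist in this patch) would keep all the queues consistent:

	for (i = 0; i < vi->max_queue_pairs; i++)
		if (rtnl_dereference(vi->sq[i].xsk.pool))
			return -EBUSY;

	for (i = 0; i < vi->max_queue_pairs; i++)
		vi->sq[i].napi.weight = napi_weight;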


> +
>   			vi->sq[i].napi.weight = napi_weight;
> +		}
>   	}
>   
>   	return 0;
> @@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   	switch (xdp->command) {
>   	case XDP_SETUP_PROG:
>   		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
> +	case XDP_SETUP_XSK_POOL:
> +		return virtnet_xsk_pool_setup(dev, xdp);
>   	default:
>   		return -EINVAL;
>   	}
> @@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = {
>   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
>   	.ndo_bpf		= virtnet_xdp,
>   	.ndo_xdp_xmit		= virtnet_xdp_xmit,
> +	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
>   	.ndo_features_check	= passthru_features_check,
>   	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
>   	.ndo_set_features	= virtnet_set_features,
> @@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi)
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		struct virtqueue *vq = vi->sq[i].vq;
>   		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (!is_xdp_frame(buf))
> +			if (is_skb_ptr(buf))
>   				dev_kfree_skb(buf);
> -			else
> +			else if (is_xdp_frame(buf))
>   				xdp_return_frame(ptr_to_xdp(buf));
> +			else
> +				virtnet_xsk_ctx_tx_put(ptr_to_xsk(buf));
>   		}
>   	}
>   
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> index 931cc81f92fb..e3da829887dc 100644
> --- a/drivers/net/virtio/virtio_net.h
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -135,6 +135,16 @@ struct send_queue {
>   	struct virtnet_sq_stats stats;
>   
>   	struct napi_struct napi;
> +
> +	struct {
> +		struct xsk_buff_pool __rcu *pool;
> +
> +		/* xsk wait for tx inter or softirq */
> +		bool need_wakeup;
> +
> +		/* ctx used to record the page added to vq */
> +		struct virtnet_xsk_ctx_head *ctx_head;
> +	} xsk;
>   };
>   
>   /* Internal representation of a receive virtqueue */
> @@ -188,6 +198,13 @@ static inline void virtqueue_napi_schedule(struct napi_struct *napi,
>   	}
>   }
>   
> +#include "xsk.h"
> +
> +static inline bool is_skb_ptr(void *ptr)
> +{
> +	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
> +}
> +
>   static inline bool is_xdp_frame(void *ptr)
>   {
>   	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> @@ -206,25 +223,39 @@ static inline struct xdp_frame *ptr_to_xdp(void *ptr)
>   static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
>   				   struct virtnet_sq_stats *stats)
>   {
> +	unsigned int xsknum = 0;
>   	unsigned int len;
>   	void *ptr;
>   
>   	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (!is_xdp_frame(ptr)) {
> +		if (is_skb_ptr(ptr)) {
>   			struct sk_buff *skb = ptr;
>   
>   			pr_debug("Sent skb %p\n", skb);
>   
>   			stats->bytes += skb->len;
>   			napi_consume_skb(skb, in_napi);
> -		} else {
> +		} else if (is_xdp_frame(ptr)) {
>   			struct xdp_frame *frame = ptr_to_xdp(ptr);
>   
>   			stats->bytes += frame->len;
>   			xdp_return_frame(frame);
> +		} else {
> +			struct virtnet_xsk_ctx_tx *ctx;
> +
> +			ctx = ptr_to_xsk(ptr);
> +
> +			/* Maybe this ptr was sent by the last xsk. */
> +			if (ctx->ctx.head->active)
> +				++xsknum;
> +
> +			stats->bytes += ctx->len;
> +			virtnet_xsk_ctx_tx_put(ctx);
>   		}
>   		stats->packets++;
>   	}
> -}
>   
> +	if (xsknum)
> +		virtnet_xsk_complete(sq, xsknum);
> +}
>   #endif
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> new file mode 100644
> index 000000000000..f98b68576709
> --- /dev/null
> +++ b/drivers/net/virtio/xsk.c
> @@ -0,0 +1,346 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * virtio-net xsk
> + */
> +
> +#include "virtio_net.h"
> +
> +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> +
> +static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> +{
> +	struct virtnet_xsk_ctx *ctx;
> +
> +	ctx = head->ctx;
> +	head->ctx = ctx->next;
> +
> +	++head->ref;
> +
> +	return ctx;
> +}
> +
> +#define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
> +
> +static void virtnet_xsk_check_queue(struct send_queue *sq)
> +{
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct net_device *dev = vi->dev;
> +	int qnum = sq - vi->sq;
> +
> +	/* If it is a raw buffer queue, it does not check whether the status
> +	 * of the queue is stopped when sending. So there is no need to check
> +	 * the situation of the raw buffer queue.
> +	 */
> +	if (is_xdp_raw_buffer_queue(vi, qnum))
> +		return;
> +
> +	/* If this sq is not the exclusive queue of the current cpu,
> +	 * then it may be called by start_xmit, so check it running out
> +	 * of space.
> +	 *
> +	 * Stop the queue to avoid getting packets that we are
> +	 * then unable to transmit. Then wait the tx interrupt.
> +	 */
> +	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> +		netif_stop_subqueue(dev, qnum);
> +}
> +
> +void virtnet_xsk_complete(struct send_queue *sq, u32 num)
> +{
> +	struct xsk_buff_pool *pool;
> +
> +	rcu_read_lock();
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (!pool) {
> +		rcu_read_unlock();
> +		return;
> +	}
> +	xsk_tx_completed(pool, num);
> +	rcu_read_unlock();
> +
> +	if (sq->xsk.need_wakeup) {
> +		sq->xsk.need_wakeup = false;
> +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> +	}
> +}
> +
> +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			    struct xdp_desc *desc)
> +{
> +	struct virtnet_xsk_ctx_tx *ctx;
> +	struct virtnet_info *vi;
> +	u32 offset, n, len;
> +	struct page *page;
> +	void *data;
> +
> +	vi = sq->vq->vdev->priv;
> +
> +	data = xsk_buff_raw_get_data(pool, desc->addr);
> +	offset = offset_in_page(data);
> +
> +	ctx = virtnet_xsk_ctx_tx_get(sq->xsk.ctx_head);
> +
> +	/* xsk unaligned mode, desc may use two pages */
> +	if (desc->len > PAGE_SIZE - offset)
> +		n = 3;
> +	else
> +		n = 2;
> +
> +	sg_init_table(sq->sg, n);
> +	sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
> +
> +	/* handle for xsk first page */
> +	len = min_t(int, desc->len, PAGE_SIZE - offset);
> +	page = vmalloc_to_page(data);
> +	sg_set_page(sq->sg + 1, page, len, offset);
> +
> +	/* ctx is used to record and reference this page to prevent xsk from
> +	 * being released before this xmit is recycled
> +	 */


I'm a little bit surprised that this is done manually per device instead
of doing it in the xsk core.


> +	ctx->ctx.page = page;
> +	get_page(page);
> +
> +	/* xsk unaligned mode, handle for the second page */
> +	if (len < desc->len) {
> +		page = vmalloc_to_page(data + len);
> +		len = min_t(int, desc->len - len, PAGE_SIZE);
> +		sg_set_page(sq->sg + 2, page, len, 0);
> +
> +		ctx->ctx.page_unaligned = page;
> +		get_page(page);
> +	} else {
> +		ctx->ctx.page_unaligned = NULL;
> +	}
> +
> +	return virtqueue_add_outbuf(sq->vq, sq->sg, n,
> +				   xsk_to_ptr(ctx), GFP_ATOMIC);
> +}
> +
> +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> +				  struct xsk_buff_pool *pool,
> +				  unsigned int budget,
> +				  bool in_napi, int *done,
> +				  struct virtnet_sq_stats *stats)
> +{
> +	struct xdp_desc desc;
> +	int err, packet = 0;
> +	int ret = -EAGAIN;
> +
> +	while (budget-- > 0) {
> +		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {


AF_XDP doesn't use skbs, so I don't see why MAX_SKB_FRAGS is used.

Looking at virtnet_xsk_xmit(), it looks to me like 3 is more suitable
here. Or can the AF_XDP core handle a full queue gracefully, so that we
don't even need to worry about this?
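
E.g. something along these lines (untested sketch), reserving only what
virtnet_xsk_xmit() can actually add for one descriptor:

		/* one slot for the virtio header, plus up to two pages
		 * for an unaligned xsk desc crossing a page boundary
		 */
		if (sq->vq->num_free < 3) {
			ret = -EBUSY;
			break;
		}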


> +			ret = -EBUSY;


-ENOSPC looks better.


> +			break;
> +		}
> +
> +		if (!xsk_tx_peek_desc(pool, &desc)) {
> +			/* done */
> +			ret = 0;
> +			break;
> +		}
> +
> +		err = virtnet_xsk_xmit(sq, pool, &desc);
> +		if (unlikely(err)) {


If we always reserve sufficient slots, this should be an unexpected
error. Do we need to log it the way start_xmit() does?

         /* This should not happen! */
         if (unlikely(err)) {
                 dev->stats.tx_fifo_errors++;
                 if (net_ratelimit())
                         dev_warn(&dev->dev,
                                  "Unexpected TXQ (%d) queue failure: %d\n",
                                  qnum, err);
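
i.e. roughly (untested sketch; the device pointer is not currently
available in virtnet_xsk_xmit_batch(), so this assumes it is looked up
via sq->vq->vdev->priv as done elsewhere in this patch):

		err = virtnet_xsk_xmit(sq, pool, &desc);
		if (unlikely(err)) {
			struct virtnet_info *vi = sq->vq->vdev->priv;

			vi->dev->stats.tx_fifo_errors++;
			if (net_ratelimit())
				netdev_warn(vi->dev,
					    "Unexpected xsk TXQ failure: %d\n",
					    err);
			ret = -EBUSY;
			break;
		}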


> +			ret = -EBUSY;
> +			break;
> +		}
> +
> +		++packet;
> +	}
> +
> +	if (packet) {
> +		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> +			++stats->kicks;
> +
> +		*done += packet;
> +		stats->xdp_tx += packet;
> +
> +		xsk_tx_release(pool);
> +	}
> +
> +	return ret;
> +}
> +
> +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			   int budget, bool in_napi)
> +{
> +	struct virtnet_sq_stats stats = {};
> +	int done = 0;
> +	int err;
> +
> +	sq->xsk.need_wakeup = false;
> +	__free_old_xmit(sq, in_napi, &stats);
> +
> +	/* return err:
> +	 * -EAGAIN: done == budget
> +	 * -EBUSY:  done < budget
> +	 *  0    :  done < budget
> +	 */


It's better to move the comment to the implementation of 
virtnet_xsk_xmit_batch().


> +xmit:
> +	err = virtnet_xsk_xmit_batch(sq, pool, budget - done, in_napi,
> +				     &done, &stats);
> +	if (err == -EBUSY) {
> +		__free_old_xmit(sq, in_napi, &stats);
> +
> +		/* If the space is enough, let napi run again. */
> +		if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)


The comment does not match the code.


> +			goto xmit;
> +		else
> +			sq->xsk.need_wakeup = true;
> +	}
> +
> +	virtnet_xsk_check_queue(sq);
> +
> +	u64_stats_update_begin(&sq->stats.syncp);
> +	sq->stats.packets += stats.packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.kicks += stats.kicks;
> +	sq->stats.xdp_tx += stats.xdp_tx;
> +	u64_stats_update_end(&sq->stats.syncp);
> +
> +	return done;
> +}
> +
> +int virtnet_poll_xsk(struct send_queue *sq, int budget)
> +{
> +	struct xsk_buff_pool *pool;
> +	int work_done = 0;
> +
> +	rcu_read_lock();
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (pool)
> +		work_done = virtnet_xsk_run(sq, pool, budget, true);
> +	rcu_read_unlock();
> +	return work_done;
> +}
> +
> +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct xsk_buff_pool *pool;
> +	struct send_queue *sq;
> +
> +	if (!netif_running(dev))
> +		return -ENETDOWN;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;


I wonder how we can hit this check. Note that we prevent the user from 
modifying queue pairs when XDP is enabled:

         /* For now we don't support modifying channels while XDP is loaded
          * also when XDP is loaded all RX queues have XDP programs so 
we only
          * need to check a single RX queue.
          */
         if (vi->rq[0].xdp_prog)
                 return -EINVAL;

> +
> +	sq = &vi->sq[qid];
> +
> +	rcu_read_lock();


Can we simply use rcu_read_lock_bh() here?
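
I.e. something like this (untested sketch), which would also drop the
explicit local_bh_disable()/local_bh_enable() pair:

	rcu_read_lock_bh();
	pool = rcu_dereference_bh(sq->xsk.pool);
	if (pool)
		virtqueue_napi_schedule(&sq->napi, sq->vq);
	rcu_read_unlock_bh();
	return 0;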


> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (pool) {
> +		local_bh_disable();
> +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> +		local_bh_enable();
> +	}
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
> +static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
> +							  struct virtqueue *vq)
> +{
> +	struct virtnet_xsk_ctx_head *head;
> +	u32 size, n, ring_size, ctx_sz;
> +	struct virtnet_xsk_ctx *ctx;
> +	void *p;
> +
> +	ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
> +
> +	ring_size = virtqueue_get_vring_size(vq);
> +	size = sizeof(*head) + ctx_sz * ring_size;
> +
> +	head = kmalloc(size, GFP_ATOMIC);
> +	if (!head)
> +		return NULL;
> +
> +	memset(head, 0, sizeof(*head));
> +
> +	head->active = true;
> +	head->frame_size = xsk_pool_get_rx_frame_size(pool);
> +
> +	p = head + 1;
> +	for (n = 0; n < ring_size; ++n) {
> +		ctx = p;
> +		ctx->head = head;
> +		ctx->next = head->ctx;
> +		head->ctx = ctx;
> +
> +		p += ctx_sz;
> +	}
> +
> +	return head;
> +}
> +
> +static int virtnet_xsk_pool_enable(struct net_device *dev,
> +				   struct xsk_buff_pool *pool,
> +				   u16 qid)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct send_queue *sq;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	/* xsk zerocopy depend on the tx napi.
> +	 *
> +	 * All data is actually consumed and sent out from the xsk tx queue
> +	 * under the tx napi mechanism.
> +	 */
> +	if (!sq->napi.weight)
> +		return -EPERM;
> +
> +	memset(&sq->xsk, 0, sizeof(sq->xsk));
> +
> +	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
> +	if (!sq->xsk.ctx_head)
> +		return -ENOMEM;
> +
> +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +	 * safe.
> +	 */
> +	rcu_assign_pointer(sq->xsk.pool, pool);
> +
> +	return 0;
> +}
> +
> +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct send_queue *sq;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +	 * safe.
> +	 */
> +	rcu_assign_pointer(sq->xsk.pool, NULL);
> +
> +	/* Sync with the XSK wakeup and with NAPI. */
> +	synchronize_net();
> +
> +	if (READ_ONCE(sq->xsk.ctx_head->ref))
> +		WRITE_ONCE(sq->xsk.ctx_head->active, false);
> +	else
> +		kfree(sq->xsk.ctx_head);
> +
> +	sq->xsk.ctx_head = NULL;
> +
> +	return 0;
> +}
> +
> +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
> +{
> +	if (xdp->xsk.pool)
> +		return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
> +					       xdp->xsk.queue_id);
> +	else
> +		return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
> +}
> +
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> new file mode 100644
> index 000000000000..54948e0b07fc
> --- /dev/null
> +++ b/drivers/net/virtio/xsk.h
> @@ -0,0 +1,99 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +#ifndef __XSK_H__
> +#define __XSK_H__
> +
> +#define VIRTIO_XSK_FLAG	BIT(1)
> +
> +/* When xsk disable, under normal circumstances, the network card must reclaim
> + * all the memory that has been sent and the memory added to the rq queue by
> + * destroying the queue.
> + *
> + * But virtio's queue does not support separate setting to been disable.


This is a call for us to implement per-queue enable/disable. Virtio-mmio
has such a facility, but virtio-pci only allows disabling a queue (not
enabling it).


> "Reset"
> + * is not very suitable.
> + *
> + * The method here is that each sent chunk or chunk added to the rq queue is
> + * described by an independent structure struct virtnet_xsk_ctx.
> + *
> + * We will use get_page(page) to refer to the page where these chunks are
> + * located. And these pages will be recorded in struct virtnet_xsk_ctx. So these
> + * chunks in vq are safe. When recycling, put the these page.
> + *
> + * These structures point to struct virtnet_xsk_ctx_head, and ref records how
> + * many chunks have not been reclaimed. If active == 0, it means that xsk has
> + * been disabled.
> + *
> + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
> + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
> + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
> + * ref is 0.


This looks complicated and it will increase the footprint. Considering
the performance penalty and the complexity, I would suggest using reset
instead.

Then we don't need to introduce such context.

Thanks


> + */
> +struct virtnet_xsk_ctx;
> +struct virtnet_xsk_ctx_head {
> +	struct virtnet_xsk_ctx *ctx;
> +
> +	/* how many ctx has been add to vq */
> +	u64 ref;
> +
> +	unsigned int frame_size;
> +
> +	/* the xsk status */
> +	bool active;
> +};
> +
> +struct virtnet_xsk_ctx {
> +	struct virtnet_xsk_ctx_head *head;
> +	struct virtnet_xsk_ctx *next;
> +
> +	struct page *page;
> +
> +	/* xsk unaligned mode will use two page in one desc */
> +	struct page *page_unaligned;
> +};
> +
> +struct virtnet_xsk_ctx_tx {
> +	/* this *MUST* be the first */
> +	struct virtnet_xsk_ctx ctx;
> +
> +	/* xsk tx xmit use this record the len of packet */
> +	u32 len;
> +};
> +
> +static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
> +{
> +	return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
> +}
> +
> +static inline struct virtnet_xsk_ctx_tx *ptr_to_xsk(void *ptr)
> +{
> +	unsigned long p;
> +
> +	p = (unsigned long)ptr;
> +	return (struct virtnet_xsk_ctx_tx *)(p & ~VIRTIO_XSK_FLAG);
> +}
> +
> +static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
> +{
> +	put_page(ctx->page);
> +	if (ctx->page_unaligned)
> +		put_page(ctx->page_unaligned);
> +
> +	--ctx->head->ref;
> +
> +	if (ctx->head->active) {
> +		ctx->next = ctx->head->ctx;
> +		ctx->head->ctx = ctx;
> +	} else {
> +		if (!ctx->head->ref)
> +			kfree(ctx->head);
> +	}
> +}
> +
> +#define virtnet_xsk_ctx_tx_put(ctx) \
> +	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
> +
> +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> +int virtnet_poll_xsk(struct send_queue *sq, int budget);
> +void virtnet_xsk_complete(struct send_queue *sq, u32 num);
> +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> +#endif


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
@ 2021-06-16  9:26     ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  9:26 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> AF_XDP(xdp socket, xsk) is a high-performance packet receiving and
> sending technology.
>
> This patch implements the binding and unbinding operations of xsk and
> the virtio-net queue for xsk zero copy xmit.
>
> The xsk zero copy xmit depends on tx napi. Because the actual sending
> of data is done in the process of tx napi. If tx napi does not
> work, then the data of the xsk tx queue will not be sent.
> So if tx napi is not true, an error will be reported when bind xsk.
>
> If xsk is active, it will prevent ethtool from modifying tx napi.
>
> When reclaiming ptr, a new type of ptr is added, which is distinguished
> based on the last two digits of ptr:
> 00: skb
> 01: xdp frame
> 10: xsk xmit ptr
>
> All sent xsk packets share the virtio-net header of xsk_hdr. If xsk
> needs to support csum and other functions later, consider assigning xsk
> hdr separately for each sent packet.
>
> Different from other physical network cards, you can reinitialize the
> channel when you bind xsk. And vrtio does not support independent reset
> channel, you can only reset the entire device. I think it is not
> appropriate for us to directly reset the entire setting. So the
> situation becomes a bit more complicated. We have to consider how
> to deal with the buffer referenced in vq after xsk is unbind.
>
> I added the ring size struct virtnet_xsk_ctx when xsk been bind. Each xsk
> buffer added to vq corresponds to a ctx. This ctx is used to record the
> page where the xsk buffer is located, and add a page reference. When the
> buffer is recycling, reduce the reference to page. When xsk has been
> unbind, and all related xsk buffers have been recycled, release all ctx.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio/Makefile     |   1 +
>   drivers/net/virtio/virtio_net.c |  20 +-
>   drivers/net/virtio/virtio_net.h |  37 +++-
>   drivers/net/virtio/xsk.c        | 346 ++++++++++++++++++++++++++++++++
>   drivers/net/virtio/xsk.h        |  99 +++++++++
>   5 files changed, 497 insertions(+), 6 deletions(-)
>   create mode 100644 drivers/net/virtio/xsk.c
>   create mode 100644 drivers/net/virtio/xsk.h
>
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index ccc80f40f33a..db79d2e7925f 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -4,3 +4,4 @@
>   #
>   
>   obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> +obj-$(CONFIG_VIRTIO_NET) += xsk.o
> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> index 395ec1f18331..40d7751f1c5f 100644
> --- a/drivers/net/virtio/virtio_net.c
> +++ b/drivers/net/virtio/virtio_net.c
> @@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   
>   	txq = netdev_get_tx_queue(vi->dev, index);
>   	__netif_tx_lock(txq, raw_smp_processor_id());
> +	work_done += virtnet_poll_xsk(sq, budget);
>   	free_old_xmit(sq, true);
>   	__netif_tx_unlock(txq);
>   
> @@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev,
>   	if (napi_weight ^ vi->sq[0].napi.weight) {
>   		if (dev->flags & IFF_UP)
>   			return -EBUSY;
> -		for (i = 0; i < vi->max_queue_pairs; i++)
> +
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			/* xsk xmit depend on the tx napi. So if xsk is active,
> +			 * prevent modifications to tx napi.
> +			 */
> +			if (rtnl_dereference(vi->sq[i].xsk.pool))
> +				continue;


So this can result in tx NAPI being used by some queues but not the
others. I think such inconsistency breaks the semantics of
set_coalesce(), which assumes the operation is done at the device level,
not on specific queues.

How about just failing here?


> +
>   			vi->sq[i].napi.weight = napi_weight;
> +		}
>   	}
>   
>   	return 0;
> @@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   	switch (xdp->command) {
>   	case XDP_SETUP_PROG:
>   		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
> +	case XDP_SETUP_XSK_POOL:
> +		return virtnet_xsk_pool_setup(dev, xdp);
>   	default:
>   		return -EINVAL;
>   	}
> @@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = {
>   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
>   	.ndo_bpf		= virtnet_xdp,
>   	.ndo_xdp_xmit		= virtnet_xdp_xmit,
> +	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
>   	.ndo_features_check	= passthru_features_check,
>   	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
>   	.ndo_set_features	= virtnet_set_features,
> @@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi)
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		struct virtqueue *vq = vi->sq[i].vq;
>   		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (!is_xdp_frame(buf))
> +			if (is_skb_ptr(buf))
>   				dev_kfree_skb(buf);
> -			else
> +			else if (is_xdp_frame(buf))
>   				xdp_return_frame(ptr_to_xdp(buf));
> +			else
> +				virtnet_xsk_ctx_tx_put(ptr_to_xsk(buf));
>   		}
>   	}
>   
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> index 931cc81f92fb..e3da829887dc 100644
> --- a/drivers/net/virtio/virtio_net.h
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -135,6 +135,16 @@ struct send_queue {
>   	struct virtnet_sq_stats stats;
>   
>   	struct napi_struct napi;
> +
> +	struct {
> +		struct xsk_buff_pool __rcu *pool;
> +
> +		/* xsk wait for tx inter or softirq */
> +		bool need_wakeup;
> +
> +		/* ctx used to record the page added to vq */
> +		struct virtnet_xsk_ctx_head *ctx_head;
> +	} xsk;
>   };
>   
>   /* Internal representation of a receive virtqueue */
> @@ -188,6 +198,13 @@ static inline void virtqueue_napi_schedule(struct napi_struct *napi,
>   	}
>   }
>   
> +#include "xsk.h"
> +
> +static inline bool is_skb_ptr(void *ptr)
> +{
> +	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
> +}
> +
>   static inline bool is_xdp_frame(void *ptr)
>   {
>   	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> @@ -206,25 +223,39 @@ static inline struct xdp_frame *ptr_to_xdp(void *ptr)
>   static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
>   				   struct virtnet_sq_stats *stats)
>   {
> +	unsigned int xsknum = 0;
>   	unsigned int len;
>   	void *ptr;
>   
>   	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (!is_xdp_frame(ptr)) {
> +		if (is_skb_ptr(ptr)) {
>   			struct sk_buff *skb = ptr;
>   
>   			pr_debug("Sent skb %p\n", skb);
>   
>   			stats->bytes += skb->len;
>   			napi_consume_skb(skb, in_napi);
> -		} else {
> +		} else if (is_xdp_frame(ptr)) {
>   			struct xdp_frame *frame = ptr_to_xdp(ptr);
>   
>   			stats->bytes += frame->len;
>   			xdp_return_frame(frame);
> +		} else {
> +			struct virtnet_xsk_ctx_tx *ctx;
> +
> +			ctx = ptr_to_xsk(ptr);
> +
> +			/* Maybe this ptr was sent by the last xsk. */
> +			if (ctx->ctx.head->active)
> +				++xsknum;
> +
> +			stats->bytes += ctx->len;
> +			virtnet_xsk_ctx_tx_put(ctx);
>   		}
>   		stats->packets++;
>   	}
> -}
>   
> +	if (xsknum)
> +		virtnet_xsk_complete(sq, xsknum);
> +}
>   #endif
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> new file mode 100644
> index 000000000000..f98b68576709
> --- /dev/null
> +++ b/drivers/net/virtio/xsk.c
> @@ -0,0 +1,346 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * virtio-net xsk
> + */
> +
> +#include "virtio_net.h"
> +
> +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> +
> +static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> +{
> +	struct virtnet_xsk_ctx *ctx;
> +
> +	ctx = head->ctx;
> +	head->ctx = ctx->next;
> +
> +	++head->ref;
> +
> +	return ctx;
> +}
> +
> +#define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
> +
> +static void virtnet_xsk_check_queue(struct send_queue *sq)
> +{
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct net_device *dev = vi->dev;
> +	int qnum = sq - vi->sq;
> +
> +	/* If it is a raw buffer queue, it does not check whether the status
> +	 * of the queue is stopped when sending. So there is no need to check
> +	 * the situation of the raw buffer queue.
> +	 */
> +	if (is_xdp_raw_buffer_queue(vi, qnum))
> +		return;
> +
> +	/* If this sq is not the exclusive queue of the current cpu,
> +	 * then it may be called by start_xmit, so check it running out
> +	 * of space.
> +	 *
> +	 * Stop the queue to avoid getting packets that we are
> +	 * then unable to transmit. Then wait the tx interrupt.
> +	 */
> +	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> +		netif_stop_subqueue(dev, qnum);
> +}
> +
> +void virtnet_xsk_complete(struct send_queue *sq, u32 num)
> +{
> +	struct xsk_buff_pool *pool;
> +
> +	rcu_read_lock();
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (!pool) {
> +		rcu_read_unlock();
> +		return;
> +	}
> +	xsk_tx_completed(pool, num);
> +	rcu_read_unlock();
> +
> +	if (sq->xsk.need_wakeup) {
> +		sq->xsk.need_wakeup = false;
> +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> +	}
> +}
> +
> +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			    struct xdp_desc *desc)
> +{
> +	struct virtnet_xsk_ctx_tx *ctx;
> +	struct virtnet_info *vi;
> +	u32 offset, n, len;
> +	struct page *page;
> +	void *data;
> +
> +	vi = sq->vq->vdev->priv;
> +
> +	data = xsk_buff_raw_get_data(pool, desc->addr);
> +	offset = offset_in_page(data);
> +
> +	ctx = virtnet_xsk_ctx_tx_get(sq->xsk.ctx_head);
> +
> +	/* xsk unaligned mode, desc may use two pages */
> +	if (desc->len > PAGE_SIZE - offset)
> +		n = 3;
> +	else
> +		n = 2;
> +
> +	sg_init_table(sq->sg, n);
> +	sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
> +
> +	/* handle for xsk first page */
> +	len = min_t(int, desc->len, PAGE_SIZE - offset);
> +	page = vmalloc_to_page(data);
> +	sg_set_page(sq->sg + 1, page, len, offset);
> +
> +	/* ctx is used to record and reference this page to prevent xsk from
> +	 * being released before this xmit is recycled
> +	 */


I'm a little bit surprised that this is done manually per device instead 
of in the xsk core.


> +	ctx->ctx.page = page;
> +	get_page(page);
> +
> +	/* xsk unaligned mode, handle for the second page */
> +	if (len < desc->len) {
> +		page = vmalloc_to_page(data + len);
> +		len = min_t(int, desc->len - len, PAGE_SIZE);
> +		sg_set_page(sq->sg + 2, page, len, 0);
> +
> +		ctx->ctx.page_unaligned = page;
> +		get_page(page);
> +	} else {
> +		ctx->ctx.page_unaligned = NULL;
> +	}
> +
> +	return virtqueue_add_outbuf(sq->vq, sq->sg, n,
> +				   xsk_to_ptr(ctx), GFP_ATOMIC);
> +}
> +
> +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> +				  struct xsk_buff_pool *pool,
> +				  unsigned int budget,
> +				  bool in_napi, int *done,
> +				  struct virtnet_sq_stats *stats)
> +{
> +	struct xdp_desc desc;
> +	int err, packet = 0;
> +	int ret = -EAGAIN;
> +
> +	while (budget-- > 0) {
> +		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {


AF_XDP doesn't use skb, so I don't see why MAX_SKB_FRAGS is used.

Looking at virtnet_xsk_xmit(), it looks to me like 3 is more suitable here. 
Or can the AF_XDP core handle a full queue gracefully, so that we don't even 
need to worry about this?
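
I.e. something like this (just a sketch; VIRTNET_XSK_XMIT_DESC is a made-up
name for the worst case virtnet_xsk_xmit() can consume: header + first data
page + second page in unaligned mode):

        /* Hypothetical constant, not part of this patch. */
        #define VIRTNET_XSK_XMIT_DESC   3

        if (sq->vq->num_free < VIRTNET_XSK_XMIT_DESC) {
                ret = -ENOSPC;
                break;
        }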


> +			ret = -EBUSY;


-ENOSPC looks better.


> +			break;
> +		}
> +
> +		if (!xsk_tx_peek_desc(pool, &desc)) {
> +			/* done */
> +			ret = 0;
> +			break;
> +		}
> +
> +		err = virtnet_xsk_xmit(sq, pool, &desc);
> +		if (unlikely(err)) {


If we always reserve sufficient slots, this should be an unexpected 
error; do we need to log it the way start_xmit() does?

         /* This should not happen! */
         if (unlikely(err)) {
                 dev->stats.tx_fifo_errors++;
                 if (net_ratelimit())
                         dev_warn(&dev->dev,
                                  "Unexpected TXQ (%d) queue failure: %d\n",
                                  qnum, err);


> +			ret = -EBUSY;
> +			break;
> +		}
> +
> +		++packet;
> +	}
> +
> +	if (packet) {
> +		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> +			++stats->kicks;
> +
> +		*done += packet;
> +		stats->xdp_tx += packet;
> +
> +		xsk_tx_release(pool);
> +	}
> +
> +	return ret;
> +}
> +
> +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			   int budget, bool in_napi)
> +{
> +	struct virtnet_sq_stats stats = {};
> +	int done = 0;
> +	int err;
> +
> +	sq->xsk.need_wakeup = false;
> +	__free_old_xmit(sq, in_napi, &stats);
> +
> +	/* return err:
> +	 * -EAGAIN: done == budget
> +	 * -EBUSY:  done < budget
> +	 *  0    :  done < budget
> +	 */


It's better to move the comment to the implementation of 
virtnet_xsk_xmit_batch().


> +xmit:
> +	err = virtnet_xsk_xmit_batch(sq, pool, budget - done, in_napi,
> +				     &done, &stats);
> +	if (err == -EBUSY) {
> +		__free_old_xmit(sq, in_napi, &stats);
> +
> +		/* If the space is enough, let napi run again. */
> +		if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)


The comment does not match the code.


> +			goto xmit;
> +		else
> +			sq->xsk.need_wakeup = true;
> +	}
> +
> +	virtnet_xsk_check_queue(sq);
> +
> +	u64_stats_update_begin(&sq->stats.syncp);
> +	sq->stats.packets += stats.packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.kicks += stats.kicks;
> +	sq->stats.xdp_tx += stats.xdp_tx;
> +	u64_stats_update_end(&sq->stats.syncp);
> +
> +	return done;
> +}
> +
> +int virtnet_poll_xsk(struct send_queue *sq, int budget)
> +{
> +	struct xsk_buff_pool *pool;
> +	int work_done = 0;
> +
> +	rcu_read_lock();
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (pool)
> +		work_done = virtnet_xsk_run(sq, pool, budget, true);
> +	rcu_read_unlock();
> +	return work_done;
> +}
> +
> +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct xsk_buff_pool *pool;
> +	struct send_queue *sq;
> +
> +	if (!netif_running(dev))
> +		return -ENETDOWN;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;


I wonder how we can hit this check. Note that we prevent the user from 
modifying queue pairs when XDP is enabled:

         /* For now we don't support modifying channels while XDP is loaded
          * also when XDP is loaded all RX queues have XDP programs so we only
          * need to check a single RX queue.
          */
         if (vi->rq[0].xdp_prog)
                 return -EINVAL;

> +
> +	sq = &vi->sq[qid];
> +
> +	rcu_read_lock();


Can we simply use rcu_read_lock_bh() here?
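
I.e. (untested sketch):

        rcu_read_lock_bh();
        pool = rcu_dereference_bh(sq->xsk.pool);
        if (pool)
                virtqueue_napi_schedule(&sq->napi, sq->vq);
        rcu_read_unlock_bh();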


> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (pool) {
> +		local_bh_disable();
> +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> +		local_bh_enable();
> +	}
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
> +static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
> +							  struct virtqueue *vq)
> +{
> +	struct virtnet_xsk_ctx_head *head;
> +	u32 size, n, ring_size, ctx_sz;
> +	struct virtnet_xsk_ctx *ctx;
> +	void *p;
> +
> +	ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
> +
> +	ring_size = virtqueue_get_vring_size(vq);
> +	size = sizeof(*head) + ctx_sz * ring_size;
> +
> +	head = kmalloc(size, GFP_ATOMIC);
> +	if (!head)
> +		return NULL;
> +
> +	memset(head, 0, sizeof(*head));
> +
> +	head->active = true;
> +	head->frame_size = xsk_pool_get_rx_frame_size(pool);
> +
> +	p = head + 1;
> +	for (n = 0; n < ring_size; ++n) {
> +		ctx = p;
> +		ctx->head = head;
> +		ctx->next = head->ctx;
> +		head->ctx = ctx;
> +
> +		p += ctx_sz;
> +	}
> +
> +	return head;
> +}
> +
> +static int virtnet_xsk_pool_enable(struct net_device *dev,
> +				   struct xsk_buff_pool *pool,
> +				   u16 qid)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct send_queue *sq;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	/* xsk zerocopy depend on the tx napi.
> +	 *
> +	 * All data is actually consumed and sent out from the xsk tx queue
> +	 * under the tx napi mechanism.
> +	 */
> +	if (!sq->napi.weight)
> +		return -EPERM;
> +
> +	memset(&sq->xsk, 0, sizeof(sq->xsk));
> +
> +	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
> +	if (!sq->xsk.ctx_head)
> +		return -ENOMEM;
> +
> +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +	 * safe.
> +	 */
> +	rcu_assign_pointer(sq->xsk.pool, pool);
> +
> +	return 0;
> +}
> +
> +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct send_queue *sq;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +	 * safe.
> +	 */
> +	rcu_assign_pointer(sq->xsk.pool, NULL);
> +
> +	/* Sync with the XSK wakeup and with NAPI. */
> +	synchronize_net();
> +
> +	if (READ_ONCE(sq->xsk.ctx_head->ref))
> +		WRITE_ONCE(sq->xsk.ctx_head->active, false);
> +	else
> +		kfree(sq->xsk.ctx_head);
> +
> +	sq->xsk.ctx_head = NULL;
> +
> +	return 0;
> +}
> +
> +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
> +{
> +	if (xdp->xsk.pool)
> +		return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
> +					       xdp->xsk.queue_id);
> +	else
> +		return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
> +}
> +
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> new file mode 100644
> index 000000000000..54948e0b07fc
> --- /dev/null
> +++ b/drivers/net/virtio/xsk.h
> @@ -0,0 +1,99 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +#ifndef __XSK_H__
> +#define __XSK_H__
> +
> +#define VIRTIO_XSK_FLAG	BIT(1)
> +
> +/* When xsk disable, under normal circumstances, the network card must reclaim
> + * all the memory that has been sent and the memory added to the rq queue by
> + * destroying the queue.
> + *
> + * But virtio's queue does not support separate setting to been disable.


This is a call for us to implement per-queue enable/disable. Virtio-mmio 
has such a facility, but virtio-pci only allows disabling a queue (not enabling it).


> "Reset"
> + * is not very suitable.
> + *
> + * The method here is that each sent chunk or chunk added to the rq queue is
> + * described by an independent structure struct virtnet_xsk_ctx.
> + *
> + * We will use get_page(page) to refer to the page where these chunks are
> + * located. And these pages will be recorded in struct virtnet_xsk_ctx. So these
> + * chunks in vq are safe. When recycling, put the these page.
> + *
> + * These structures point to struct virtnet_xsk_ctx_head, and ref records how
> + * many chunks have not been reclaimed. If active == 0, it means that xsk has
> + * been disabled.
> + *
> + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
> + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
> + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
> + * ref is 0.


This looks complicated and it will increase the footprint. Considering the 
performance penalty and the complexity, I would suggest using reset 
instead.

Then we don't need to introduce such a context.

Thanks


> + */
> +struct virtnet_xsk_ctx;
> +struct virtnet_xsk_ctx_head {
> +	struct virtnet_xsk_ctx *ctx;
> +
> +	/* how many ctx has been add to vq */
> +	u64 ref;
> +
> +	unsigned int frame_size;
> +
> +	/* the xsk status */
> +	bool active;
> +};
> +
> +struct virtnet_xsk_ctx {
> +	struct virtnet_xsk_ctx_head *head;
> +	struct virtnet_xsk_ctx *next;
> +
> +	struct page *page;
> +
> +	/* xsk unaligned mode will use two page in one desc */
> +	struct page *page_unaligned;
> +};
> +
> +struct virtnet_xsk_ctx_tx {
> +	/* this *MUST* be the first */
> +	struct virtnet_xsk_ctx ctx;
> +
> +	/* xsk tx xmit use this record the len of packet */
> +	u32 len;
> +};
> +
> +static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
> +{
> +	return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
> +}
> +
> +static inline struct virtnet_xsk_ctx_tx *ptr_to_xsk(void *ptr)
> +{
> +	unsigned long p;
> +
> +	p = (unsigned long)ptr;
> +	return (struct virtnet_xsk_ctx_tx *)(p & ~VIRTIO_XSK_FLAG);
> +}
> +
> +static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
> +{
> +	put_page(ctx->page);
> +	if (ctx->page_unaligned)
> +		put_page(ctx->page_unaligned);
> +
> +	--ctx->head->ref;
> +
> +	if (ctx->head->active) {
> +		ctx->next = ctx->head->ctx;
> +		ctx->head->ctx = ctx;
> +	} else {
> +		if (!ctx->head->ref)
> +			kfree(ctx->head);
> +	}
> +}
> +
> +#define virtnet_xsk_ctx_tx_put(ctx) \
> +	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
> +
> +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> +int virtnet_poll_xsk(struct send_queue *sq, int budget);
> +void virtnet_xsk_complete(struct send_queue *sq, u32 num);
> +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> +#endif


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 03/15] virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
  2021-06-10  8:21   ` Xuan Zhuo
@ 2021-06-16  9:27     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16  9:27 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:21 PM, Xuan Zhuo wrote:
> virtio-net not use dma addr directly. So add this priv_flags
> IFF_NOT_USE_DMA_ADDR.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   drivers/net/virtio_net.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0416a7e00914..6c1233f0ab3e 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3064,7 +3064,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>   
>   	/* Set up network device as normal. */
>   	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
> -			   IFF_TX_SKB_NO_LINEAR;
> +			   IFF_TX_SKB_NO_LINEAR | IFF_NOT_USE_DMA_ADDR;


I wonder, instead of doing a trick like this, how about teaching the virtio 
core to accept a DMA address via sg?
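
Just to illustrate the direction I mean (hypothetical sketch, none of this
exists today): if the driver pre-maps the buffer (e.g. with
xsk_buff_raw_get_dma()) and stores the result in the sg entry, the core
could skip its own mapping:

        /* Sketch of a change in virtio_ring.c; the "pre-mapped" handling
         * is invented for illustration only.
         */
        static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
                                           struct scatterlist *sg,
                                           enum dma_data_direction direction)
        {
                if (!vq->use_dma_api)
                        return (dma_addr_t)sg_phys(sg);

                /* Driver already provided a DMA address for this sg. */
                if (sg_dma_address(sg))
                        return sg_dma_address(sg);

                return dma_map_page(vring_dma_dev(vq), sg_page(sg),
                                    sg->offset, sg->length, direction);
        }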

Thanks


>   	dev->netdev_ops = &virtnet_netdev;
>   	dev->features = NETIF_F_HIGHDMA;
>   


^ permalink raw reply	[flat|nested] 80+ messages in thread


* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-16  9:26     ` Jason Wang
  (?)
@ 2021-06-16 10:10     ` Magnus Karlsson
  -1 siblings, 0 replies; 80+ messages in thread
From: Magnus Karlsson @ 2021-06-16 10:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Xuan Zhuo, Network Development, David S. Miller, Jakub Kicinski,
	Michael S. Tsirkin, Björn Töpel, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, dust . li

On Wed, Jun 16, 2021 at 11:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> > AF_XDP(xdp socket, xsk) is a high-performance packet receiving and
> > sending technology.
> >
> > This patch implements the binding and unbinding operations of xsk and
> > the virtio-net queue for xsk zero copy xmit.
> >
> > The xsk zero copy xmit depends on tx napi. Because the actual sending
> > of data is done in the process of tx napi. If tx napi does not
> > work, then the data of the xsk tx queue will not be sent.
> > So if tx napi is not true, an error will be reported when bind xsk.
> >
> > If xsk is active, it will prevent ethtool from modifying tx napi.
> >
> > When reclaiming ptr, a new type of ptr is added, which is distinguished
> > based on the last two digits of ptr:
> > 00: skb
> > 01: xdp frame
> > 10: xsk xmit ptr
> >
> > All sent xsk packets share the virtio-net header of xsk_hdr. If xsk
> > needs to support csum and other functions later, consider assigning xsk
> > hdr separately for each sent packet.
> >
> > Different from other physical network cards, you can reinitialize the
> > channel when you bind xsk. And vrtio does not support independent reset
> > channel, you can only reset the entire device. I think it is not
> > appropriate for us to directly reset the entire setting. So the
> > situation becomes a bit more complicated. We have to consider how
> > to deal with the buffer referenced in vq after xsk is unbind.
> >
> > I added the ring size struct virtnet_xsk_ctx when xsk been bind. Each xsk
> > buffer added to vq corresponds to a ctx. This ctx is used to record the
> > page where the xsk buffer is located, and add a page reference. When the
> > buffer is recycling, reduce the reference to page. When xsk has been
> > unbind, and all related xsk buffers have been recycled, release all ctx.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> >   drivers/net/virtio/Makefile     |   1 +
> >   drivers/net/virtio/virtio_net.c |  20 +-
> >   drivers/net/virtio/virtio_net.h |  37 +++-
> >   drivers/net/virtio/xsk.c        | 346 ++++++++++++++++++++++++++++++++
> >   drivers/net/virtio/xsk.h        |  99 +++++++++
> >   5 files changed, 497 insertions(+), 6 deletions(-)
> >   create mode 100644 drivers/net/virtio/xsk.c
> >   create mode 100644 drivers/net/virtio/xsk.h
> >
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index ccc80f40f33a..db79d2e7925f 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -4,3 +4,4 @@
> >   #
> >
> >   obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> > +obj-$(CONFIG_VIRTIO_NET) += xsk.o
> > diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> > index 395ec1f18331..40d7751f1c5f 100644
> > --- a/drivers/net/virtio/virtio_net.c
> > +++ b/drivers/net/virtio/virtio_net.c
> > @@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> >
> >       txq = netdev_get_tx_queue(vi->dev, index);
> >       __netif_tx_lock(txq, raw_smp_processor_id());
> > +     work_done += virtnet_poll_xsk(sq, budget);
> >       free_old_xmit(sq, true);
> >       __netif_tx_unlock(txq);
> >
> > @@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev,
> >       if (napi_weight ^ vi->sq[0].napi.weight) {
> >               if (dev->flags & IFF_UP)
> >                       return -EBUSY;
> > -             for (i = 0; i < vi->max_queue_pairs; i++)
> > +
> > +             for (i = 0; i < vi->max_queue_pairs; i++) {
> > +                     /* xsk xmit depend on the tx napi. So if xsk is active,
> > +                      * prevent modifications to tx napi.
> > +                      */
> > +                     if (rtnl_dereference(vi->sq[i].xsk.pool))
> > +                             continue;
>
>
> So this can result tx NAPI is used by some queues buy not the others. I
> think such inconsistency breaks the semantic of set_coalesce() which
> assumes the operation is done at the device not some specific queues.
>
> How about just fail here?
>
>
> > +
> >                       vi->sq[i].napi.weight = napi_weight;
> > +             }
> >       }
> >
> >       return 0;
> > @@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >       switch (xdp->command) {
> >       case XDP_SETUP_PROG:
> >               return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
> > +     case XDP_SETUP_XSK_POOL:
> > +             return virtnet_xsk_pool_setup(dev, xdp);
> >       default:
> >               return -EINVAL;
> >       }
> > @@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = {
> >       .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> >       .ndo_bpf                = virtnet_xdp,
> >       .ndo_xdp_xmit           = virtnet_xdp_xmit,
> > +     .ndo_xsk_wakeup         = virtnet_xsk_wakeup,
> >       .ndo_features_check     = passthru_features_check,
> >       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
> >       .ndo_set_features       = virtnet_set_features,
> > @@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi)
> >       for (i = 0; i < vi->max_queue_pairs; i++) {
> >               struct virtqueue *vq = vi->sq[i].vq;
> >               while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> > -                     if (!is_xdp_frame(buf))
> > +                     if (is_skb_ptr(buf))
> >                               dev_kfree_skb(buf);
> > -                     else
> > +                     else if (is_xdp_frame(buf))
> >                               xdp_return_frame(ptr_to_xdp(buf));
> > +                     else
> > +                             virtnet_xsk_ctx_tx_put(ptr_to_xsk(buf));
> >               }
> >       }
> >
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index 931cc81f92fb..e3da829887dc 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -135,6 +135,16 @@ struct send_queue {
> >       struct virtnet_sq_stats stats;
> >
> >       struct napi_struct napi;
> > +
> > +     struct {
> > +             struct xsk_buff_pool __rcu *pool;
> > +
> > +             /* xsk wait for tx inter or softirq */
> > +             bool need_wakeup;
> > +
> > +             /* ctx used to record the page added to vq */
> > +             struct virtnet_xsk_ctx_head *ctx_head;
> > +     } xsk;
> >   };
> >
> >   /* Internal representation of a receive virtqueue */
> > @@ -188,6 +198,13 @@ static inline void virtqueue_napi_schedule(struct napi_struct *napi,
> >       }
> >   }
> >
> > +#include "xsk.h"
> > +
> > +static inline bool is_skb_ptr(void *ptr)
> > +{
> > +     return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
> > +}
> > +
> >   static inline bool is_xdp_frame(void *ptr)
> >   {
> >       return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> > @@ -206,25 +223,39 @@ static inline struct xdp_frame *ptr_to_xdp(void *ptr)
> >   static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
> >                                  struct virtnet_sq_stats *stats)
> >   {
> > +     unsigned int xsknum = 0;
> >       unsigned int len;
> >       void *ptr;
> >
> >       while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > -             if (!is_xdp_frame(ptr)) {
> > +             if (is_skb_ptr(ptr)) {
> >                       struct sk_buff *skb = ptr;
> >
> >                       pr_debug("Sent skb %p\n", skb);
> >
> >                       stats->bytes += skb->len;
> >                       napi_consume_skb(skb, in_napi);
> > -             } else {
> > +             } else if (is_xdp_frame(ptr)) {
> >                       struct xdp_frame *frame = ptr_to_xdp(ptr);
> >
> >                       stats->bytes += frame->len;
> >                       xdp_return_frame(frame);
> > +             } else {
> > +                     struct virtnet_xsk_ctx_tx *ctx;
> > +
> > +                     ctx = ptr_to_xsk(ptr);
> > +
> > +                     /* Maybe this ptr was sent by the last xsk. */
> > +                     if (ctx->ctx.head->active)
> > +                             ++xsknum;
> > +
> > +                     stats->bytes += ctx->len;
> > +                     virtnet_xsk_ctx_tx_put(ctx);
> >               }
> >               stats->packets++;
> >       }
> > -}
> >
> > +     if (xsknum)
> > +             virtnet_xsk_complete(sq, xsknum);
> > +}
> >   #endif
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > new file mode 100644
> > index 000000000000..f98b68576709
> > --- /dev/null
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -0,0 +1,346 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * virtio-net xsk
> > + */
> > +
> > +#include "virtio_net.h"
> > +
> > +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> > +
> > +static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> > +{
> > +     struct virtnet_xsk_ctx *ctx;
> > +
> > +     ctx = head->ctx;
> > +     head->ctx = ctx->next;
> > +
> > +     ++head->ref;
> > +
> > +     return ctx;
> > +}
> > +
> > +#define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
> > +
> > +static void virtnet_xsk_check_queue(struct send_queue *sq)
> > +{
> > +     struct virtnet_info *vi = sq->vq->vdev->priv;
> > +     struct net_device *dev = vi->dev;
> > +     int qnum = sq - vi->sq;
> > +
> > +     /* If it is a raw buffer queue, it does not check whether the status
> > +      * of the queue is stopped when sending. So there is no need to check
> > +      * the situation of the raw buffer queue.
> > +      */
> > +     if (is_xdp_raw_buffer_queue(vi, qnum))
> > +             return;
> > +
> > +     /* If this sq is not the exclusive queue of the current cpu,
> > +      * then it may be called by start_xmit, so check it running out
> > +      * of space.
> > +      *
> > +      * Stop the queue to avoid getting packets that we are
> > +      * then unable to transmit. Then wait the tx interrupt.
> > +      */
> > +     if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> > +             netif_stop_subqueue(dev, qnum);
> > +}
> > +
> > +void virtnet_xsk_complete(struct send_queue *sq, u32 num)
> > +{
> > +     struct xsk_buff_pool *pool;
> > +
> > +     rcu_read_lock();
> > +     pool = rcu_dereference(sq->xsk.pool);
> > +     if (!pool) {
> > +             rcu_read_unlock();
> > +             return;
> > +     }
> > +     xsk_tx_completed(pool, num);
> > +     rcu_read_unlock();
> > +
> > +     if (sq->xsk.need_wakeup) {
> > +             sq->xsk.need_wakeup = false;
> > +             virtqueue_napi_schedule(&sq->napi, sq->vq);
> > +     }
> > +}
> > +
> > +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> > +                         struct xdp_desc *desc)
> > +{
> > +     struct virtnet_xsk_ctx_tx *ctx;
> > +     struct virtnet_info *vi;
> > +     u32 offset, n, len;
> > +     struct page *page;
> > +     void *data;
> > +
> > +     vi = sq->vq->vdev->priv;
> > +
> > +     data = xsk_buff_raw_get_data(pool, desc->addr);
> > +     offset = offset_in_page(data);
> > +
> > +     ctx = virtnet_xsk_ctx_tx_get(sq->xsk.ctx_head);
> > +
> > +     /* xsk unaligned mode, desc may use two pages */
> > +     if (desc->len > PAGE_SIZE - offset)
> > +             n = 3;
> > +     else
> > +             n = 2;
> > +
> > +     sg_init_table(sq->sg, n);
> > +     sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
> > +
> > +     /* handle for xsk first page */
> > +     len = min_t(int, desc->len, PAGE_SIZE - offset);
> > +     page = vmalloc_to_page(data);
> > +     sg_set_page(sq->sg + 1, page, len, offset);
> > +
> > +     /* ctx is used to record and reference this page to prevent xsk from
> > +      * being released before this xmit is recycled
> > +      */
>
>
> I'm a little bit surprised that this is done manually per device instead
> of doing it in xsk core.

The pages that the data pointer points to are pinned by the xsk core,
so they will not be released until the socket dies. In this case, we
will do a synchronize_net() to wait for the driver to stop using the
socket (and any pages), then start cleaning everything up. During this
cleanup, ndo_bpf is called with the command XDP_SETUP_XSK_POOL and a
NULL pool pointer, which means that the driver should tear the zero-copy
path down and not use it anymore. Not until after that has completed is
the umem memory with all its packet buffers released. I do not see why
this extra refcounting is needed, but I might have missed something, of
course. Is there something special about the virtio-net driver that we
need to take care of in this context?
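
In other words, on a NIC that can flush its rings on teardown, the disable
path would reduce to roughly the following (sketch only; it deliberately
ignores the virtio-specific problem that buffers may still sit in the vq
because a single virtqueue cannot be reset, which is what the ctx machinery
in this patch works around):

        static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
        {
                struct virtnet_info *vi = netdev_priv(dev);
                struct send_queue *sq;

                if (qid >= vi->curr_queue_pairs)
                        return -EINVAL;
                sq = &vi->sq[qid];

                rcu_assign_pointer(sq->xsk.pool, NULL);

                /* After this, neither NAPI nor the wakeup path can see the
                 * pool; the umem and its pages are only freed once this
                 * ndo_bpf call has returned to the xsk core.
                 */
                synchronize_net();

                return 0;
        }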

>
> > +     ctx->ctx.page = page;
> > +     get_page(page);
> > +
> > +     /* xsk unaligned mode, handle for the second page */
> > +     if (len < desc->len) {
> > +             page = vmalloc_to_page(data + len);
> > +             len = min_t(int, desc->len - len, PAGE_SIZE);
> > +             sg_set_page(sq->sg + 2, page, len, 0);
> > +
> > +             ctx->ctx.page_unaligned = page;
> > +             get_page(page);
> > +     } else {
> > +             ctx->ctx.page_unaligned = NULL;
> > +     }
> > +
> > +     return virtqueue_add_outbuf(sq->vq, sq->sg, n,
> > +                                xsk_to_ptr(ctx), GFP_ATOMIC);
> > +}
> > +
> > +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> > +                               struct xsk_buff_pool *pool,
> > +                               unsigned int budget,
> > +                               bool in_napi, int *done,
> > +                               struct virtnet_sq_stats *stats)
> > +{
> > +     struct xdp_desc desc;
> > +     int err, packet = 0;
> > +     int ret = -EAGAIN;
> > +
> > +     while (budget-- > 0) {
> > +             if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
>
>
> AF_XDP doesn't use skb, so I don't see why MAX_SKB_FRAGS is used.
>
> Looking at virtnet_xsk_xmit(), it looks to me 3 is more suitable here.
> Or did AF_XDP core can handle queue full gracefully then we don't even
> need to worry about this?

We need to make sure that there is enough space in the outgoing queue
/ HW Tx ring somewhere. The easiest place to do this is before you get
the next packet from the Tx ring, as you would have to return it if
there was not enough space. Note that we do not have a function for
returning a packet to the Tx ring at the moment, and I would like to
avoid adding one. As Jason says, there is no reason to test for anything
skb-related here. The zero-copy path never uses skbs, unless you get an
XDP_PASS from an XDP program.
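
So the resulting loop ordering is roughly (sketch; 3 is the worst case of
header + two data pages that virtnet_xsk_xmit() can post):

        while (budget-- > 0) {
                /* Reserve vq slots *before* peeking: a peeked descriptor
                 * cannot be handed back to the Tx ring.
                 */
                if (sq->vq->num_free < 3) {
                        ret = -ENOSPC;
                        break;
                }

                if (!xsk_tx_peek_desc(pool, &desc)) {
                        ret = 0;        /* Tx ring empty, we are done */
                        break;
                }

                err = virtnet_xsk_xmit(sq, pool, &desc);
                if (unlikely(err))
                        break;
        }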

>
> > +                     ret = -EBUSY;
>
>
> -ENOSPC looks better.
>
>
> > +                     break;
> > +             }
> > +
> > +             if (!xsk_tx_peek_desc(pool, &desc)) {
> > +                     /* done */
> > +                     ret = 0;
> > +                     break;
> > +             }
> > +
> > +             err = virtnet_xsk_xmit(sq, pool, &desc);
> > +             if (unlikely(err)) {
>
>
> If we always reserve sufficient slots, this should be an unexpected
> error, do we need log this as what has been done in start_xmit()?
>
>          /* This should not happen! */
>          if (unlikely(err)) {
>                  dev->stats.tx_fifo_errors++;
>                  if (net_ratelimit())
>                          dev_warn(&dev->dev,
>                                   "Unexpected TXQ (%d) queue failure: %d\n",
>                                   qnum, err);
>
>
> > +                     ret = -EBUSY;
> > +                     break;
> > +             }
> > +
> > +             ++packet;
> > +     }
> > +
> > +     if (packet) {
> > +             if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> > +                     ++stats->kicks;
> > +
> > +             *done += packet;
> > +             stats->xdp_tx += packet;
> > +
> > +             xsk_tx_release(pool);
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
> > +                        int budget, bool in_napi)
> > +{
> > +     struct virtnet_sq_stats stats = {};
> > +     int done = 0;
> > +     int err;
> > +
> > +     sq->xsk.need_wakeup = false;
> > +     __free_old_xmit(sq, in_napi, &stats);
> > +
> > +     /* return err:
> > +      * -EAGAIN: done == budget
> > +      * -EBUSY:  done < budget
> > +      *  0    :  done < budget
> > +      */
>
>
> It's better to move the comment to the implementation of
> virtnet_xsk_xmit_batch().
>
>
> > +xmit:
> > +     err = virtnet_xsk_xmit_batch(sq, pool, budget - done, in_napi,
> > +                                  &done, &stats);
> > +     if (err == -EBUSY) {
> > +             __free_old_xmit(sq, in_napi, &stats);
> > +
> > +             /* If the space is enough, let napi run again. */
> > +             if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>
>
> The comment does not match the code.
>
>
> > +                     goto xmit;
> > +             else
> > +                     sq->xsk.need_wakeup = true;
> > +     }
> > +
> > +     virtnet_xsk_check_queue(sq);
> > +
> > +     u64_stats_update_begin(&sq->stats.syncp);
> > +     sq->stats.packets += stats.packets;
> > +     sq->stats.bytes += stats.bytes;
> > +     sq->stats.kicks += stats.kicks;
> > +     sq->stats.xdp_tx += stats.xdp_tx;
> > +     u64_stats_update_end(&sq->stats.syncp);
> > +
> > +     return done;
> > +}
> > +
> > +int virtnet_poll_xsk(struct send_queue *sq, int budget)
> > +{
> > +     struct xsk_buff_pool *pool;
> > +     int work_done = 0;
> > +
> > +     rcu_read_lock();
> > +     pool = rcu_dereference(sq->xsk.pool);
> > +     if (pool)
> > +             work_done = virtnet_xsk_run(sq, pool, budget, true);
> > +     rcu_read_unlock();
> > +     return work_done;
> > +}
> > +
> > +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> > +{
> > +     struct virtnet_info *vi = netdev_priv(dev);
> > +     struct xsk_buff_pool *pool;
> > +     struct send_queue *sq;
> > +
> > +     if (!netif_running(dev))
> > +             return -ENETDOWN;
> > +
> > +     if (qid >= vi->curr_queue_pairs)
> > +             return -EINVAL;
>
>
> I wonder how we can hit this check. Note that we prevent the user from
> modifying queue pairs when XDP is enabled:
>
>          /* For now we don't support modifying channels while XDP is loaded
>           * also when XDP is loaded all RX queues have XDP programs so
> we only
>           * need to check a single RX queue.
>           */
>          if (vi->rq[0].xdp_prog)
>                  return -EINVAL;
>
> > +
> > +     sq = &vi->sq[qid];
> > +
> > +     rcu_read_lock();
>
>
> Can we simply use rcu_read_lock_bh() here?
>
>
> > +     pool = rcu_dereference(sq->xsk.pool);
> > +     if (pool) {
> > +             local_bh_disable();
> > +             virtqueue_napi_schedule(&sq->napi, sq->vq);
> > +             local_bh_enable();
> > +     }
> > +     rcu_read_unlock();
> > +     return 0;
> > +}
> > +
> > +static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
> > +                                                       struct virtqueue *vq)
> > +{
> > +     struct virtnet_xsk_ctx_head *head;
> > +     u32 size, n, ring_size, ctx_sz;
> > +     struct virtnet_xsk_ctx *ctx;
> > +     void *p;
> > +
> > +     ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
> > +
> > +     ring_size = virtqueue_get_vring_size(vq);
> > +     size = sizeof(*head) + ctx_sz * ring_size;
> > +
> > +     head = kmalloc(size, GFP_ATOMIC);
> > +     if (!head)
> > +             return NULL;
> > +
> > +     memset(head, 0, sizeof(*head));
> > +
> > +     head->active = true;
> > +     head->frame_size = xsk_pool_get_rx_frame_size(pool);
> > +
> > +     p = head + 1;
> > +     for (n = 0; n < ring_size; ++n) {
> > +             ctx = p;
> > +             ctx->head = head;
> > +             ctx->next = head->ctx;
> > +             head->ctx = ctx;
> > +
> > +             p += ctx_sz;
> > +     }
> > +
> > +     return head;
> > +}
> > +
> > +static int virtnet_xsk_pool_enable(struct net_device *dev,
> > +                                struct xsk_buff_pool *pool,
> > +                                u16 qid)
> > +{
> > +     struct virtnet_info *vi = netdev_priv(dev);
> > +     struct send_queue *sq;
> > +
> > +     if (qid >= vi->curr_queue_pairs)
> > +             return -EINVAL;
> > +
> > +     sq = &vi->sq[qid];
> > +
> > +     /* xsk zerocopy depend on the tx napi.
> > +      *
> > +      * All data is actually consumed and sent out from the xsk tx queue
> > +      * under the tx napi mechanism.
> > +      */
> > +     if (!sq->napi.weight)
> > +             return -EPERM;
> > +
> > +     memset(&sq->xsk, 0, sizeof(sq->xsk));
> > +
> > +     sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
> > +     if (!sq->xsk.ctx_head)
> > +             return -ENOMEM;
> > +
> > +     /* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> > +      * safe.
> > +      */
> > +     rcu_assign_pointer(sq->xsk.pool, pool);
> > +
> > +     return 0;
> > +}
> > +
> > +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> > +{
> > +     struct virtnet_info *vi = netdev_priv(dev);
> > +     struct send_queue *sq;
> > +
> > +     if (qid >= vi->curr_queue_pairs)
> > +             return -EINVAL;
> > +
> > +     sq = &vi->sq[qid];
> > +
> > +     /* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> > +      * safe.
> > +      */
> > +     rcu_assign_pointer(sq->xsk.pool, NULL);
> > +
> > +     /* Sync with the XSK wakeup and with NAPI. */
> > +     synchronize_net();
> > +
> > +     if (READ_ONCE(sq->xsk.ctx_head->ref))
> > +             WRITE_ONCE(sq->xsk.ctx_head->active, false);
> > +     else
> > +             kfree(sq->xsk.ctx_head);
> > +
> > +     sq->xsk.ctx_head = NULL;
> > +
> > +     return 0;
> > +}
> > +
> > +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
> > +{
> > +     if (xdp->xsk.pool)
> > +             return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
> > +                                            xdp->xsk.queue_id);
> > +     else
> > +             return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
> > +}
> > +
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > new file mode 100644
> > index 000000000000..54948e0b07fc
> > --- /dev/null
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -0,0 +1,99 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +
> > +#ifndef __XSK_H__
> > +#define __XSK_H__
> > +
> > +#define VIRTIO_XSK_FLAG      BIT(1)
> > +
> > +/* When xsk disable, under normal circumstances, the network card must reclaim
> > + * all the memory that has been sent and the memory added to the rq queue by
> > + * destroying the queue.
> > + *
> > + * But virtio's queue does not support separate setting to been disable.
>
>
> This is a call for us to implement per queue enable/disable. Virtio-mmio
> has such facility but virtio-pci only allow to disable a queue (not enable).
>
>
> > "Reset"
> > + * is not very suitable.
> > + *
> > + * The method here is that each sent chunk or chunk added to the rq queue is
> > + * described by an independent structure struct virtnet_xsk_ctx.
> > + *
> > + * We will use get_page(page) to refer to the page where these chunks are
> > + * located. And these pages will be recorded in struct virtnet_xsk_ctx. So these
> > + * chunks in vq are safe. When recycling, put the these page.
> > + *
> > + * These structures point to struct virtnet_xsk_ctx_head, and ref records how
> > + * many chunks have not been reclaimed. If active == 0, it means that xsk has
> > + * been disabled.
> > + *
> > + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
> > + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
> > + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
> > + * ref is 0.
>
>
> This looks complicated and it will increase the footprint. Consider the
> performance penalty and the complexity, I would suggest to use reset
> instead.

OK, this explains your earlier reference counting. Let us keep it
simple, as Jason suggests. If you need anything from the core xsk code,
just let me know. You are the first one to implement AF_XDP zero-copy
on a virtual device, so new requirements might pop up.

Thanks for working on this!

> Then we don't need to introduce such context.
>
> Thanks
>
>
> > + */
> > +struct virtnet_xsk_ctx;
> > +struct virtnet_xsk_ctx_head {
> > +     struct virtnet_xsk_ctx *ctx;
> > +
> > +     /* how many ctx has been add to vq */
> > +     u64 ref;
> > +
> > +     unsigned int frame_size;
> > +
> > +     /* the xsk status */
> > +     bool active;
> > +};
> > +
> > +struct virtnet_xsk_ctx {
> > +     struct virtnet_xsk_ctx_head *head;
> > +     struct virtnet_xsk_ctx *next;
> > +
> > +     struct page *page;
> > +
> > +     /* xsk unaligned mode will use two page in one desc */
> > +     struct page *page_unaligned;
> > +};
> > +
> > +struct virtnet_xsk_ctx_tx {
> > +     /* this *MUST* be the first */
> > +     struct virtnet_xsk_ctx ctx;
> > +
> > +     /* xsk tx xmit use this record the len of packet */
> > +     u32 len;
> > +};
> > +
> > +static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
> > +{
> > +     return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
> > +}
> > +
> > +static inline struct virtnet_xsk_ctx_tx *ptr_to_xsk(void *ptr)
> > +{
> > +     unsigned long p;
> > +
> > +     p = (unsigned long)ptr;
> > +     return (struct virtnet_xsk_ctx_tx *)(p & ~VIRTIO_XSK_FLAG);
> > +}
> > +
> > +static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
> > +{
> > +     put_page(ctx->page);
> > +     if (ctx->page_unaligned)
> > +             put_page(ctx->page_unaligned);
> > +
> > +     --ctx->head->ref;
> > +
> > +     if (ctx->head->active) {
> > +             ctx->next = ctx->head->ctx;
> > +             ctx->head->ctx = ctx;
> > +     } else {
> > +             if (!ctx->head->ref)
> > +                     kfree(ctx->head);
> > +     }
> > +}
> > +
> > +#define virtnet_xsk_ctx_tx_put(ctx) \
> > +     virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
> > +
> > +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> > +int virtnet_poll_xsk(struct send_queue *sq, int budget);
> > +void virtnet_xsk_complete(struct send_queue *sq, u32 num);
> > +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > +#endif
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-16  9:26     ` Jason Wang
  (?)
  (?)
@ 2021-06-16 10:19     ` Xuan Zhuo
  2021-06-16 12:51         ` Jason Wang
  -1 siblings, 1 reply; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-16 10:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Wed, 16 Jun 2021 17:26:33 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> > AF_XDP(xdp socket, xsk) is a high-performance packet receiving and
> > sending technology.
> >
> > This patch implements the binding and unbinding operations of xsk and
> > the virtio-net queue for xsk zero copy xmit.
> >
> > The xsk zero copy xmit depends on tx napi. Because the actual sending
> > of data is done in the process of tx napi. If tx napi does not
> > work, then the data of the xsk tx queue will not be sent.
> > So if tx napi is not true, an error will be reported when bind xsk.
> >
> > If xsk is active, it will prevent ethtool from modifying tx napi.
> >
> > When reclaiming ptr, a new type of ptr is added, which is distinguished
> > based on the last two digits of ptr:
> > 00: skb
> > 01: xdp frame
> > 10: xsk xmit ptr
> >
> > All sent xsk packets share the virtio-net header of xsk_hdr. If xsk
> > needs to support csum and other functions later, consider assigning xsk
> > hdr separately for each sent packet.
> >
> > Different from other physical network cards, you can reinitialize the
> > channel when you bind xsk. And vrtio does not support independent reset
> > channel, you can only reset the entire device. I think it is not
> > appropriate for us to directly reset the entire setting. So the
> > situation becomes a bit more complicated. We have to consider how
> > to deal with the buffer referenced in vq after xsk is unbind.
> >
> > I added the ring size struct virtnet_xsk_ctx when xsk been bind. Each xsk
> > buffer added to vq corresponds to a ctx. This ctx is used to record the
> > page where the xsk buffer is located, and add a page reference. When the
> > buffer is recycling, reduce the reference to page. When xsk has been
> > unbind, and all related xsk buffers have been recycled, release all ctx.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> >   drivers/net/virtio/Makefile     |   1 +
> >   drivers/net/virtio/virtio_net.c |  20 +-
> >   drivers/net/virtio/virtio_net.h |  37 +++-
> >   drivers/net/virtio/xsk.c        | 346 ++++++++++++++++++++++++++++++++
> >   drivers/net/virtio/xsk.h        |  99 +++++++++
> >   5 files changed, 497 insertions(+), 6 deletions(-)
> >   create mode 100644 drivers/net/virtio/xsk.c
> >   create mode 100644 drivers/net/virtio/xsk.h
> >
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index ccc80f40f33a..db79d2e7925f 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -4,3 +4,4 @@
> >   #
> >
> >   obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> > +obj-$(CONFIG_VIRTIO_NET) += xsk.o
> > diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> > index 395ec1f18331..40d7751f1c5f 100644
> > --- a/drivers/net/virtio/virtio_net.c
> > +++ b/drivers/net/virtio/virtio_net.c
> > @@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> >
> >   	txq = netdev_get_tx_queue(vi->dev, index);
> >   	__netif_tx_lock(txq, raw_smp_processor_id());
> > +	work_done += virtnet_poll_xsk(sq, budget);
> >   	free_old_xmit(sq, true);
> >   	__netif_tx_unlock(txq);
> >
> > @@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev,
> >   	if (napi_weight ^ vi->sq[0].napi.weight) {
> >   		if (dev->flags & IFF_UP)
> >   			return -EBUSY;
> > -		for (i = 0; i < vi->max_queue_pairs; i++)
> > +
> > +		for (i = 0; i < vi->max_queue_pairs; i++) {
> > +			/* xsk xmit depend on the tx napi. So if xsk is active,
> > +			 * prevent modifications to tx napi.
> > +			 */
> > +			if (rtnl_dereference(vi->sq[i].xsk.pool))
> > +				continue;
>
>
> So this can result tx NAPI is used by some queues buy not the others. I
> think such inconsistency breaks the semantic of set_coalesce() which
> assumes the operation is done at the device not some specific queues.
>
> How about just fail here?
>
>
> > +
> >   			vi->sq[i].napi.weight = napi_weight;
> > +		}
> >   	}
> >
> >   	return 0;
> > @@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >   	switch (xdp->command) {
> >   	case XDP_SETUP_PROG:
> >   		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
> > +	case XDP_SETUP_XSK_POOL:
> > +		return virtnet_xsk_pool_setup(dev, xdp);
> >   	default:
> >   		return -EINVAL;
> >   	}
> > @@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = {
> >   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> >   	.ndo_bpf		= virtnet_xdp,
> >   	.ndo_xdp_xmit		= virtnet_xdp_xmit,
> > +	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
> >   	.ndo_features_check	= passthru_features_check,
> >   	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
> >   	.ndo_set_features	= virtnet_set_features,
> > @@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi)
> >   	for (i = 0; i < vi->max_queue_pairs; i++) {
> >   		struct virtqueue *vq = vi->sq[i].vq;
> >   		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> > -			if (!is_xdp_frame(buf))
> > +			if (is_skb_ptr(buf))
> >   				dev_kfree_skb(buf);
> > -			else
> > +			else if (is_xdp_frame(buf))
> >   				xdp_return_frame(ptr_to_xdp(buf));
> > +			else
> > +				virtnet_xsk_ctx_tx_put(ptr_to_xsk(buf));
> >   		}
> >   	}
> >
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index 931cc81f92fb..e3da829887dc 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -135,6 +135,16 @@ struct send_queue {
> >   	struct virtnet_sq_stats stats;
> >
> >   	struct napi_struct napi;
> > +
> > +	struct {
> > +		struct xsk_buff_pool __rcu *pool;
> > +
> > +		/* xsk wait for tx inter or softirq */
> > +		bool need_wakeup;
> > +
> > +		/* ctx used to record the page added to vq */
> > +		struct virtnet_xsk_ctx_head *ctx_head;
> > +	} xsk;
> >   };
> >
> >   /* Internal representation of a receive virtqueue */
> > @@ -188,6 +198,13 @@ static inline void virtqueue_napi_schedule(struct napi_struct *napi,
> >   	}
> >   }
> >
> > +#include "xsk.h"
> > +
> > +static inline bool is_skb_ptr(void *ptr)
> > +{
> > +	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
> > +}
> > +
> >   static inline bool is_xdp_frame(void *ptr)
> >   {
> >   	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> > @@ -206,25 +223,39 @@ static inline struct xdp_frame *ptr_to_xdp(void *ptr)
> >   static inline void __free_old_xmit(struct send_queue *sq, bool in_napi,
> >   				   struct virtnet_sq_stats *stats)
> >   {
> > +	unsigned int xsknum = 0;
> >   	unsigned int len;
> >   	void *ptr;
> >
> >   	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > -		if (!is_xdp_frame(ptr)) {
> > +		if (is_skb_ptr(ptr)) {
> >   			struct sk_buff *skb = ptr;
> >
> >   			pr_debug("Sent skb %p\n", skb);
> >
> >   			stats->bytes += skb->len;
> >   			napi_consume_skb(skb, in_napi);
> > -		} else {
> > +		} else if (is_xdp_frame(ptr)) {
> >   			struct xdp_frame *frame = ptr_to_xdp(ptr);
> >
> >   			stats->bytes += frame->len;
> >   			xdp_return_frame(frame);
> > +		} else {
> > +			struct virtnet_xsk_ctx_tx *ctx;
> > +
> > +			ctx = ptr_to_xsk(ptr);
> > +
> > +			/* Maybe this ptr was sent by the last xsk. */
> > +			if (ctx->ctx.head->active)
> > +				++xsknum;
> > +
> > +			stats->bytes += ctx->len;
> > +			virtnet_xsk_ctx_tx_put(ctx);
> >   		}
> >   		stats->packets++;
> >   	}
> > -}
> >
> > +	if (xsknum)
> > +		virtnet_xsk_complete(sq, xsknum);
> > +}
> >   #endif
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > new file mode 100644
> > index 000000000000..f98b68576709
> > --- /dev/null
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -0,0 +1,346 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * virtio-net xsk
> > + */
> > +
> > +#include "virtio_net.h"
> > +
> > +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> > +
> > +static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> > +{
> > +	struct virtnet_xsk_ctx *ctx;
> > +
> > +	ctx = head->ctx;
> > +	head->ctx = ctx->next;
> > +
> > +	++head->ref;
> > +
> > +	return ctx;
> > +}
> > +
> > +#define virtnet_xsk_ctx_tx_get(head) ((struct virtnet_xsk_ctx_tx *)virtnet_xsk_ctx_get(head))
> > +
> > +static void virtnet_xsk_check_queue(struct send_queue *sq)
> > +{
> > +	struct virtnet_info *vi = sq->vq->vdev->priv;
> > +	struct net_device *dev = vi->dev;
> > +	int qnum = sq - vi->sq;
> > +
> > +	/* If it is a raw buffer queue, it does not check whether the status
> > +	 * of the queue is stopped when sending. So there is no need to check
> > +	 * the situation of the raw buffer queue.
> > +	 */
> > +	if (is_xdp_raw_buffer_queue(vi, qnum))
> > +		return;
> > +
> > +	/* If this sq is not the exclusive queue of the current cpu,
> > +	 * then it may be called by start_xmit, so check it running out
> > +	 * of space.
> > +	 *
> > +	 * Stop the queue to avoid getting packets that we are
> > +	 * then unable to transmit. Then wait the tx interrupt.
> > +	 */
> > +	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> > +		netif_stop_subqueue(dev, qnum);
> > +}
> > +
> > +void virtnet_xsk_complete(struct send_queue *sq, u32 num)
> > +{
> > +	struct xsk_buff_pool *pool;
> > +
> > +	rcu_read_lock();
> > +	pool = rcu_dereference(sq->xsk.pool);
> > +	if (!pool) {
> > +		rcu_read_unlock();
> > +		return;
> > +	}
> > +	xsk_tx_completed(pool, num);
> > +	rcu_read_unlock();
> > +
> > +	if (sq->xsk.need_wakeup) {
> > +		sq->xsk.need_wakeup = false;
> > +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> > +	}
> > +}
> > +
> > +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> > +			    struct xdp_desc *desc)
> > +{
> > +	struct virtnet_xsk_ctx_tx *ctx;
> > +	struct virtnet_info *vi;
> > +	u32 offset, n, len;
> > +	struct page *page;
> > +	void *data;
> > +
> > +	vi = sq->vq->vdev->priv;
> > +
> > +	data = xsk_buff_raw_get_data(pool, desc->addr);
> > +	offset = offset_in_page(data);
> > +
> > +	ctx = virtnet_xsk_ctx_tx_get(sq->xsk.ctx_head);
> > +
> > +	/* xsk unaligned mode, desc may use two pages */
> > +	if (desc->len > PAGE_SIZE - offset)
> > +		n = 3;
> > +	else
> > +		n = 2;
> > +
> > +	sg_init_table(sq->sg, n);
> > +	sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
> > +
> > +	/* handle for xsk first page */
> > +	len = min_t(int, desc->len, PAGE_SIZE - offset);
> > +	page = vmalloc_to_page(data);
> > +	sg_set_page(sq->sg + 1, page, len, offset);
> > +
> > +	/* ctx is used to record and reference this page to prevent xsk from
> > +	 * being released before this xmit is recycled
> > +	 */
>
>
> I'm a little bit surprised that this is done manually per device instead
> of doing it in xsk core.
>
>
> > +	ctx->ctx.page = page;
> > +	get_page(page);
> > +
> > +	/* xsk unaligned mode, handle for the second page */
> > +	if (len < desc->len) {
> > +		page = vmalloc_to_page(data + len);
> > +		len = min_t(int, desc->len - len, PAGE_SIZE);
> > +		sg_set_page(sq->sg + 2, page, len, 0);
> > +
> > +		ctx->ctx.page_unaligned = page;
> > +		get_page(page);
> > +	} else {
> > +		ctx->ctx.page_unaligned = NULL;
> > +	}
> > +
> > +	return virtqueue_add_outbuf(sq->vq, sq->sg, n,
> > +				   xsk_to_ptr(ctx), GFP_ATOMIC);
> > +}
> > +
> > +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> > +				  struct xsk_buff_pool *pool,
> > +				  unsigned int budget,
> > +				  bool in_napi, int *done,
> > +				  struct virtnet_sq_stats *stats)
> > +{
> > +	struct xdp_desc desc;
> > +	int err, packet = 0;
> > +	int ret = -EAGAIN;
> > +
> > +	while (budget-- > 0) {
> > +		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
>
>
> AF_XDP doesn't use skb, so I don't see why MAX_SKB_FRAGS is used.
>
> Looking at virtnet_xsk_xmit(), it looks to me 3 is more suitable here.
> Or did AF_XDP core can handle queue full gracefully then we don't even
> need to worry about this?
>
>
> > +			ret = -EBUSY;
>
>
> -ENOSPC looks better.
>
>
> > +			break;
> > +		}
> > +
> > +		if (!xsk_tx_peek_desc(pool, &desc)) {
> > +			/* done */
> > +			ret = 0;
> > +			break;
> > +		}
> > +
> > +		err = virtnet_xsk_xmit(sq, pool, &desc);
> > +		if (unlikely(err)) {
>
>
> If we always reserve sufficient slots, this should be an unexpected
> error, do we need log this as what has been done in start_xmit()?
>
>          /* This should not happen! */
>          if (unlikely(err)) {
>                  dev->stats.tx_fifo_errors++;
>                  if (net_ratelimit())
>                          dev_warn(&dev->dev,
>                                   "Unexpected TXQ (%d) queue failure: %d\n",
>                                   qnum, err);
>
>
> > +			ret = -EBUSY;
> > +			break;
> > +		}
> > +
> > +		++packet;
> > +	}
> > +
> > +	if (packet) {
> > +		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> > +			++stats->kicks;
> > +
> > +		*done += packet;
> > +		stats->xdp_tx += packet;
> > +
> > +		xsk_tx_release(pool);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
> > +			   int budget, bool in_napi)
> > +{
> > +	struct virtnet_sq_stats stats = {};
> > +	int done = 0;
> > +	int err;
> > +
> > +	sq->xsk.need_wakeup = false;
> > +	__free_old_xmit(sq, in_napi, &stats);
> > +
> > +	/* return err:
> > +	 * -EAGAIN: done == budget
> > +	 * -EBUSY:  done < budget
> > +	 *  0    :  done < budget
> > +	 */
>
>
> It's better to move the comment to the implementation of
> virtnet_xsk_xmit_batch().
>
>
> > +xmit:
> > +	err = virtnet_xsk_xmit_batch(sq, pool, budget - done, in_napi,
> > +				     &done, &stats);
> > +	if (err == -EBUSY) {
> > +		__free_old_xmit(sq, in_napi, &stats);
> > +
> > +		/* If the space is enough, let napi run again. */
> > +		if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>
>
> The comment does not match the code.
>
>
> > +			goto xmit;
> > +		else
> > +			sq->xsk.need_wakeup = true;
> > +	}
> > +
> > +	virtnet_xsk_check_queue(sq);
> > +
> > +	u64_stats_update_begin(&sq->stats.syncp);
> > +	sq->stats.packets += stats.packets;
> > +	sq->stats.bytes += stats.bytes;
> > +	sq->stats.kicks += stats.kicks;
> > +	sq->stats.xdp_tx += stats.xdp_tx;
> > +	u64_stats_update_end(&sq->stats.syncp);
> > +
> > +	return done;
> > +}
> > +
> > +int virtnet_poll_xsk(struct send_queue *sq, int budget)
> > +{
> > +	struct xsk_buff_pool *pool;
> > +	int work_done = 0;
> > +
> > +	rcu_read_lock();
> > +	pool = rcu_dereference(sq->xsk.pool);
> > +	if (pool)
> > +		work_done = virtnet_xsk_run(sq, pool, budget, true);
> > +	rcu_read_unlock();
> > +	return work_done;
> > +}
> > +
> > +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> > +{
> > +	struct virtnet_info *vi = netdev_priv(dev);
> > +	struct xsk_buff_pool *pool;
> > +	struct send_queue *sq;
> > +
> > +	if (!netif_running(dev))
> > +		return -ENETDOWN;
> > +
> > +	if (qid >= vi->curr_queue_pairs)
> > +		return -EINVAL;
>
>
> I wonder how we can hit this check. Note that we prevent the user from
> modifying queue pairs when XDP is enabled:
>
>          /* For now we don't support modifying channels while XDP is loaded
>           * also when XDP is loaded all RX queues have XDP programs so
> we only
>           * need to check a single RX queue.
>           */
>          if (vi->rq[0].xdp_prog)
>                  return -EINVAL;
>
> > +
> > +	sq = &vi->sq[qid];
> > +
> > +	rcu_read_lock();
>
>
> Can we simply use rcu_read_lock_bh() here?
>
>
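For illustration, the simplification being suggested: rcu_read_lock_bh()
already disables bottom halves, so the explicit local_bh_disable()/enable()
pair around the NAPI schedule would no longer be needed. A minimal sketch,
not part of the posted patch:

	rcu_read_lock_bh();
	pool = rcu_dereference_bh(sq->xsk.pool);
	if (pool)
		virtqueue_napi_schedule(&sq->napi, sq->vq);
	rcu_read_unlock_bh();
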
> > +	pool = rcu_dereference(sq->xsk.pool);
> > +	if (pool) {
> > +		local_bh_disable();
> > +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> > +		local_bh_enable();
> > +	}
> > +	rcu_read_unlock();
> > +	return 0;
> > +}
> > +
> > +static struct virtnet_xsk_ctx_head *virtnet_xsk_ctx_alloc(struct xsk_buff_pool *pool,
> > +							  struct virtqueue *vq)
> > +{
> > +	struct virtnet_xsk_ctx_head *head;
> > +	u32 size, n, ring_size, ctx_sz;
> > +	struct virtnet_xsk_ctx *ctx;
> > +	void *p;
> > +
> > +	ctx_sz = sizeof(struct virtnet_xsk_ctx_tx);
> > +
> > +	ring_size = virtqueue_get_vring_size(vq);
> > +	size = sizeof(*head) + ctx_sz * ring_size;
> > +
> > +	head = kmalloc(size, GFP_ATOMIC);
> > +	if (!head)
> > +		return NULL;
> > +
> > +	memset(head, 0, sizeof(*head));
> > +
> > +	head->active = true;
> > +	head->frame_size = xsk_pool_get_rx_frame_size(pool);
> > +
> > +	p = head + 1;
> > +	for (n = 0; n < ring_size; ++n) {
> > +		ctx = p;
> > +		ctx->head = head;
> > +		ctx->next = head->ctx;
> > +		head->ctx = ctx;
> > +
> > +		p += ctx_sz;
> > +	}
> > +
> > +	return head;
> > +}
> > +
> > +static int virtnet_xsk_pool_enable(struct net_device *dev,
> > +				   struct xsk_buff_pool *pool,
> > +				   u16 qid)
> > +{
> > +	struct virtnet_info *vi = netdev_priv(dev);
> > +	struct send_queue *sq;
> > +
> > +	if (qid >= vi->curr_queue_pairs)
> > +		return -EINVAL;
> > +
> > +	sq = &vi->sq[qid];
> > +
> > +	/* xsk zerocopy depend on the tx napi.
> > +	 *
> > +	 * All data is actually consumed and sent out from the xsk tx queue
> > +	 * under the tx napi mechanism.
> > +	 */
> > +	if (!sq->napi.weight)
> > +		return -EPERM;
> > +
> > +	memset(&sq->xsk, 0, sizeof(sq->xsk));
> > +
> > +	sq->xsk.ctx_head = virtnet_xsk_ctx_alloc(pool, sq->vq);
> > +	if (!sq->xsk.ctx_head)
> > +		return -ENOMEM;
> > +
> > +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> > +	 * safe.
> > +	 */
> > +	rcu_assign_pointer(sq->xsk.pool, pool);
> > +
> > +	return 0;
> > +}
> > +
> > +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> > +{
> > +	struct virtnet_info *vi = netdev_priv(dev);
> > +	struct send_queue *sq;
> > +
> > +	if (qid >= vi->curr_queue_pairs)
> > +		return -EINVAL;
> > +
> > +	sq = &vi->sq[qid];
> > +
> > +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> > +	 * safe.
> > +	 */
> > +	rcu_assign_pointer(sq->xsk.pool, NULL);
> > +
> > +	/* Sync with the XSK wakeup and with NAPI. */
> > +	synchronize_net();
> > +
> > +	if (READ_ONCE(sq->xsk.ctx_head->ref))
> > +		WRITE_ONCE(sq->xsk.ctx_head->active, false);
> > +	else
> > +		kfree(sq->xsk.ctx_head);
> > +
> > +	sq->xsk.ctx_head = NULL;
> > +
> > +	return 0;
> > +}
> > +
> > +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
> > +{
> > +	if (xdp->xsk.pool)
> > +		return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
> > +					       xdp->xsk.queue_id);
> > +	else
> > +		return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
> > +}
> > +
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > new file mode 100644
> > index 000000000000..54948e0b07fc
> > --- /dev/null
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -0,0 +1,99 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +
> > +#ifndef __XSK_H__
> > +#define __XSK_H__
> > +
> > +#define VIRTIO_XSK_FLAG	BIT(1)
> > +
> > +/* When xsk disable, under normal circumstances, the network card must reclaim
> > + * all the memory that has been sent and the memory added to the rq queue by
> > + * destroying the queue.
> > + *
> > + * But virtio does not support disabling a single queue separately, and a full
>
>
> This is a call for us to implement per-queue enable/disable. Virtio-mmio
> has such a facility, but virtio-pci only allows enabling a queue (not
> disabling it).
>
>
> > device "reset"
> > + * is not very suitable.
> > + *
> > + * The method here is that each sent chunk or chunk added to the rq queue is
> > + * described by an independent structure struct virtnet_xsk_ctx.
> > + *
> > + * We will use get_page(page) to refer to the page where these chunks are
> > + * located. And these pages will be recorded in struct virtnet_xsk_ctx. So these
> > + * chunks in vq are safe. When recycling, put these pages.
> > + *
> > + * These structures point to struct virtnet_xsk_ctx_head, and ref records how
> > + * many chunks have not been reclaimed. If active == 0, it means that xsk has
> > + * been disabled.
> > + *
> > + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
> > + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
> > + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
> > + * ref is 0.
>
>
> This looks complicated and it will increase the footprint. Consider the
> performance penalty and the complexity, I would suggest to use reset
> instead.
>
> Then we don't need to introduce such context.


I don't like this either. It would be best if we could reset the queue, but as
I understand it the backend would have to support that at the same time, so
without a synchronized backend update xsk could not be used at all.

I don't think resetting the entire device is a good solution either. If you
want to bind xsk to 10 queues, you may have to reset the whole device 10
times. But the current spec does not support resetting a single queue, so I
chose the current solution.

Jason, what do you think we should do? Implement reset for a single queue?

Looking forward to your reply!!!

Thanks

>
> Thanks
>
>
> > + */
> > +struct virtnet_xsk_ctx;
> > +struct virtnet_xsk_ctx_head {
> > +	struct virtnet_xsk_ctx *ctx;
> > +
> > +	/* how many ctx has been add to vq */
> > +	u64 ref;
> > +
> > +	unsigned int frame_size;
> > +
> > +	/* the xsk status */
> > +	bool active;
> > +};
> > +
> > +struct virtnet_xsk_ctx {
> > +	struct virtnet_xsk_ctx_head *head;
> > +	struct virtnet_xsk_ctx *next;
> > +
> > +	struct page *page;
> > +
> > +	/* xsk unaligned mode will use two page in one desc */
> > +	struct page *page_unaligned;
> > +};
> > +
> > +struct virtnet_xsk_ctx_tx {
> > +	/* this *MUST* be the first */
> > +	struct virtnet_xsk_ctx ctx;
> > +
> > +	/* xsk tx xmit use this record the len of packet */
> > +	u32 len;
> > +};
> > +
> > +static inline void *xsk_to_ptr(struct virtnet_xsk_ctx_tx *ctx)
> > +{
> > +	return (void *)((unsigned long)ctx | VIRTIO_XSK_FLAG);
> > +}
> > +
> > +static inline struct virtnet_xsk_ctx_tx *ptr_to_xsk(void *ptr)
> > +{
> > +	unsigned long p;
> > +
> > +	p = (unsigned long)ptr;
> > +	return (struct virtnet_xsk_ctx_tx *)(p & ~VIRTIO_XSK_FLAG);
> > +}
> > +
> > +static inline void virtnet_xsk_ctx_put(struct virtnet_xsk_ctx *ctx)
> > +{
> > +	put_page(ctx->page);
> > +	if (ctx->page_unaligned)
> > +		put_page(ctx->page_unaligned);
> > +
> > +	--ctx->head->ref;
> > +
> > +	if (ctx->head->active) {
> > +		ctx->next = ctx->head->ctx;
> > +		ctx->head->ctx = ctx;
> > +	} else {
> > +		if (!ctx->head->ref)
> > +			kfree(ctx->head);
> > +	}
> > +}
> > +
> > +#define virtnet_xsk_ctx_tx_put(ctx) \
> > +	virtnet_xsk_ctx_put((struct virtnet_xsk_ctx *)ctx)
> > +
> > +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> > +int virtnet_poll_xsk(struct send_queue *sq, int budget);
> > +void virtnet_xsk_complete(struct send_queue *sq, u32 num);
> > +int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > +#endif
>
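
For completeness, the allocation side that pairs with virtnet_xsk_ctx_put()
above has to pop a free ctx off head->ctx and bump head->ref; roughly like the
sketch below (inferred from the put path shown above, the exact body lives in
drivers/net/virtio/xsk.c):

	static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
	{
		struct virtnet_xsk_ctx *ctx;

		ctx = head->ctx;
		head->ctx = ctx->next;

		++head->ref;

		return ctx;
	}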
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 03/15] virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
  2021-06-16  9:27     ` Jason Wang
  (?)
@ 2021-06-16 10:27     ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-16 10:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Wed, 16 Jun 2021 17:27:59 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/10 4:21 PM, Xuan Zhuo wrote:
> > virtio-net not use dma addr directly. So add this priv_flags
> > IFF_NOT_USE_DMA_ADDR.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >   drivers/net/virtio_net.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 0416a7e00914..6c1233f0ab3e 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3064,7 +3064,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> >
> >   	/* Set up network device as normal. */
> >   	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
> > -			   IFF_TX_SKB_NO_LINEAR;
> > +			   IFF_TX_SKB_NO_LINEAR | IFF_NOT_USE_DMA_ADDR;
>
>
> I wonder instead of doing trick like this, how about teach the virtio
> core to accept DMA address via sg?

Ok, I will try to do this.

Thanks.


>
> Thanks
>
>
> >   	dev->netdev_ops = &virtnet_netdev;
> >   	dev->features = NETIF_F_HIGHDMA;
> >
>
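
To make the direction concrete: the idea would be to let a driver hand the
virtio core an sg entry that already carries a DMA address (here, one
premapped by the xsk pool), so no IFF_NOT_USE_DMA_ADDR flag is needed at all.
A purely hypothetical sketch, reusing names from the xsk xmit patch in this
series; the premapped variant of virtqueue_add_outbuf() below does not exist
and only illustrates the shape of such an API:

	/* hypothetical: tell the core this sg already holds a DMA address */
	sg_init_table(sq->sg, 1);
	sg_dma_address(sq->sg) = xsk_buff_raw_get_dma(pool, desc->addr);
	sg_dma_len(sq->sg) = desc->len;
	err = virtqueue_add_outbuf_premapped(sq->vq, sq->sg, 1,
					     xsk_to_ptr(ctx), GFP_ATOMIC);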
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-16 10:19     ` Xuan Zhuo
@ 2021-06-16 12:51         ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-16 12:51 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust.li, netdev


On 2021/6/16 6:19 PM, Xuan Zhuo wrote:
>>> + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
>>> + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
>>> + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
>>> + * ref is 0.
>> This looks complicated and it will increase the footprint. Consider the
>> performance penalty and the complexity, I would suggest to use reset
>> instead.
>>
>> Then we don't need to introduce such context.
> I don't like this either. It is best if we can reset the queue, but then,
> according to my understanding, the backend should also be supported
> synchronously, so if you don't update the backend synchronously, you can't use
> xsk.


Yes, actually, vhost-net support per vq suspending. The problem is that 
we're lacking a proper API at virtio level.

Virtio-pci has queue_enable but it forbids writing zero to that.


>
> I don’t think resetting the entire dev is a good solution. If you want to bind
> xsk to 10 queues, you may have to reset the entire device 10 times. I don’t
> think this is a good way. But the current spec does not support reset single
> queue, so I chose the current solution.
>
> Jason, what do you think we are going to do? Realize the reset function of a
> single queue?


Yes, it's the best way. Do you want to work on that?

We can start from the spec patch, and introduce it as basic facility and 
implement it in the PCI transport first.

Thanks


>
> Looking forward to your reply!!!
>
> Thanks
>
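
For reference, the register being discussed is part of the virtio-pci common
configuration structure; the driver writes 1 to enable the selected queue, but
the spec gives it no way to write 0 back:

	/* include/uapi/linux/virtio_pci.h, excerpt */
	struct virtio_pci_common_cfg {
		/* ... device-wide fields ... */
		__le16 queue_select;		/* read-write */
		__le16 queue_size;		/* read-write, power of 2. */
		__le16 queue_msix_vector;	/* read-write */
		__le16 queue_enable;		/* read-write */
		/* ... */
	};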


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-16 12:51         ` Jason Wang
  (?)
@ 2021-06-16 12:57         ` Xuan Zhuo
  2021-06-17  2:36             ` Jason Wang
  -1 siblings, 1 reply; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-16 12:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust.li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Wed, 16 Jun 2021 20:51:41 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/16 6:19 PM, Xuan Zhuo wrote:
> >>> + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
> >>> + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
> >>> + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
> >>> + * ref is 0.
> >> This looks complicated and it will increase the footprint. Consider the
> >> performance penalty and the complexity, I would suggest to use reset
> >> instead.
> >>
> >> Then we don't need to introduce such context.
> > I don't like this either. It is best if we can reset the queue, but then,
> > according to my understanding, the backend should also be supported
> > synchronously, so if you don't update the backend synchronously, you can't use
> > xsk.
>
>
> Yes, actually, vhost-net support per vq suspending. The problem is that
> we're lacking a proper API at virtio level.
>
> Virtio-pci has queue_enable but it forbids writing zero to that.
>
>
> >
> > I don’t think resetting the entire dev is a good solution. If you want to bind
> > xsk to 10 queues, you may have to reset the entire device 10 times. I don’t
> > think this is a good way. But the current spec does not support reset single
> > queue, so I chose the current solution.
> >
> > Jason, what do you think we are going to do? Realize the reset function of a
> > single queue?
>
>
> Yes, it's the best way. Do you want to work on that?

Of course, I am very willing to continue this work, although it is a bit
unfortunate that users will then have to upgrade the backend before they can
use virtio-net + xsk.

I will complete the kernel changes as soon as possible, but I am not familiar
with the process for submitting a spec patch. Could you give me some guidance
on where to send it?

Thanks.

>
> We can start from the spec patch, and introduce it as basic facility and
> implement it in the PCI transport first.
>
> Thanks
>
>
> >
> > Looking forward to your reply!!!
> >
> > Thanks
> >
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
  2021-06-16 12:57         ` Xuan Zhuo
@ 2021-06-17  2:36             ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  2:36 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust.li, netdev


On 2021/6/16 8:57 PM, Xuan Zhuo wrote:
> On Wed, 16 Jun 2021 20:51:41 +0800, Jason Wang <jasowang@redhat.com> wrote:
>> On 2021/6/16 6:19 PM, Xuan Zhuo wrote:
>>>>> + * In this way, even if xsk has been unbundled with rq/sq, or a new xsk and
>>>>> + * rq/sq  are bound, and a new virtnet_xsk_ctx_head is created. It will not
>>>>> + * affect the old virtnet_xsk_ctx to be recycled. And free all head and ctx when
>>>>> + * ref is 0.
>>>> This looks complicated and it will increase the footprint. Consider the
>>>> performance penalty and the complexity, I would suggest to use reset
>>>> instead.
>>>>
>>>> Then we don't need to introduce such context.
>>> I don't like this either. It is best if we can reset the queue, but then,
>>> according to my understanding, the backend should also be supported
>>> synchronously, so if you don't update the backend synchronously, you can't use
>>> xsk.
>>
>> Yes, actually, vhost-net support per vq suspending. The problem is that
>> we're lacking a proper API at virtio level.
>>
>> Virtio-pci has queue_enable but it forbids writing zero to that.
>>
>>
>>> I don’t think resetting the entire dev is a good solution. If you want to bind
>>> xsk to 10 queues, you may have to reset the entire device 10 times. I don’t
>>> think this is a good way. But the current spec does not support reset single
>>> queue, so I chose the current solution.
>>>
>>> Jason, what do you think we are going to do? Realize the reset function of a
>>> single queue?
>>
>> Yes, it's the best way. Do you want to work on that?
> Of course, I am very willing to continue this work. Although users must upgrade
> the backend to use virtio-net + xsk in the future, this makes the situation a
> bit troublesome.
>
> I will complete the kernel modification as soon as possible, but I am not
> familiar with the process of submitting the spec patch. Can you give me some
> guidance and where should I send the spec patch.


Subscribe to the virtio dev mailing list [1] and send the spec patch there.

Thanks

[1] 
https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#feedback


>
> Thanks.
>
>> We can start from the spec patch, and introduce it as basic facility and
>> implement it in the PCI transport first.
>>
>> Thanks
>>
>>
>>> Looking forward to your reply!!!
>>>
>>> Thanks
>>>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-17  2:48     ` kernel test robot
  -1 siblings, 0 replies; 80+ messages in thread
From: kernel test robot @ 2021-06-17  2:48 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: kbuild-all, clang-built-linux, Jakub Kicinski,
	Michael S. Tsirkin, Jason Wang, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 4472 bytes --]

Hi Xuan,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Xuan-Zhuo/virtio-net-support-xdp-socket-zero-copy/20210617-033438
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git c7654495916e109f76a67fd3ae68f8fa70ab4faa
config: x86_64-randconfig-a015-20210617 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 64720f57bea6a6bf033feef4a5751ab9c0c3b401)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/f5f1e60139e7c38fbb4ed58d503e89bbb26c1464
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Xuan-Zhuo/virtio-net-support-xdp-socket-zero-copy/20210617-033438
        git checkout f5f1e60139e7c38fbb4ed58d503e89bbb26c1464
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/net/virtio/virtio_net.c:1304:2: warning: variable 'oom' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
           if (fill_recv_xsk(vi, rq, gfp))
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:56:28: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:58:30: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/virtio/virtio_net.c:1329:10: note: uninitialized use occurs here
           return !oom;
                   ^~~
   drivers/net/virtio/virtio_net.c:1304:2: note: remove the 'if' if its condition is always false
           if (fill_recv_xsk(vi, rq, gfp))
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:56:23: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                         ^
   drivers/net/virtio/virtio_net.c:1297:10: note: initialize the variable 'oom' to silence this warning
           bool oom;
                   ^
                    = 0
   1 warning generated.


vim +1304 drivers/net/virtio/virtio_net.c

  1285	
  1286	/*
  1287	 * Returns false if we couldn't fill entirely (OOM).
  1288	 *
  1289	 * Normally run in the receive path, but can also be run from ndo_open
  1290	 * before we're receiving packets, or from refill_work which is
  1291	 * careful to disable receiving (using napi_disable).
  1292	 */
  1293	static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
  1294				  gfp_t gfp)
  1295	{
  1296		int err;
  1297		bool oom;
  1298	
  1299		/* Because virtio-net does not yet support flow direct, all rx
  1300		 * channels must also process other non-xsk packets. If there is
  1301		 * no buf available from xsk temporarily, we try to allocate
  1302		 * memory to the channel.
  1303		 */
> 1304		if (fill_recv_xsk(vi, rq, gfp))
  1305			goto kick;
  1306	
  1307		do {
  1308			if (vi->mergeable_rx_bufs)
  1309				err = add_recvbuf_mergeable(vi, rq, gfp);
  1310			else if (vi->big_packets)
  1311				err = add_recvbuf_big(vi, rq, gfp);
  1312			else
  1313				err = add_recvbuf_small(vi, rq, gfp);
  1314	
  1315			oom = err == -ENOMEM;
  1316			if (err)
  1317				break;
  1318		} while (rq->vq->num_free);
  1319	
  1320	kick:
  1321		if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
  1322			unsigned long flags;
  1323	
  1324			flags = u64_stats_update_begin_irqsave(&rq->stats.syncp);
  1325			rq->stats.kicks++;
  1326			u64_stats_update_end_irqrestore(&rq->stats.syncp, flags);
  1327		}
  1328	
  1329		return !oom;
  1330	}
  1331	
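
The warning is real for the early-return path: when fill_recv_xsk() succeeds
and the code jumps straight to the kick label, 'oom' is never assigned. The
minimal fix is the initialization the robot suggests, e.g. (sketch):

	bool oom = false;

which keeps try_fill_recv() returning true when the xsk path filled the ring.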

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31921 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 05/15] virtio: support virtqueue_detach_unused_buf_ctx
  2021-06-10  8:21   ` Xuan Zhuo
@ 2021-06-17  2:48     ` kernel test robot
  -1 siblings, 0 replies; 80+ messages in thread
From: kernel test robot @ 2021-06-17  2:48 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: kbuild-all, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 3144 bytes --]

Hi Xuan,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Xuan-Zhuo/virtio-net-support-xdp-socket-zero-copy/20210617-033438
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git c7654495916e109f76a67fd3ae68f8fa70ab4faa
config: parisc-randconfig-r023-20210617 (attached as .config)
compiler: hppa-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/6155fdb771fa9f6c96472440c6b846dbfc4aebde
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Xuan-Zhuo/virtio-net-support-xdp-socket-zero-copy/20210617-033438
        git checkout 6155fdb771fa9f6c96472440c6b846dbfc4aebde
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=parisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/virtio/virtio_ring.c:1898: warning: expecting prototype for virtqueue_get_buf(). Prototype was for virtqueue_get_buf_ctx() instead
   drivers/virtio/virtio_ring.c:2024: warning: Function parameter or member 'ctx' not described in 'virtqueue_detach_unused_buf_ctx'
>> drivers/virtio/virtio_ring.c:2024: warning: expecting prototype for virtqueue_detach_unused_buf(). Prototype was for virtqueue_detach_unused_buf_ctx() instead


vim +2024 drivers/virtio/virtio_ring.c

e6f633e5beab65 Tiwei Bie  2018-11-21  2014  
138fd25148638a Tiwei Bie  2018-11-21  2015  /**
138fd25148638a Tiwei Bie  2018-11-21  2016   * virtqueue_detach_unused_buf - detach first unused buffer
a5581206c565a7 Jiang Biao 2019-04-23  2017   * @_vq: the struct virtqueue we're talking about.
138fd25148638a Tiwei Bie  2018-11-21  2018   *
138fd25148638a Tiwei Bie  2018-11-21  2019   * Returns NULL or the "data" token handed to virtqueue_add_*().
138fd25148638a Tiwei Bie  2018-11-21  2020   * This is not valid on an active queue; it is useful only for device
138fd25148638a Tiwei Bie  2018-11-21  2021   * shutdown.
138fd25148638a Tiwei Bie  2018-11-21  2022   */
6155fdb771fa9f Xuan Zhuo  2021-06-10  2023  void *virtqueue_detach_unused_buf_ctx(struct virtqueue *_vq, void **ctx)
138fd25148638a Tiwei Bie  2018-11-21 @2024  {
1ce9e6055fa0a9 Tiwei Bie  2018-11-21  2025  	struct vring_virtqueue *vq = to_vvq(_vq);
1ce9e6055fa0a9 Tiwei Bie  2018-11-21  2026  
6155fdb771fa9f Xuan Zhuo  2021-06-10  2027  	return vq->packed_ring ?
6155fdb771fa9f Xuan Zhuo  2021-06-10  2028  		virtqueue_detach_unused_buf_ctx_packed(_vq, ctx) :
6155fdb771fa9f Xuan Zhuo  2021-06-10  2029  		virtqueue_detach_unused_buf_ctx_split(_vq, ctx);
6155fdb771fa9f Xuan Zhuo  2021-06-10  2030  }
6155fdb771fa9f Xuan Zhuo  2021-06-10  2031  
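
Both warnings point at the kernel-doc block above: it still names the old
function and does not describe the new parameter. A sketch of the comment
update that would silence them (the wording of the @ctx line is illustrative):

	/**
	 * virtqueue_detach_unused_buf_ctx - detach first unused buffer
	 * @_vq: the struct virtqueue we're talking about.
	 * @ctx: extra context for the buffer, if any, returned to the caller.
	 *
	 * Returns NULL or the "data" token handed to virtqueue_add_*().
	 * This is not valid on an active queue; it is useful only for device
	 * shutdown.
	 */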

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28973 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-17  3:07     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  3:07 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> Calling virtqueue_napi_schedule() in wakeup results in napi running on
> the current cpu. If the application is not busy, then there is no
> problem. But if the application itself is busy, it will cause a lot of
> scheduling.
>
> If the application is continuously sending data packets, due to the
> continuous scheduling between the application and napi, the data packet
> transmission will not be smooth, and there will be an obvious delay in
> the transmission (you can use tcpdump to see it). When pressing a
> channel to 100% (vhost reaches 100%), the cpu where the application is
> located reaches 100%.
>
> This patch sends a small amount of data directly in wakeup. The purpose
> of this is to trigger the tx interrupt. The tx interrupt will be
> awakened on the cpu of its affinity, and then trigger the operation of
> the napi mechanism, napi can continue to consume the xsk tx queue. Two
> cpus are running, cpu0 is running applications, cpu1 executes
> napi consumption data. The same is to press a channel to 100%, but the
> utilization rate of cpu0 is 12.7% and the utilization rate of cpu1 is
> 2.9%.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   drivers/net/virtio/xsk.c | 28 +++++++++++++++++++++++-----
>   1 file changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> index 36cda2dcf8e7..3973c82d1ad2 100644
> --- a/drivers/net/virtio/xsk.c
> +++ b/drivers/net/virtio/xsk.c
> @@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
>   	struct xsk_buff_pool *pool;
> +	struct netdev_queue *txq;
>   	struct send_queue *sq;
>   
>   	if (!netif_running(dev))
> @@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
>   
>   	rcu_read_lock();
>   	pool = rcu_dereference(sq->xsk.pool);
> -	if (pool) {
> -		local_bh_disable();
> -		virtqueue_napi_schedule(&sq->napi, sq->vq);
> -		local_bh_enable();
> -	}
> +	if (!pool)
> +		goto end;
> +
> +	if (napi_if_scheduled_mark_missed(&sq->napi))
> +		goto end;
> +
> +	txq = netdev_get_tx_queue(dev, qid);
> +
> +	__netif_tx_lock_bh(txq);
> +
> +	/* Send part of the packet directly to reduce the delay in sending the
> +	 * packet, and this can actively trigger the tx interrupts.
> +	 *
> +	 * If no packet is sent out, the ring of the device is full. In this
> +	 * case, we will still get a tx interrupt response. Then we will deal
> +	 * with the subsequent packet sending work.
> +	 */
> +	virtnet_xsk_run(sq, pool, sq->napi.weight, false);


This looks tricky, and it won't be efficient since there could be some 
contention on the tx lock.

I wonder if we can simulate the interrupt via IPI like what RPS did.

In the long run, we may want to extend the spec to support triggering the
interrupt through the driver.

Thanks


> +
> +	__netif_tx_unlock_bh(txq);
> +
> +end:
>   	rcu_read_unlock();
>   	return 0;
>   }
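
To spell out the "IPI like RPS" idea: RPS kicks a remote CPU with
smp_call_function_single_async(), whose callback raises NET_RX_SOFTIRQ there.
The analogue here would be to fire an IPI at the CPU the tx interrupt is
affine to and schedule the sq NAPI from that CPU instead of from the CPU
calling sendto(). A very rough sketch; sq->xsk.csd, target_cpu and
virtnet_xsk_remote_kick() are placeholders, not part of the posted series:

	/* runs on the target cpu, in IPI context; csd initialised elsewhere
	 * with INIT_CSD(&sq->xsk.csd, virtnet_xsk_remote_kick, sq)
	 */
	static void virtnet_xsk_remote_kick(void *data)
	{
		struct send_queue *sq = data;

		virtqueue_napi_schedule(&sq->napi, sq->vq);
	}

	/* in virtnet_xsk_wakeup(), instead of scheduling NAPI locally */
	smp_call_function_single_async(target_cpu, &sq->xsk.csd);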


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-17  3:08     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  3:08 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> After testing, the performance of calling kick every time is not stable.
> And if all the packets are sent and kicked again, the performance is not
> good. So add a module parameter to specify how many packets are sent to
> call a kick.
>
> 8 is a relatively stable value with the best performance.
>
> Here is the pps of the test of xsk_kick_thr under different values (from
> 1 to 64).
>
> thr  PPS             thr PPS             thr PPS
> 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
> 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
> 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
> 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
> 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
> 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
> 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
> 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709


I wonder what's the number for the case of non zc xsk?

Thanks


> 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
> 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
> 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
> 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
> 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
> 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
> 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
> 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
> 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
> 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
> 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
> 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
> 21   2845655.55789 | 43  2752951.93264 |
> 22   3450389.84365 | 44  2753107.27164 |
>
> It can be found that when the value of xsk_kick_thr is relatively small,
> the performance is not good, and when its value is greater than 13, the
> performance will be more irregular and unstable. It looks similar from 3
> to 13, I chose 8 as the default value.
>
> The test environment is qemu + vhost-net. I modified vhost-net to drop
> the packets sent by vm directly, so that the cpu of vm can run higher.
> By default, the processes in the vm and the cpu of softirqd are too low,
> and there is no obvious difference in the test data.
>
> During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
> run for 300s, the pps of every second was recorded, and the average of
> the pps was finally taken. The vhost process cpu on the host has also
> reached 100%.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio/virtio_net.c |  1 +
>   drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
>   drivers/net/virtio/xsk.h        |  2 ++
>   3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> index 9503133e71f0..dfe509939b45 100644
> --- a/drivers/net/virtio/virtio_net.c
> +++ b/drivers/net/virtio/virtio_net.c
> @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
>   module_param(csum, bool, 0444);
>   module_param(gso, bool, 0444);
>   module_param(napi_tx, bool, 0644);
> +module_param(xsk_kick_thr, int, 0644);
>   
>   /* FIXME: MTU in config. */
>   #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> index 3973c82d1ad2..2f3ba6ab4798 100644
> --- a/drivers/net/virtio/xsk.c
> +++ b/drivers/net/virtio/xsk.c
> @@ -5,6 +5,8 @@
>   
>   #include "virtio_net.h"
>   
> +int xsk_kick_thr = 8;
> +
>   static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
>   
>   static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   	struct xdp_desc desc;
>   	int err, packet = 0;
>   	int ret = -EAGAIN;
> +	int need_kick = 0;
>   
>   	while (budget-- > 0) {
>   		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
> @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   		}
>   
>   		++packet;
> +		++need_kick;
> +		if (need_kick > xsk_kick_thr) {
> +			if (virtqueue_kick_prepare(sq->vq) &&
> +			    virtqueue_notify(sq->vq))
> +				++stats->kicks;
> +
> +			need_kick = 0;
> +		}
>   	}
>   
>   	if (packet) {
> -		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> -			++stats->kicks;
> +		if (need_kick) {
> +			if (virtqueue_kick_prepare(sq->vq) &&
> +			    virtqueue_notify(sq->vq))
> +				++stats->kicks;
> +		}
>   
>   		*done += packet;
>   		stats->xdp_tx += packet;
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> index fe22cf78d505..4f0f4f9cf23b 100644
> --- a/drivers/net/virtio/xsk.h
> +++ b/drivers/net/virtio/xsk.h
> @@ -7,6 +7,8 @@
>   
>   #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
>   
> +extern int xsk_kick_thr;
> +
>   /* When xsk disable, under normal circumstances, the network card must reclaim
>    * all the memory that has been sent and the memory added to the rq queue by
>    * destroying the queue.
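
Since the parameter is created with mode 0644, it can also be tuned at run
time for this kind of measurement, e.g. (assuming the module keeps its
virtio_net name after the directory move):

	echo 16 > /sys/module/virtio_net/parameters/xsk_kick_thr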


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
@ 2021-06-17  3:08     ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  3:08 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> After testing, the performance of calling kick every time is not stable.
> And if all the packets are sent and kicked again, the performance is not
> good. So add a module parameter to specify how many packets are sent to
> call a kick.
>
> 8 is a relatively stable value with the best performance.
>
> Here is the pps of the test of xsk_kick_thr under different values (from
> 1 to 64).
>
> thr  PPS             thr PPS             thr PPS
> 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
> 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
> 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
> 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
> 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
> 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
> 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
> 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709


I wonder what's the number for the case of non zc xsk?

Thanks


> 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
> 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
> 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
> 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
> 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
> 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
> 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
> 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
> 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
> 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
> 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
> 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
> 21   2845655.55789 | 43  2752951.93264 |
> 22   3450389.84365 | 44  2753107.27164 |
>
> It can be found that when the value of xsk_kick_thr is relatively small,
> the performance is not good, and when its value is greater than 13, the
> performance will be more irregular and unstable. It looks similar from 3
> to 13, I chose 8 as the default value.
>
> The test environment is qemu + vhost-net. I modified vhost-net to drop
> the packets sent by vm directly, so that the cpu of vm can run higher.
> By default, the processes in the vm and the cpu of softirqd are too low,
> and there is no obvious difference in the test data.
>
> During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
> run for 300s, the pps of every second was recorded, and the average of
> the pps was finally taken. The vhost process cpu on the host has also
> reached 100%.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio/virtio_net.c |  1 +
>   drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
>   drivers/net/virtio/xsk.h        |  2 ++
>   3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> index 9503133e71f0..dfe509939b45 100644
> --- a/drivers/net/virtio/virtio_net.c
> +++ b/drivers/net/virtio/virtio_net.c
> @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
>   module_param(csum, bool, 0444);
>   module_param(gso, bool, 0444);
>   module_param(napi_tx, bool, 0644);
> +module_param(xsk_kick_thr, int, 0644);
>   
>   /* FIXME: MTU in config. */
>   #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> index 3973c82d1ad2..2f3ba6ab4798 100644
> --- a/drivers/net/virtio/xsk.c
> +++ b/drivers/net/virtio/xsk.c
> @@ -5,6 +5,8 @@
>   
>   #include "virtio_net.h"
>   
> +int xsk_kick_thr = 8;
> +
>   static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
>   
>   static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   	struct xdp_desc desc;
>   	int err, packet = 0;
>   	int ret = -EAGAIN;
> +	int need_kick = 0;
>   
>   	while (budget-- > 0) {
>   		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
> @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   		}
>   
>   		++packet;
> +		++need_kick;
> +		if (need_kick > xsk_kick_thr) {
> +			if (virtqueue_kick_prepare(sq->vq) &&
> +			    virtqueue_notify(sq->vq))
> +				++stats->kicks;
> +
> +			need_kick = 0;
> +		}
>   	}
>   
>   	if (packet) {
> -		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> -			++stats->kicks;
> +		if (need_kick) {
> +			if (virtqueue_kick_prepare(sq->vq) &&
> +			    virtqueue_notify(sq->vq))
> +				++stats->kicks;
> +		}
>   
>   		*done += packet;
>   		stats->xdp_tx += packet;
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> index fe22cf78d505..4f0f4f9cf23b 100644
> --- a/drivers/net/virtio/xsk.h
> +++ b/drivers/net/virtio/xsk.h
> @@ -7,6 +7,8 @@
>   
>   #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
>   
> +extern int xsk_kick_thr;
> +
>   /* When xsk disable, under normal circumstances, the network card must reclaim
>    * all the memory that has been sent and the memory added to the rq queue by
>    * destroying the queue.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-17  3:23     ` Jason Wang
  -1 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  3:23 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust . li


On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> Compared to the case of xsk tx, the case of xsk zc rx is more
> complicated.
>
> When we process the buf received by vq, we may encounter ordinary
> buffers, or xsk buffers. What makes the situation more complicated is
> that in the case of mergeable, when num_buffer > 1, we may still
> encounter the case where xsk buffer is mixed with ordinary buffer.
>
> Another thing that makes the situation more complicated is that when we
> get an xsk buffer from vq, the xsk bound to this xsk buffer may have
> been unbound.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


This is somewhat similar to the tx case, where we don't have per-vq reset.

[...]

>   
> -	if (vi->mergeable_rx_bufs)
> +	if (is_xsk_ctx(ctx))
> +		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
> +	else if (vi->mergeable_rx_bufs)
>   		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
>   					stats);
>   	else if (vi->big_packets)
> @@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
>   	int err;
>   	bool oom;
>   
> +	/* Because virtio-net does not yet support flow direct,


Note that this is not the case any more. RSS is now supported by the
virtio spec and by qemu/vhost/tap. We just need some work on the
virtio-net driver part (e.g. the ethtool interface).
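
For context, a rough sketch of the ethtool plumbing the driver would
need (illustrative only: the rss_* fields and virtnet_commit_rss() are
made-up names, and the actual control-virtqueue command layout defined
by the spec is not shown):

static u32 virtnet_get_rxfh_key_size(struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);

	return vi->rss_key_size;
}

static u32 virtnet_get_rxfh_indir_size(struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);

	return vi->rss_indir_size;
}

static int virtnet_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
			    u8 *hfunc)
{
	struct virtnet_info *vi = netdev_priv(dev);

	if (hfunc)
		*hfunc = ETH_RSS_HASH_TOP;
	if (indir)
		memcpy(indir, vi->rss_indir, vi->rss_indir_size * sizeof(u32));
	if (key)
		memcpy(key, vi->rss_key, vi->rss_key_size);
	return 0;
}

static int virtnet_set_rxfh(struct net_device *dev, const u32 *indir,
			    const u8 *key, const u8 hfunc)
{
	struct virtnet_info *vi = netdev_priv(dev);

	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
		return -EOPNOTSUPP;
	if (indir)
		memcpy(vi->rss_indir, indir, vi->rss_indir_size * sizeof(u32));
	if (key)
		memcpy(vi->rss_key, key, vi->rss_key_size);

	/* Hypothetical helper: push the indirection table, key and hash
	 * types to the device over the control virtqueue.
	 */
	return virtnet_commit_rss(vi);
}

These would then be hooked into virtnet_ethtool_ops as .get_rxfh_key_size,
.get_rxfh_indir_size, .get_rxfh and .set_rxfh.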

Thanks



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-17  3:23     ` Jason Wang
  (?)
@ 2021-06-17  5:53     ` Xuan Zhuo
  2021-06-17  6:03         ` Jason Wang
  -1 siblings, 1 reply; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-17  5:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> > Compared to the case of xsk tx, the case of xsk zc rx is more
> > complicated.
> >
> > When we process the buf received by vq, we may encounter ordinary
> > buffers, or xsk buffers. What makes the situation more complicated is
> > that in the case of mergeable, when num_buffer > 1, we may still
> > encounter the case where xsk buffer is mixed with ordinary buffer.
> >
> > Another thing that makes the situation more complicated is that when we
> > get an xsk buffer from vq, the xsk bound to this xsk buffer may have
> > been unbound.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>
>
> This is somehow similar to the case of tx where we don't have per vq reset.
>
> [...]
>
> >
> > -	if (vi->mergeable_rx_bufs)
> > +	if (is_xsk_ctx(ctx))
> > +		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
> > +	else if (vi->mergeable_rx_bufs)
> >   		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
> >   					stats);
> >   	else if (vi->big_packets)
> > @@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
> >   	int err;
> >   	bool oom;
> >
> > +	/* Because virtio-net does not yet support flow direct,
>
>
> Note that this is not the case any more. RSS has been supported by
> virtio spec and qemu/vhost/tap now. We just need some work on the
> > virtio-net driver part (e.g. the ethtool interface).

Oh, are there any plans? Who is doing this work, can I help?

Thanks.

>
> Thanks
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
  2021-06-17  3:07     ` Jason Wang
  (?)
@ 2021-06-17  5:55     ` Xuan Zhuo
  2021-06-17  6:01         ` Jason Wang
  -1 siblings, 1 reply; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-17  5:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Thu, 17 Jun 2021 11:07:17 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> > Calling virtqueue_napi_schedule() in wakeup results in napi running on
> > the current cpu. If the application is not busy, then there is no
> > problem. But if the application itself is busy, it will cause a lot of
> > scheduling.
> >
> > If the application is continuously sending data packets, due to the
> > continuous scheduling between the application and napi, the data packet
> > transmission will not be smooth, and there will be an obvious delay in
> > the transmission (you can use tcpdump to see it). When pressing a
> > channel to 100% (vhost reaches 100%), the cpu where the application is
> > located reaches 100%.
> >
> > This patch sends a small amount of data directly in wakeup. The purpose
> > of this is to trigger the tx interrupt. The tx interrupt will be
> > awakened on the cpu of its affinity, and then trigger the operation of
> > the napi mechanism, napi can continue to consume the xsk tx queue. Two
> > cpus are running, cpu0 is running applications, cpu1 executes
> > napi consumption data. The same is to press a channel to 100%, but the
> > utilization rate of cpu0 is 12.7% and the utilization rate of cpu1 is
> > 2.9%.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >   drivers/net/virtio/xsk.c | 28 +++++++++++++++++++++++-----
> >   1 file changed, 23 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > index 36cda2dcf8e7..3973c82d1ad2 100644
> > --- a/drivers/net/virtio/xsk.c
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> >   {
> >   	struct virtnet_info *vi = netdev_priv(dev);
> >   	struct xsk_buff_pool *pool;
> > +	struct netdev_queue *txq;
> >   	struct send_queue *sq;
> >
> >   	if (!netif_running(dev))
> > @@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> >
> >   	rcu_read_lock();
> >   	pool = rcu_dereference(sq->xsk.pool);
> > -	if (pool) {
> > -		local_bh_disable();
> > -		virtqueue_napi_schedule(&sq->napi, sq->vq);
> > -		local_bh_enable();
> > -	}
> > +	if (!pool)
> > +		goto end;
> > +
> > +	if (napi_if_scheduled_mark_missed(&sq->napi))
> > +		goto end;
> > +
> > +	txq = netdev_get_tx_queue(dev, qid);
> > +
> > +	__netif_tx_lock_bh(txq);
> > +
> > +	/* Send part of the packet directly to reduce the delay in sending the
> > +	 * packet, and this can actively trigger the tx interrupts.
> > +	 *
> > +	 * If no packet is sent out, the ring of the device is full. In this
> > +	 * case, we will still get a tx interrupt response. Then we will deal
> > +	 * with the subsequent packet sending work.
> > +	 */
> > +	virtnet_xsk_run(sq, pool, sq->napi.weight, false);
>
>
> This looks tricky, and it won't be efficient since there could be some
> contention on the tx lock.
>
> I wonder if we can simulate the interrupt via IPI like what RPS did.

Let me try.
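
For reference, a rough sketch of the RPS-style IPI idea (illustrative
only: the csd/irq_cpu members and the virtnet_xsk_* names below are
hypothetical, not code from this series):

/* Needs <linux/smp.h>. Modeled on how RPS kicks a remote CPU's backlog
 * via smp_call_function_single_async(), see rps_trigger_softirq() in
 * net/core/dev.c.
 *
 * Assumed one-time setup at queue init:
 *	INIT_CSD(&sq->xsk.csd, virtnet_xsk_remote_napi, sq);
 *	sq->xsk.irq_cpu = <CPU the tx interrupt is affined to>;
 */
static void virtnet_xsk_remote_napi(void *data)
{
	struct send_queue *sq = data;

	/* Runs in IPI (hardirq) context on the target CPU, just like the
	 * real tx interrupt would, so NAPI is scheduled and runs there
	 * instead of on the application CPU.
	 */
	virtqueue_napi_schedule(&sq->napi, sq->vq);
}

static void virtnet_xsk_kick_remote(struct send_queue *sq)
{
	int target = READ_ONCE(sq->xsk.irq_cpu);
	int cpu = get_cpu();	/* no migration while raising the IPI */

	if (target == cpu) {
		local_bh_disable();
		virtqueue_napi_schedule(&sq->napi, sq->vq);
		local_bh_enable();
	} else {
		/* A real implementation must not re-post the csd while a
		 * previous IPI is still in flight (RPS serializes this via
		 * its per-CPU state); that guard is omitted here.
		 */
		smp_call_function_single_async(target, &sq->xsk.csd);
	}
	put_cpu();
}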

>
> In the long run, we may want to extend the spec to support interrupt
> triggering through the driver.

Can we submit this together with the queue reset work?

Thanks.

>
> Thanks
>
>
> > +
> > +	__netif_tx_unlock_bh(txq);
> > +
> > +end:
> >   	rcu_read_unlock();
> >   	return 0;
> >   }
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
  2021-06-17  3:08     ` Jason Wang
  (?)
@ 2021-06-17  5:56     ` Xuan Zhuo
  2021-06-17  6:00         ` Jason Wang
  -1 siblings, 1 reply; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-17  5:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust . li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Thu, 17 Jun 2021 11:08:34 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> > After testing, the performance of calling kick every time is not stable.
> > And if all the packets are sent and kicked again, the performance is not
> > good. So add a module parameter to specify how many packets are sent to
> > call a kick.
> >
> > 8 is a relatively stable value with the best performance.
> >
> > Here is the pps of the test of xsk_kick_thr under different values (from
> > 1 to 64).
> >
> > thr  PPS             thr PPS             thr PPS
> > 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
> > 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
> > 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
> > 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
> > 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
> > 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
> > 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
> > 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
>
>
> I wonder what's the number for the case of non zc xsk?


These data are used to compare the situation of sending different numbers of
packets to virtio at one time. I think it has nothing to do with non-zerocopy
xsk.

Thanks.

>
> Thanks
>
>
> > 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
> > 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
> > 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
> > 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
> > 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
> > 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
> > 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
> > 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
> > 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
> > 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
> > 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
> > 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
> > 21   2845655.55789 | 43  2752951.93264 |
> > 22   3450389.84365 | 44  2753107.27164 |
> >
> > It can be found that when the value of xsk_kick_thr is relatively small,
> > the performance is not good, and when its value is greater than 13, the
> > performance will be more irregular and unstable. It looks similar from 3
> > to 13, I chose 8 as the default value.
> >
> > The test environment is qemu + vhost-net. I modified vhost-net to drop
> > the packets sent by vm directly, so that the cpu of vm can run higher.
> > By default, the processes in the vm and the cpu of softirqd are too low,
> > and there is no obvious difference in the test data.
> >
> > During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
> > run for 300s, the pps of every second was recorded, and the average of
> > the pps was finally taken. The vhost process cpu on the host has also
> > reached 100%.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> >   drivers/net/virtio/virtio_net.c |  1 +
> >   drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
> >   drivers/net/virtio/xsk.h        |  2 ++
> >   3 files changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> > index 9503133e71f0..dfe509939b45 100644
> > --- a/drivers/net/virtio/virtio_net.c
> > +++ b/drivers/net/virtio/virtio_net.c
> > @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
> >   module_param(csum, bool, 0444);
> >   module_param(gso, bool, 0444);
> >   module_param(napi_tx, bool, 0644);
> > +module_param(xsk_kick_thr, int, 0644);
> >
> >   /* FIXME: MTU in config. */
> >   #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > index 3973c82d1ad2..2f3ba6ab4798 100644
> > --- a/drivers/net/virtio/xsk.c
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -5,6 +5,8 @@
> >
> >   #include "virtio_net.h"
> >
> > +int xsk_kick_thr = 8;
> > +
> >   static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> >
> >   static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> > @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> >   	struct xdp_desc desc;
> >   	int err, packet = 0;
> >   	int ret = -EAGAIN;
> > +	int need_kick = 0;
> >
> >   	while (budget-- > 0) {
> >   		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
> > @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> >   		}
> >
> >   		++packet;
> > +		++need_kick;
> > +		if (need_kick > xsk_kick_thr) {
> > +			if (virtqueue_kick_prepare(sq->vq) &&
> > +			    virtqueue_notify(sq->vq))
> > +				++stats->kicks;
> > +
> > +			need_kick = 0;
> > +		}
> >   	}
> >
> >   	if (packet) {
> > -		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> > -			++stats->kicks;
> > +		if (need_kick) {
> > +			if (virtqueue_kick_prepare(sq->vq) &&
> > +			    virtqueue_notify(sq->vq))
> > +				++stats->kicks;
> > +		}
> >
> >   		*done += packet;
> >   		stats->xdp_tx += packet;
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > index fe22cf78d505..4f0f4f9cf23b 100644
> > --- a/drivers/net/virtio/xsk.h
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -7,6 +7,8 @@
> >
> >   #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
> >
> > +extern int xsk_kick_thr;
> > +
> >   /* When xsk disable, under normal circumstances, the network card must reclaim
> >    * all the memory that has been sent and the memory added to the rq queue by
> >    * destroying the queue.
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
  2021-06-17  5:56     ` Xuan Zhuo
@ 2021-06-17  6:00         ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  6:00 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust.li, netdev


On 2021/6/17 1:56 PM, Xuan Zhuo wrote:
> On Thu, 17 Jun 2021 11:08:34 +0800, Jason Wang <jasowang@redhat.com> wrote:
>> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
>>> After testing, the performance of calling kick every time is not stable.
>>> And if all the packets are sent and kicked again, the performance is not
>>> good. So add a module parameter to specify how many packets are sent to
>>> call a kick.
>>>
>>> 8 is a relatively stable value with the best performance.
>>>
>>> Here is the pps of the test of xsk_kick_thr under different values (from
>>> 1 to 64).
>>>
>>> thr  PPS             thr PPS             thr PPS
>>> 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
>>> 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
>>> 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
>>> 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
>>> 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
>>> 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
>>> 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
>>> 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
>>
>> I wonder what's the number for the case of non zc xsk?
>
> These data are used to compare the situation of sending different numbers of
> packets to virtio at one time. I think it has nothing to do with non-zerocopy
> xsk.


Yes, but it would be helpful to see how much we can gain from zerocopy.

Thanks


>
> Thanks.
>
>> Thanks
>>
>>
>>> 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
>>> 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
>>> 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
>>> 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
>>> 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
>>> 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
>>> 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
>>> 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
>>> 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
>>> 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
>>> 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
>>> 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
>>> 21   2845655.55789 | 43  2752951.93264 |
>>> 22   3450389.84365 | 44  2753107.27164 |
>>>
>>> It can be found that when the value of xsk_kick_thr is relatively small,
>>> the performance is not good, and when its value is greater than 13, the
>>> performance will be more irregular and unstable. It looks similar from 3
>>> to 13, I chose 8 as the default value.
>>>
>>> The test environment is qemu + vhost-net. I modified vhost-net to drop
>>> the packets sent by vm directly, so that the cpu of vm can run higher.
>>> By default, the processes in the vm and the cpu of softirqd are too low,
>>> and there is no obvious difference in the test data.
>>>
>>> During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
>>> run for 300s, the pps of every second was recorded, and the average of
>>> the pps was finally taken. The vhost process cpu on the host has also
>>> reached 100%.
>>>
>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
>>> ---
>>>    drivers/net/virtio/virtio_net.c |  1 +
>>>    drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
>>>    drivers/net/virtio/xsk.h        |  2 ++
>>>    3 files changed, 19 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
>>> index 9503133e71f0..dfe509939b45 100644
>>> --- a/drivers/net/virtio/virtio_net.c
>>> +++ b/drivers/net/virtio/virtio_net.c
>>> @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
>>>    module_param(csum, bool, 0444);
>>>    module_param(gso, bool, 0444);
>>>    module_param(napi_tx, bool, 0644);
>>> +module_param(xsk_kick_thr, int, 0644);
>>>
>>>    /* FIXME: MTU in config. */
>>>    #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>>> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
>>> index 3973c82d1ad2..2f3ba6ab4798 100644
>>> --- a/drivers/net/virtio/xsk.c
>>> +++ b/drivers/net/virtio/xsk.c
>>> @@ -5,6 +5,8 @@
>>>
>>>    #include "virtio_net.h"
>>>
>>> +int xsk_kick_thr = 8;
>>> +
>>>    static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
>>>
>>>    static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
>>> @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>>>    	struct xdp_desc desc;
>>>    	int err, packet = 0;
>>>    	int ret = -EAGAIN;
>>> +	int need_kick = 0;
>>>
>>>    	while (budget-- > 0) {
>>>    		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
>>> @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>>>    		}
>>>
>>>    		++packet;
>>> +		++need_kick;
>>> +		if (need_kick > xsk_kick_thr) {
>>> +			if (virtqueue_kick_prepare(sq->vq) &&
>>> +			    virtqueue_notify(sq->vq))
>>> +				++stats->kicks;
>>> +
>>> +			need_kick = 0;
>>> +		}
>>>    	}
>>>
>>>    	if (packet) {
>>> -		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
>>> -			++stats->kicks;
>>> +		if (need_kick) {
>>> +			if (virtqueue_kick_prepare(sq->vq) &&
>>> +			    virtqueue_notify(sq->vq))
>>> +				++stats->kicks;
>>> +		}
>>>
>>>    		*done += packet;
>>>    		stats->xdp_tx += packet;
>>> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
>>> index fe22cf78d505..4f0f4f9cf23b 100644
>>> --- a/drivers/net/virtio/xsk.h
>>> +++ b/drivers/net/virtio/xsk.h
>>> @@ -7,6 +7,8 @@
>>>
>>>    #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
>>>
>>> +extern int xsk_kick_thr;
>>> +
>>>    /* When xsk disable, under normal circumstances, the network card must reclaim
>>>     * all the memory that has been sent and the memory added to the rq queue by
>>>     * destroying the queue.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
  2021-06-17  5:55     ` Xuan Zhuo
@ 2021-06-17  6:01         ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  6:01 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust.li, netdev


On 2021/6/17 1:55 PM, Xuan Zhuo wrote:
> On Thu, 17 Jun 2021 11:07:17 +0800, Jason Wang <jasowang@redhat.com> wrote:
>> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
>>> Calling virtqueue_napi_schedule() in wakeup results in napi running on
>>> the current cpu. If the application is not busy, then there is no
>>> problem. But if the application itself is busy, it will cause a lot of
>>> scheduling.
>>>
>>> If the application is continuously sending data packets, due to the
>>> continuous scheduling between the application and napi, the data packet
>>> transmission will not be smooth, and there will be an obvious delay in
>>> the transmission (you can use tcpdump to see it). When pressing a
>>> channel to 100% (vhost reaches 100%), the cpu where the application is
>>> located reaches 100%.
>>>
>>> This patch sends a small amount of data directly in wakeup. The purpose
>>> of this is to trigger the tx interrupt. The tx interrupt will be
>>> awakened on the cpu of its affinity, and then trigger the operation of
>>> the napi mechanism, napi can continue to consume the xsk tx queue. Two
>>> cpus are running, cpu0 is running applications, cpu1 executes
>>> napi consumption data. The same is to press a channel to 100%, but the
>>> utilization rate of cpu0 is 12.7% and the utilization rate of cpu1 is
>>> 2.9%.
>>>
>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> ---
>>>    drivers/net/virtio/xsk.c | 28 +++++++++++++++++++++++-----
>>>    1 file changed, 23 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
>>> index 36cda2dcf8e7..3973c82d1ad2 100644
>>> --- a/drivers/net/virtio/xsk.c
>>> +++ b/drivers/net/virtio/xsk.c
>>> @@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
>>>    {
>>>    	struct virtnet_info *vi = netdev_priv(dev);
>>>    	struct xsk_buff_pool *pool;
>>> +	struct netdev_queue *txq;
>>>    	struct send_queue *sq;
>>>
>>>    	if (!netif_running(dev))
>>> @@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
>>>
>>>    	rcu_read_lock();
>>>    	pool = rcu_dereference(sq->xsk.pool);
>>> -	if (pool) {
>>> -		local_bh_disable();
>>> -		virtqueue_napi_schedule(&sq->napi, sq->vq);
>>> -		local_bh_enable();
>>> -	}
>>> +	if (!pool)
>>> +		goto end;
>>> +
>>> +	if (napi_if_scheduled_mark_missed(&sq->napi))
>>> +		goto end;
>>> +
>>> +	txq = netdev_get_tx_queue(dev, qid);
>>> +
>>> +	__netif_tx_lock_bh(txq);
>>> +
>>> +	/* Send part of the packet directly to reduce the delay in sending the
>>> +	 * packet, and this can actively trigger the tx interrupts.
>>> +	 *
>>> +	 * If no packet is sent out, the ring of the device is full. In this
>>> +	 * case, we will still get a tx interrupt response. Then we will deal
>>> +	 * with the subsequent packet sending work.
>>> +	 */
>>> +	virtnet_xsk_run(sq, pool, sq->napi.weight, false);
>>
>> This looks tricky, and it won't be efficient since there could be some
>> contention on the tx lock.
>>
>> I wonder if we can simulate the interrupt via IPI like what RPS did.
> Let me try.
>
>> In the long run, we may want to extend the spec to support interrupt
>> triggering through the driver.
> Can we submit this with reset queue?


We need separate features. And it looks to me that it's not as urgent as queue reset.

Thanks


>
> Thanks.
>
>> Thanks
>>
>>
>>> +
>>> +	__netif_tx_unlock_bh(txq);
>>> +
>>> +end:
>>>    	rcu_read_unlock();
>>>    	return 0;
>>>    }


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-17  5:53     ` Xuan Zhuo
@ 2021-06-17  6:03         ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  6:03 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust.li, netdev,
	yuri Benditovich, Andrew Melnychenko


On 2021/6/17 1:53 PM, Xuan Zhuo wrote:
> On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang <jasowang@redhat.com> wrote:
>> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
>>> Compared to the case of xsk tx, the case of xsk zc rx is more
>>> complicated.
>>>
>>> When we process the buf received by vq, we may encounter ordinary
>>> buffers, or xsk buffers. What makes the situation more complicated is
>>> that in the case of mergeable, when num_buffer > 1, we may still
>>> encounter the case where xsk buffer is mixed with ordinary buffer.
>>>
>>> Another thing that makes the situation more complicated is that when we
>>> get an xsk buffer from vq, the xsk bound to this xsk buffer may have
>>> been unbound.
>>>
>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>
>> This is somehow similar to the case of tx where we don't have per vq reset.
>>
>> [...]
>>
>>> -	if (vi->mergeable_rx_bufs)
>>> +	if (is_xsk_ctx(ctx))
>>> +		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
>>> +	else if (vi->mergeable_rx_bufs)
>>>    		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
>>>    					stats);
>>>    	else if (vi->big_packets)
>>> @@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
>>>    	int err;
>>>    	bool oom;
>>>
>>> +	/* Because virtio-net does not yet support flow direct,
>>
>> Note that this is not the case any more. RSS has been supported by
>> virtio spec and qemu/vhost/tap now. We just need some work on the
>> virtio-net driver part (e.g. the ethtool interface).
> Oh, are there any plans? Who is doing this work, can I help?


Qemu and the spec already support RSS.

TAP support is ready via a steering eBPF program; you can try it with
the current qemu master.

The only thing missing is the Linux driver; I think Yuri or Andrew is
working on this.

Thanks


>
> Thanks.
>
>> Thanks
>>
>>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-17  6:03         ` Jason Wang
  (?)
@ 2021-06-17  6:37         ` Xuan Zhuo
  2021-06-17  6:58             ` Jason Wang
  -1 siblings, 1 reply; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-17  6:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrew Melnychenko,
	Andrii Nakryiko, netdev, yuri Benditovich, Björn Töpel,
	dust.li, Jonathan Lemon, KP Singh, Jakub Kicinski, bpf,
	virtualization, David S. Miller, Magnus Karlsson

On Thu, 17 Jun 2021 14:03:29 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/17 1:53 PM, Xuan Zhuo wrote:
> > On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang <jasowang@redhat.com> wrote:
> >> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> >>> Compared to the case of xsk tx, the case of xsk zc rx is more
> >>> complicated.
> >>>
> >>> When we process the buf received by vq, we may encounter ordinary
> >>> buffers, or xsk buffers. What makes the situation more complicated is
> >>> that in the case of mergeable, when num_buffer > 1, we may still
> >>> encounter the case where xsk buffer is mixed with ordinary buffer.
> >>>
> >>> Another thing that makes the situation more complicated is that when we
> >>> get an xsk buffer from vq, the xsk bound to this xsk buffer may have
> >>> been unbound.
> >>>
> >>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>
> >> This is somehow similar to the case of tx where we don't have per vq reset.
> >>
> >> [...]
> >>
> >>> -	if (vi->mergeable_rx_bufs)
> >>> +	if (is_xsk_ctx(ctx))
> >>> +		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
> >>> +	else if (vi->mergeable_rx_bufs)
> >>>    		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
> >>>    					stats);
> >>>    	else if (vi->big_packets)
> >>> @@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
> >>>    	int err;
> >>>    	bool oom;
> >>>
> >>> +	/* Because virtio-net does not yet support flow direct,
> >>
> >> Note that this is not the case any more. RSS has been supported by
> >> virtio spec and qemu/vhost/tap now. We just need some work on the
> >> virtio-net driver part (e.g. the ethtool interface).
> > Oh, are there any plans? Who is doing this work, can I help?
>
>
> Qemu and spec has support RSS.
>
> TAP support is ready via steering eBPF program, you can try to play it
> with current qemu master.
>
> The only thing missed is the Linux driver, I think Yuri or Andrew is
> working on this.

I feel that in the case of xsk, the flow director is more appropriate.

Users may still want to steer packets to a particular channel based on
information such as port/IP (TCP/UDP), and then have xsk process them.
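
A rough sketch of what that could look like through the standard ethtool
flow steering interface (illustrative only: virtnet_add_steering_rule()
is a made-up helper, and the device-side command it would need does not
exist in the virtio spec today):

static int virtnet_set_rxnfc(struct net_device *dev,
			     struct ethtool_rxnfc *cmd)
{
	struct virtnet_info *vi = netdev_priv(dev);
	struct ethtool_rx_flow_spec *fs = &cmd->fs;

	if (cmd->cmd != ETHTOOL_SRXCLSRLINS)
		return -EOPNOTSUPP;

	/* Keep the sketch to one rule type: UDPv4 dst port -> rx queue. */
	if ((fs->flow_type & ~FLOW_EXT) != UDP_V4_FLOW)
		return -EOPNOTSUPP;

	if (fs->ring_cookie >= vi->curr_queue_pairs)
		return -EINVAL;

	/* Hypothetical: program the rule into the device, which would
	 * need a new control-virtqueue command.
	 */
	return virtnet_add_steering_rule(vi,
					 be16_to_cpu(fs->h_u.udp_ip4_spec.pdst),
					 fs->ring_cookie);
}

Userspace could then bind an AF_XDP socket to, say, queue 3 and steer a
flow to it with something like:
ethtool -N eth0 flow-type udp4 dst-port 4242 action 3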

I will try to push the flow director to the spec.

Thanks.

>
> Thanks
>
>
> >
> > Thanks.
> >
> >> Thanks
> >>
> >>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
  2021-06-17  6:00         ` Jason Wang
  (?)
@ 2021-06-17  6:43         ` Xuan Zhuo
  -1 siblings, 0 replies; 80+ messages in thread
From: Xuan Zhuo @ 2021-06-17  6:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: Song Liu, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Yonghong Song,
	John Fastabend, Alexei Starovoitov, Andrii Nakryiko, netdev,
	Björn Töpel, dust.li, Jonathan Lemon, KP Singh,
	Jakub Kicinski, bpf, virtualization, David S. Miller,
	Magnus Karlsson

On Thu, 17 Jun 2021 14:00:40 +0800, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/6/17 1:56 PM, Xuan Zhuo wrote:
> > On Thu, 17 Jun 2021 11:08:34 +0800, Jason Wang <jasowang@redhat.com> wrote:
> >> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
> >>> After testing, the performance of calling kick every time is not stable.
> >>> And if all the packets are sent and kicked again, the performance is not
> >>> good. So add a module parameter to specify how many packets are sent to
> >>> call a kick.
> >>>
> >>> 8 is a relatively stable value with the best performance.
> >>>
> >>> Here is the pps of the test of xsk_kick_thr under different values (from
> >>> 1 to 64).
> >>>
> >>> thr  PPS             thr PPS             thr PPS
> >>> 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
> >>> 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
> >>> 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
> >>> 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
> >>> 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
> >>> 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
> >>> 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
> >>> 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
> >>
> >> I wonder what's the number for the case of non zc xsk?
> >
> > These data are used to compare the situation of sending different numbers of
> > packets to virtio at one time. I think it has nothing to do with non-zerocopy
> > xsk.
>
>
> Yes, but it would be helpful to see how much we can gain from zerocopy.

Okay, I will add the performance data for non-zerocopy xsk in the next patch set.

Thanks.

>
> Thanks
>
>
> >
> > Thanks.
> >
> >> Thanks
> >>
> >>
> >>> 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
> >>> 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
> >>> 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
> >>> 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
> >>> 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
> >>> 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
> >>> 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
> >>> 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
> >>> 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
> >>> 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
> >>> 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
> >>> 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
> >>> 21   2845655.55789 | 43  2752951.93264 |
> >>> 22   3450389.84365 | 44  2753107.27164 |
> >>>
> >>> It can be found that when the value of xsk_kick_thr is relatively small,
> >>> the performance is not good, and when its value is greater than 13, the
> >>> performance will be more irregular and unstable. It looks similar from 3
> >>> to 13, I chose 8 as the default value.
> >>>
> >>> The test environment is qemu + vhost-net. I modified vhost-net to drop
> >>> the packets sent by vm directly, so that the cpu of vm can run higher.
> >>> By default, the processes in the vm and the cpu of softirqd are too low,
> >>> and there is no obvious difference in the test data.
> >>>
> >>> During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was
> >>> run for 300s, the pps of every second was recorded, and the average of
> >>> the pps was finally taken. The vhost process cpu on the host has also
> >>> reached 100%.
> >>>
> >>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> >>> ---
> >>>    drivers/net/virtio/virtio_net.c |  1 +
> >>>    drivers/net/virtio/xsk.c        | 18 ++++++++++++++++--
> >>>    drivers/net/virtio/xsk.h        |  2 ++
> >>>    3 files changed, 19 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c
> >>> index 9503133e71f0..dfe509939b45 100644
> >>> --- a/drivers/net/virtio/virtio_net.c
> >>> +++ b/drivers/net/virtio/virtio_net.c
> >>> @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true;
> >>>    module_param(csum, bool, 0444);
> >>>    module_param(gso, bool, 0444);
> >>>    module_param(napi_tx, bool, 0644);
> >>> +module_param(xsk_kick_thr, int, 0644);
> >>>
> >>>    /* FIXME: MTU in config. */
> >>>    #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> >>> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> >>> index 3973c82d1ad2..2f3ba6ab4798 100644
> >>> --- a/drivers/net/virtio/xsk.c
> >>> +++ b/drivers/net/virtio/xsk.c
> >>> @@ -5,6 +5,8 @@
> >>>
> >>>    #include "virtio_net.h"
> >>>
> >>> +int xsk_kick_thr = 8;
> >>> +
> >>>    static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> >>>
> >>>    static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head)
> >>> @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> >>>    	struct xdp_desc desc;
> >>>    	int err, packet = 0;
> >>>    	int ret = -EAGAIN;
> >>> +	int need_kick = 0;
> >>>
> >>>    	while (budget-- > 0) {
> >>>    		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
> >>> @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> >>>    		}
> >>>
> >>>    		++packet;
> >>> +		++need_kick;
> >>> +		if (need_kick > xsk_kick_thr) {
> >>> +			if (virtqueue_kick_prepare(sq->vq) &&
> >>> +			    virtqueue_notify(sq->vq))
> >>> +				++stats->kicks;
> >>> +
> >>> +			need_kick = 0;
> >>> +		}
> >>>    	}
> >>>
> >>>    	if (packet) {
> >>> -		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> >>> -			++stats->kicks;
> >>> +		if (need_kick) {
> >>> +			if (virtqueue_kick_prepare(sq->vq) &&
> >>> +			    virtqueue_notify(sq->vq))
> >>> +				++stats->kicks;
> >>> +		}
> >>>
> >>>    		*done += packet;
> >>>    		stats->xdp_tx += packet;
> >>> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> >>> index fe22cf78d505..4f0f4f9cf23b 100644
> >>> --- a/drivers/net/virtio/xsk.h
> >>> +++ b/drivers/net/virtio/xsk.h
> >>> @@ -7,6 +7,8 @@
> >>>
> >>>    #define VIRTNET_XSK_BUFF_CTX  ((void *)(unsigned long)~0L)
> >>>
> >>> +extern int xsk_kick_thr;
> >>> +
> >>>    /* When xsk disable, under normal circumstances, the network card must reclaim
> >>>     * all the memory that has been sent and the memory added to the rq queue by
> >>>     * destroying the queue.
>

^ permalink raw reply	[flat|nested] 80+ messages in thread
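
To illustrate the thresholded-kick idea discussed in the patch above: instead of
notifying the device once per packet, the xmit loop accumulates descriptors and
kicks only after the threshold is exceeded, plus one final kick for the remainder
of the batch. Below is a minimal, self-contained userspace C sketch of that
pattern; KICK_THR and notify() are hypothetical stand-ins for the xsk_kick_thr
module parameter and virtqueue_kick_prepare()/virtqueue_notify(), and none of
this is driver code.

    /* kick_thr_sketch.c - illustration only, not driver code */
    #include <stdio.h>

    #define KICK_THR 8   /* stand-in for the xsk_kick_thr module parameter */

    /* Stand-in for virtqueue_kick_prepare() + virtqueue_notify(): in the
     * driver this is the expensive operation (a notification/VM exit) that
     * the threshold tries to amortize. */
    static int notify(void)
    {
            return 1;
    }

    int main(void)
    {
            int budget = 64;
            int packets = 0, need_kick = 0, kicks = 0;

            while (budget-- > 0) {
                    /* pretend one descriptor was queued successfully */
                    packets++;
                    need_kick++;

                    /* kick once the threshold is exceeded, not per packet */
                    if (need_kick > KICK_THR) {
                            kicks += notify();
                            need_kick = 0;
                    }
            }

            /* final kick for any remainder of the batch */
            if (packets && need_kick)
                    kicks += notify();

            printf("queued %d packets with %d notifications\n", packets, kicks);
            return 0;
    }

Built with a plain C compiler, the sketch reports 8 notifications for 64 queued
packets (one kick per KICK_THR + 1 descriptors plus a final one) instead of one
notification per packet, which is the trade-off the pps table above is measuring.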

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-17  6:37         ` Xuan Zhuo
@ 2021-06-17  6:58             ` Jason Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Jason Wang @ 2021-06-17  6:58 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: David S. Miller, Jakub Kicinski, Michael S. Tsirkin,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf, dust.li, netdev,
	yuri Benditovich, Andrew Melnychenko


On 2021/6/17 2:37 PM, Xuan Zhuo wrote:
> On Thu, 17 Jun 2021 14:03:29 +0800, Jason Wang <jasowang@redhat.com> wrote:
>> On 2021/6/17 1:53 PM, Xuan Zhuo wrote:
>>> On Thu, 17 Jun 2021 11:23:52 +0800, Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/6/10 4:22 PM, Xuan Zhuo wrote:
>>>>> Compared to xsk tx, the xsk zc rx case is more complicated.
>>>>>
>>>>> When we process a buf received from the vq, it may be an ordinary buffer
>>>>> or an xsk buffer. What complicates things further is that in the mergeable
>>>>> case, when num_buffer > 1, xsk buffers and ordinary buffers may be mixed
>>>>> within the same packet.
>>>>>
>>>>> Another complication is that by the time we get an xsk buffer from the vq,
>>>>> the xsk bound to it may already have been unbound.
>>>>>
>>>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>> This is somewhat similar to the tx case, where we don't have per-vq reset.
>>>>
>>>> [...]
>>>>
>>>>> -	if (vi->mergeable_rx_bufs)
>>>>> +	if (is_xsk_ctx(ctx))
>>>>> +		skb = receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats);
>>>>> +	else if (vi->mergeable_rx_bufs)
>>>>>     		skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
>>>>>     					stats);
>>>>>     	else if (vi->big_packets)
>>>>> @@ -1175,6 +1296,14 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
>>>>>     	int err;
>>>>>     	bool oom;
>>>>>
>>>>> +	/* Because virtio-net does not yet support flow direct,
>>>> Note that this is no longer the case. RSS is now supported by the virtio
>>>> spec and by qemu/vhost/tap. We just need some work on the virtio-net
>>>> driver side (e.g. the ethtool interface).
>>> Oh, are there any plans? Who is doing this work? Can I help?
>>
>> Qemu and the spec already support RSS.
>>
>> TAP support is ready via a steering eBPF program; you can try it out with
>> the current qemu master.
>>
>> The only thing missing is the Linux driver; I think Yuri or Andrew is
>> working on it.
> I feel that for the xsk case, flow director is more appropriate.
>
> Users may still want to steer packets to a specific channel based on
> information such as port/ip/tcp/udp, and then let xsk process them.
>
> I will try to push flow director into the spec.


That would be fine. On the backend side, it could still be implemented via a
steering eBPF program.

Thanks


>
> Thanks.
>
>> Thanks
>>
>>
>>> Thanks.
>>>
>>>> Thanks
>>>>
>>>>


^ permalink raw reply	[flat|nested] 80+ messages in thread
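
For readers following the is_xsk_ctx() dispatch quoted above: the receive path
tells xsk buffers apart from ordinary ones by the per-buffer ctx that was stored
when the buffer was added to the vq. Here is a minimal standalone C sketch of
that tagging idea; the sentinel encoding and helper names are modeled only
loosely on VIRTNET_XSK_BUFF_CTX / is_xsk_ctx() and do not reproduce the driver's
actual scheme.

    /* ctx_dispatch_sketch.c - standalone illustration only */
    #include <stdio.h>

    #define XSK_BUFF_CTX ((void *)(unsigned long)~0L)  /* sentinel ctx value */

    static int is_xsk_ctx(void *ctx)
    {
            return ctx == XSK_BUFF_CTX;
    }

    /* Dispatch a received buffer according to the ctx it was queued with. */
    static void receive_buf(void *buf, void *ctx)
    {
            if (is_xsk_ctx(ctx))
                    printf("buf %p: xsk zero-copy receive path\n", buf);
            else
                    printf("buf %p: ordinary receive path\n", buf);
    }

    int main(void)
    {
            char ordinary[64], xsk_frame[64];

            /* When adding buffers to the vq, each one is queued together with
             * a ctx recording where it came from; the receive path later uses
             * that ctx to pick the right handler. */
            receive_buf(ordinary, NULL);
            receive_buf(xsk_frame, XSK_BUFF_CTX);
            return 0;
    }

The sketch only covers the dispatch itself; as the commit message above notes,
the real driver additionally has to handle the case where the xsk behind a
tagged buffer has already been unbound by the time the buffer comes back.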

* Re: [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
  2021-06-10  8:22   ` Xuan Zhuo
@ 2021-06-21  3:26     ` kernel test robot
  -1 siblings, 0 replies; 80+ messages in thread
From: kernel test robot @ 2021-06-21  3:26 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: kbuild-all, Jakub Kicinski, Michael S. Tsirkin, Jason Wang,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 2156 bytes --]

Hi Xuan,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Xuan-Zhuo/virtio-net-support-xdp-socket-zero-copy/20210617-033438
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git c7654495916e109f76a67fd3ae68f8fa70ab4faa
config: i386-randconfig-a004-20210620 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/f5f1e60139e7c38fbb4ed58d503e89bbb26c1464
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Xuan-Zhuo/virtio-net-support-xdp-socket-zero-copy/20210617-033438
        git checkout f5f1e60139e7c38fbb4ed58d503e89bbb26c1464
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

ERROR: modpost: missing MODULE_LICENSE() in drivers/net/virtio/xsk.o
ERROR: modpost: "merge_drop_follow_bufs" [drivers/net/virtio/xsk.ko] undefined!
ERROR: modpost: "virtnet_run_xdp" [drivers/net/virtio/xsk.ko] undefined!
ERROR: modpost: "merge_receive_follow_bufs" [drivers/net/virtio/xsk.ko] undefined!
ERROR: modpost: "virtnet_xsk_wakeup" [drivers/net/virtio/virtio_net.ko] undefined!
ERROR: modpost: "receive_xsk" [drivers/net/virtio/virtio_net.ko] undefined!
ERROR: modpost: "add_recvbuf_xsk" [drivers/net/virtio/virtio_net.ko] undefined!
ERROR: modpost: "virtnet_poll_xsk" [drivers/net/virtio/virtio_net.ko] undefined!
>> ERROR: modpost: "virtqueue_detach_unused_buf_ctx" [drivers/net/virtio/virtio_net.ko] undefined!
ERROR: modpost: "virtnet_xsk_ctx_rx_copy" [drivers/net/virtio/virtio_net.ko] undefined!
ERROR: modpost: "virtnet_xsk_complete" [drivers/net/virtio/virtio_net.ko] undefined!
WARNING: modpost: suppressed 1 unresolved symbol warnings because there were too many)

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37985 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread
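
For context on the modpost output above: every kernel module object needs a
MODULE_LICENSE() declaration, and a symbol defined in one module can only be
referenced from another module if it is exported (or if both objects are linked
into a single module). The following minimal kernel-module sketch illustrates
both points; all names are made up for illustration and are not taken from this
series.

    /* demo_module.c - hypothetical sketch, names are illustrative */
    #include <linux/module.h>
    #include <linux/init.h>

    int demo_shared_helper(int x)
    {
            return x + 1;
    }
    /* Without an export like this (or without linking the objects into one
     * module), modpost flags the symbol as undefined in the other module,
     * as in the errors above. */
    EXPORT_SYMBOL(demo_shared_helper);

    static int __init demo_init(void)
    {
            return 0;
    }

    static void __exit demo_exit(void)
    {
    }

    module_init(demo_init);
    module_exit(demo_exit);

    /* Omitting this triggers "missing MODULE_LICENSE()" from modpost. */
    MODULE_LICENSE("GPL");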

end of thread, other threads:[~2021-06-21  3:26 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-10  8:21 [PATCH net-next v5 00/15] virtio-net: support xdp socket zero copy Xuan Zhuo
2021-06-10  8:21 ` Xuan Zhuo
2021-06-10  8:21 ` [PATCH net-next v5 01/15] netdevice: priv_flags extend to 64bit Xuan Zhuo
2021-06-10  8:21   ` Xuan Zhuo
2021-06-10  8:21 ` [PATCH net-next v5 02/15] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR Xuan Zhuo
2021-06-10  8:21   ` Xuan Zhuo
2021-06-10  8:21 ` [PATCH net-next v5 03/15] virtio-net: " Xuan Zhuo
2021-06-10  8:21   ` Xuan Zhuo
2021-06-16  9:27   ` Jason Wang
2021-06-16  9:27     ` Jason Wang
2021-06-16 10:27     ` Xuan Zhuo
2021-06-10  8:21 ` [PATCH net-next v5 04/15] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR Xuan Zhuo
2021-06-10  8:21   ` Xuan Zhuo
2021-06-10  8:21 ` [PATCH net-next v5 05/15] virtio: support virtqueue_detach_unused_buf_ctx Xuan Zhuo
2021-06-10  8:21   ` Xuan Zhuo
2021-06-17  2:48   ` kernel test robot
2021-06-17  2:48     ` kernel test robot
2021-06-10  8:22 ` [PATCH net-next v5 06/15] virtio-net: unify the code for recycling the xmit ptr Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-16  2:42   ` Jason Wang
2021-06-16  2:42     ` Jason Wang
2021-06-10  8:22 ` [PATCH net-next v5 07/15] virtio-net: standalone virtnet_aloc_frag function Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-16  2:45   ` Jason Wang
2021-06-16  2:45     ` Jason Wang
2021-06-10  8:22 ` [PATCH net-next v5 08/15] virtio-net: split the receive_mergeable function Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-16  7:33   ` Jason Wang
2021-06-16  7:33     ` Jason Wang
2021-06-16  7:52     ` Xuan Zhuo
2021-06-10  8:22 ` [PATCH net-next v5 09/15] virtio-net: virtnet_poll_tx support budget check Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-10  8:22 ` [PATCH net-next v5 10/15] virtio-net: independent directory Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-16  7:34   ` Jason Wang
2021-06-16  7:34     ` Jason Wang
2021-06-10  8:22 ` [PATCH net-next v5 11/15] virtio-net: move to virtio_net.h Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-16  7:35   ` Jason Wang
2021-06-16  7:35     ` Jason Wang
2021-06-10  8:22 ` [PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-16  9:26   ` Jason Wang
2021-06-16  9:26     ` Jason Wang
2021-06-16 10:10     ` Magnus Karlsson
2021-06-16 10:19     ` Xuan Zhuo
2021-06-16 12:51       ` Jason Wang
2021-06-16 12:51         ` Jason Wang
2021-06-16 12:57         ` Xuan Zhuo
2021-06-17  2:36           ` Jason Wang
2021-06-17  2:36             ` Jason Wang
2021-06-10  8:22 ` [PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-17  2:48   ` kernel test robot
2021-06-17  2:48     ` kernel test robot
2021-06-17  3:23   ` Jason Wang
2021-06-17  3:23     ` Jason Wang
2021-06-17  5:53     ` Xuan Zhuo
2021-06-17  6:03       ` Jason Wang
2021-06-17  6:03         ` Jason Wang
2021-06-17  6:37         ` Xuan Zhuo
2021-06-17  6:58           ` Jason Wang
2021-06-17  6:58             ` Jason Wang
2021-06-21  3:26   ` kernel test robot
2021-06-21  3:26     ` kernel test robot
2021-06-10  8:22 ` [PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-17  3:07   ` Jason Wang
2021-06-17  3:07     ` Jason Wang
2021-06-17  5:55     ` Xuan Zhuo
2021-06-17  6:01       ` Jason Wang
2021-06-17  6:01         ` Jason Wang
2021-06-10  8:22 ` [PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold Xuan Zhuo
2021-06-10  8:22   ` Xuan Zhuo
2021-06-17  3:08   ` Jason Wang
2021-06-17  3:08     ` Jason Wang
2021-06-17  5:56     ` Xuan Zhuo
2021-06-17  6:00       ` Jason Wang
2021-06-17  6:00         ` Jason Wang
2021-06-17  6:43         ` Xuan Zhuo
