bpf.vger.kernel.org archive mirror
* [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit
@ 2021-04-13  3:15 Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit Xuan Zhuo
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

XDP socket (xsk) is an excellent kernel-bypass network transmission framework.
Its zero copy feature must be supported by the driver, and its performance is
very good. mlx5 and Intel ixgbe already support this feature. This patch set
allows virtio-net to support xsk's zero copy xmit feature.

xsk's zero copy rx requires major changes to virtio-net, so I plan to submit
it after this patch set has been accepted.

Compared with other drivers, virtio-net does not use the DMA address directly,
so I first obtain the xsk page and then pass that page to virtio.

When recycling sent packets, we previously had to distinguish between skb and
xdp frames. Now we have to distinguish between skb, xdp and xsk buffers.
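
For reference, the series distinguishes the three cases by tagging the low
bits of the pointer handed to the virtqueue (a sketch only; the actual
definitions are added in patch 9):

    #define VIRTIO_XDP_FLAG	BIT(0)		/* 01: xdp frame */
    #define VIRTIO_XSK_FLAG	BIT(1)		/* 10: xsk descriptor */

    /* 00: the pointer is a plain skb */
    static bool is_skb_ptr(void *ptr)
    {
    	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
    }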

---------------- Performance Testing ------------

A UDP packet generator implemented on the xsk interface vs sockperf (kernel
UDP) for performance testing:

xsk zero copy xmit in virtio-net:
CPU        PPS         MSGSIZE    vhost-cpu
7.9%       511804      64         100%
13.3%      484373      1500       100%

sockperf:
CPU        PPS         MSGSIZE    vhost-cpu
100%       375227      64         89.1%
100%       307322      1500       81.5%

v4:
    1. add priv_flags IFF_NOT_USE_DMA_ADDR
    2. more reasonable patch split


Xuan Zhuo (10):
  netdevice: priv_flags extend to 64bit
  netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
  virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
  xsk: support get page by addr
  xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
  virtio-net: unify the code for recycling the xmit ptr
  virtio-net: virtnet_poll_tx support budget check
  virtio-net: xsk zero copy xmit setup
  virtio-net: xsk zero copy xmit implement wakeup and xmit
  virtio-net: xsk zero copy xmit kick by threshold

 drivers/net/virtio_net.c   | 479 ++++++++++++++++++++++++++++++++-----
 include/linux/netdevice.h  | 139 ++++++-----
 include/net/xdp_sock_drv.h |  11 +
 net/xdp/xsk_buff_pool.c    |   2 +-
 4 files changed, 511 insertions(+), 120 deletions(-)

--
2.31.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-13 18:26   ` Jakub Kicinski
  2021-04-13  3:15 ` [PATCH net-next v4 02/10] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR Xuan Zhuo
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

The size of priv_flags is 32 bits, and the number of flags already defined
has reached 32. It is time to expand priv_flags to 64 bits.

Here priv_flags is widened to 8 bytes, but the size of struct net_device does
not change; it is still 2176 bytes, because _tx is aligned to a cache line.
A 4-byte hole is left before priv_flags.

Since the fields before and after priv_flags are read-mostly, I did not
reorder the fields here.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/netdevice.h | 136 ++++++++++++++++++++------------------
 1 file changed, 71 insertions(+), 65 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f57b70fc251f..86e4bd08c2f1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1549,6 +1549,8 @@ struct net_device_ops {
                                                          struct net_device_path *path);
 };
 
+typedef u64 netdev_priv_flags_t;
+
 /**
  * enum netdev_priv_flags - &struct net_device priv_flags
  *
@@ -1598,72 +1600,75 @@ struct net_device_ops {
  *	skb_headlen(skb) == 0 (data starts from frag0)
  */
 enum netdev_priv_flags {
-	IFF_802_1Q_VLAN			= 1<<0,
-	IFF_EBRIDGE			= 1<<1,
-	IFF_BONDING			= 1<<2,
-	IFF_ISATAP			= 1<<3,
-	IFF_WAN_HDLC			= 1<<4,
-	IFF_XMIT_DST_RELEASE		= 1<<5,
-	IFF_DONT_BRIDGE			= 1<<6,
-	IFF_DISABLE_NETPOLL		= 1<<7,
-	IFF_MACVLAN_PORT		= 1<<8,
-	IFF_BRIDGE_PORT			= 1<<9,
-	IFF_OVS_DATAPATH		= 1<<10,
-	IFF_TX_SKB_SHARING		= 1<<11,
-	IFF_UNICAST_FLT			= 1<<12,
-	IFF_TEAM_PORT			= 1<<13,
-	IFF_SUPP_NOFCS			= 1<<14,
-	IFF_LIVE_ADDR_CHANGE		= 1<<15,
-	IFF_MACVLAN			= 1<<16,
-	IFF_XMIT_DST_RELEASE_PERM	= 1<<17,
-	IFF_L3MDEV_MASTER		= 1<<18,
-	IFF_NO_QUEUE			= 1<<19,
-	IFF_OPENVSWITCH			= 1<<20,
-	IFF_L3MDEV_SLAVE		= 1<<21,
-	IFF_TEAM			= 1<<22,
-	IFF_RXFH_CONFIGURED		= 1<<23,
-	IFF_PHONY_HEADROOM		= 1<<24,
-	IFF_MACSEC			= 1<<25,
-	IFF_NO_RX_HANDLER		= 1<<26,
-	IFF_FAILOVER			= 1<<27,
-	IFF_FAILOVER_SLAVE		= 1<<28,
-	IFF_L3MDEV_RX_HANDLER		= 1<<29,
-	IFF_LIVE_RENAME_OK		= 1<<30,
-	IFF_TX_SKB_NO_LINEAR		= 1<<31,
+	IFF_802_1Q_VLAN_BIT,
+	IFF_EBRIDGE_BIT,
+	IFF_BONDING_BIT,
+	IFF_ISATAP_BIT,
+	IFF_WAN_HDLC_BIT,
+	IFF_XMIT_DST_RELEASE_BIT,
+	IFF_DONT_BRIDGE_BIT,
+	IFF_DISABLE_NETPOLL_BIT,
+	IFF_MACVLAN_PORT_BIT,
+	IFF_BRIDGE_PORT_BIT,
+	IFF_OVS_DATAPATH_BIT,
+	IFF_TX_SKB_SHARING_BIT,
+	IFF_UNICAST_FLT_BIT,
+	IFF_TEAM_PORT_BIT,
+	IFF_SUPP_NOFCS_BIT,
+	IFF_LIVE_ADDR_CHANGE_BIT,
+	IFF_MACVLAN_BIT,
+	IFF_XMIT_DST_RELEASE_PERM_BIT,
+	IFF_L3MDEV_MASTER_BIT,
+	IFF_NO_QUEUE_BIT,
+	IFF_OPENVSWITCH_BIT,
+	IFF_L3MDEV_SLAVE_BIT,
+	IFF_TEAM_BIT,
+	IFF_RXFH_CONFIGURED_BIT,
+	IFF_PHONY_HEADROOM_BIT,
+	IFF_MACSEC_BIT,
+	IFF_NO_RX_HANDLER_BIT,
+	IFF_FAILOVER_BIT,
+	IFF_FAILOVER_SLAVE_BIT,
+	IFF_L3MDEV_RX_HANDLER_BIT,
+	IFF_LIVE_RENAME_OK_BIT,
+	IFF_TX_SKB_NO_LINEAR_BIT,
 };
 
-#define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
-#define IFF_EBRIDGE			IFF_EBRIDGE
-#define IFF_BONDING			IFF_BONDING
-#define IFF_ISATAP			IFF_ISATAP
-#define IFF_WAN_HDLC			IFF_WAN_HDLC
-#define IFF_XMIT_DST_RELEASE		IFF_XMIT_DST_RELEASE
-#define IFF_DONT_BRIDGE			IFF_DONT_BRIDGE
-#define IFF_DISABLE_NETPOLL		IFF_DISABLE_NETPOLL
-#define IFF_MACVLAN_PORT		IFF_MACVLAN_PORT
-#define IFF_BRIDGE_PORT			IFF_BRIDGE_PORT
-#define IFF_OVS_DATAPATH		IFF_OVS_DATAPATH
-#define IFF_TX_SKB_SHARING		IFF_TX_SKB_SHARING
-#define IFF_UNICAST_FLT			IFF_UNICAST_FLT
-#define IFF_TEAM_PORT			IFF_TEAM_PORT
-#define IFF_SUPP_NOFCS			IFF_SUPP_NOFCS
-#define IFF_LIVE_ADDR_CHANGE		IFF_LIVE_ADDR_CHANGE
-#define IFF_MACVLAN			IFF_MACVLAN
-#define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
-#define IFF_L3MDEV_MASTER		IFF_L3MDEV_MASTER
-#define IFF_NO_QUEUE			IFF_NO_QUEUE
-#define IFF_OPENVSWITCH			IFF_OPENVSWITCH
-#define IFF_L3MDEV_SLAVE		IFF_L3MDEV_SLAVE
-#define IFF_TEAM			IFF_TEAM
-#define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
-#define IFF_PHONY_HEADROOM		IFF_PHONY_HEADROOM
-#define IFF_MACSEC			IFF_MACSEC
-#define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
-#define IFF_FAILOVER			IFF_FAILOVER
-#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
-#define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
-#define IFF_LIVE_RENAME_OK		IFF_LIVE_RENAME_OK
-#define IFF_TX_SKB_NO_LINEAR		IFF_TX_SKB_NO_LINEAR
+#define __IFF_BIT(bit)		((netdev_priv_flags_t)1 << (bit))
+#define __IFF(name)		__IFF_BIT(IFF_##name##_BIT)
+
+#define IFF_802_1Q_VLAN			__IFF(802_1Q_VLAN)
+#define IFF_EBRIDGE			__IFF(EBRIDGE)
+#define IFF_BONDING			__IFF(BONDING)
+#define IFF_ISATAP			__IFF(ISATAP)
+#define IFF_WAN_HDLC			__IFF(WAN_HDLC)
+#define IFF_XMIT_DST_RELEASE		__IFF(XMIT_DST_RELEASE)
+#define IFF_DONT_BRIDGE			__IFF(DONT_BRIDGE)
+#define IFF_DISABLE_NETPOLL		__IFF(DISABLE_NETPOLL)
+#define IFF_MACVLAN_PORT		__IFF(MACVLAN_PORT)
+#define IFF_BRIDGE_PORT			__IFF(BRIDGE_PORT)
+#define IFF_OVS_DATAPATH		__IFF(OVS_DATAPATH)
+#define IFF_TX_SKB_SHARING		__IFF(TX_SKB_SHARING)
+#define IFF_UNICAST_FLT			__IFF(UNICAST_FLT)
+#define IFF_TEAM_PORT			__IFF(TEAM_PORT)
+#define IFF_SUPP_NOFCS			__IFF(SUPP_NOFCS)
+#define IFF_LIVE_ADDR_CHANGE		__IFF(LIVE_ADDR_CHANGE)
+#define IFF_MACVLAN			__IFF(MACVLAN)
+#define IFF_XMIT_DST_RELEASE_PERM	__IFF(XMIT_DST_RELEASE_PERM)
+#define IFF_L3MDEV_MASTER		__IFF(L3MDEV_MASTER)
+#define IFF_NO_QUEUE			__IFF(NO_QUEUE)
+#define IFF_OPENVSWITCH			__IFF(OPENVSWITCH)
+#define IFF_L3MDEV_SLAVE		__IFF(L3MDEV_SLAVE)
+#define IFF_TEAM			__IFF(TEAM)
+#define IFF_RXFH_CONFIGURED		__IFF(RXFH_CONFIGURED)
+#define IFF_PHONY_HEADROOM		__IFF(PHONY_HEADROOM)
+#define IFF_MACSEC			__IFF(MACSEC)
+#define IFF_NO_RX_HANDLER		__IFF(NO_RX_HANDLER)
+#define IFF_FAILOVER			__IFF(FAILOVER)
+#define IFF_FAILOVER_SLAVE		__IFF(FAILOVER_SLAVE)
+#define IFF_L3MDEV_RX_HANDLER		__IFF(L3MDEV_RX_HANDLER)
+#define IFF_LIVE_RENAME_OK		__IFF(LIVE_RENAME_OK)
+#define IFF_TX_SKB_NO_LINEAR		__IFF(TX_SKB_NO_LINEAR)
 
 /* Specifies the type of the struct net_device::ml_priv pointer */
 enum netdev_ml_priv_type {
@@ -1963,7 +1968,8 @@ struct net_device {
 
 	/* Read-mostly cache-line for fast-path access */
 	unsigned int		flags;
-	unsigned int		priv_flags;
+	/* 4 byte hole */
+	netdev_priv_flags_t	priv_flags;
 	const struct net_device_ops *netdev_ops;
 	int			ifindex;
 	unsigned short		gflags;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 02/10] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 03/10] virtio-net: " Xuan Zhuo
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

Some devices, such as virtio-net, do not use the DMA address directly.
Upper-level frameworks such as XDP socket need to be aware of this, so add a
new priv_flag: IFF_NOT_USE_DMA_ADDR.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 86e4bd08c2f1..78b2a8b2c31d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1598,6 +1598,7 @@ typedef u64 netdev_priv_flags_t;
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
  * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
  *	skb_headlen(skb) == 0 (data starts from frag0)
+ * @IFF_NOT_USE_DMA_ADDR: driver not use dma addr directly. such as virtio-net
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN_BIT,
@@ -1632,6 +1633,7 @@ enum netdev_priv_flags {
 	IFF_L3MDEV_RX_HANDLER_BIT,
 	IFF_LIVE_RENAME_OK_BIT,
 	IFF_TX_SKB_NO_LINEAR_BIT,
+	IFF_NOT_USE_DMA_ADDR_BIT,
 };
 
 #define __IFF_BIT(bit)		((netdev_priv_flags_t)1 << (bit))
@@ -1669,6 +1671,7 @@ enum netdev_priv_flags {
 #define IFF_L3MDEV_RX_HANDLER		__IFF(L3MDEV_RX_HANDLER)
 #define IFF_LIVE_RENAME_OK		__IFF(LIVE_RENAME_OK)
 #define IFF_TX_SKB_NO_LINEAR		__IFF(TX_SKB_NO_LINEAR)
+#define IFF_NOT_USE_DMA_ADDR		__IFF(NOT_USE_DMA_ADDR)
 
 /* Specifies the type of the struct net_device::ml_priv pointer */
 enum netdev_ml_priv_type {
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 03/10] virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 02/10] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 04/10] xsk: support get page by addr Xuan Zhuo
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

virtio-net does not use the DMA address directly, so set the new priv_flag
IFF_NOT_USE_DMA_ADDR.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb4ea9dbc16b..52653e234a20 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3007,7 +3007,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
-			   IFF_TX_SKB_NO_LINEAR;
+			   IFF_TX_SKB_NO_LINEAR | IFF_NOT_USE_DMA_ADDR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
 
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 04/10] xsk: support get page by addr
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (2 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 03/10] virtio-net: " Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  6:59   ` Magnus Karlsson
  2021-04-13  3:15 ` [PATCH net-next v4 05/10] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR Xuan Zhuo
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

Add an xsk interface that returns the page corresponding to an address in the
umem. virtio-net does not initialize DMA, so it needs the page to construct a
scatterlist to pass to the vring.
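
As a hedged sketch of how a driver that cannot use DMA addresses might
consume the new helper (the unaligned case where a descriptor spans two pages
is omitted; 'pool', 'desc' and 'sg' are assumed to come from the caller),
mirroring the later virtio-net patch in this series:

    void *data = xsk_buff_raw_get_data(pool, desc->addr);
    unsigned int offset = offset_in_page(data);
    struct page *page = xsk_buff_xdp_get_page(pool, desc->addr);

    /* Reference the xsk frame by page + offset instead of a DMA address. */
    sg_set_page(sg, page, desc->len, offset);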

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 include/net/xdp_sock_drv.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 4e295541e396..1d08b5d8d15f 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -72,6 +72,12 @@ static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp)
 	return xp_get_frame_dma(xskb);
 }
 
+static inline struct page *xsk_buff_xdp_get_page(struct xsk_buff_pool *pool, u64 addr)
+{
+	addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
+	return pool->umem->pgs[addr >> PAGE_SHIFT];
+}
+
 static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool)
 {
 	return xp_alloc(pool);
@@ -207,6 +213,11 @@ static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp)
 	return 0;
 }
 
+static inline struct page *xsk_buff_xdp_get_page(struct xsk_buff_pool *pool, u64 addr)
+{
+	return NULL;
+}
+
 static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool)
 {
 	return NULL;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 05/10] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (3 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 04/10] xsk: support get page by addr Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  7:00   ` Magnus Karlsson
  2021-04-13  3:15 ` [PATCH net-next v4 06/10] virtio-net: unify the code for recycling the xmit ptr Xuan Zhuo
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

Some devices, such as virtio-net, do not use the DMA address directly. These
devices do not set up DMA mappings when completing the xsk setup, so skip the
DMA check for them.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/xdp/xsk_buff_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 8de01aaac4a0..a7e434de0308 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -171,7 +171,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		goto err_unreg_pool;
 
-	if (!pool->dma_pages) {
+	if (!(netdev->priv_flags & IFF_NOT_USE_DMA_ADDR) && !pool->dma_pages) {
 		WARN(1, "Driver did not DMA map zero-copy buffers");
 		err = -EINVAL;
 		goto err_unreg_xsk;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 06/10] virtio-net: unify the code for recycling the xmit ptr
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (4 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 05/10] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  3:32   ` Jason Wang
  2021-04-13  3:15 ` [PATCH net-next v4 07/10] virtio-net: virtnet_poll_tx support budget check Xuan Zhuo
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

There are currently two buffer types, "skb" and "xdp frame", to handle when
recycling old xmit buffers.

The two cases are handled by two very similar but independent
implementations, which makes it inconvenient to add new types. So extract a
common function and call it everywhere old xmit pointers are recycled.

Rename free_old_xmit_skbs() to free_old_xmit().

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 86 ++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 48 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 52653e234a20..f3752b254965 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -264,6 +264,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
 	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
 }
 
+static void __free_old_xmit(struct send_queue *sq, bool in_napi,
+			    struct virtnet_sq_stats *stats)
+{
+	unsigned int len;
+	void *ptr;
+
+	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		if (likely(!is_xdp_frame(ptr))) {
+			struct sk_buff *skb = ptr;
+
+			pr_debug("Sent skb %p\n", skb);
+
+			stats->bytes += skb->len;
+			napi_consume_skb(skb, in_napi);
+		} else {
+			struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+			stats->bytes += frame->len;
+			xdp_return_frame(frame);
+		}
+		stats->packets++;
+	}
+}
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -525,15 +549,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 			    int n, struct xdp_frame **frames, u32 flags)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_sq_stats stats = {};
 	struct receive_queue *rq = vi->rq;
 	struct bpf_prog *xdp_prog;
 	struct send_queue *sq;
-	unsigned int len;
-	int packets = 0;
-	int bytes = 0;
 	int nxmit = 0;
 	int kicks = 0;
-	void *ptr;
 	int ret;
 	int i;
 
@@ -552,20 +573,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	}
 
 	/* Free up any pending old buffers before queueing new ones. */
-	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (likely(is_xdp_frame(ptr))) {
-			struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-			bytes += frame->len;
-			xdp_return_frame(frame);
-		} else {
-			struct sk_buff *skb = ptr;
-
-			bytes += skb->len;
-			napi_consume_skb(skb, false);
-		}
-		packets++;
-	}
+	__free_old_xmit(sq, false, &stats);
 
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
@@ -582,8 +590,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	}
 out:
 	u64_stats_update_begin(&sq->stats.syncp);
-	sq->stats.bytes += bytes;
-	sq->stats.packets += packets;
+	sq->stats.bytes += stats.bytes;
+	sq->stats.packets += stats.packets;
 	sq->stats.xdp_tx += n;
 	sq->stats.xdp_tx_drops += n - nxmit;
 	sq->stats.kicks += kicks;
@@ -1406,39 +1414,21 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
 	return stats.packets;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
+static void free_old_xmit(struct send_queue *sq, bool in_napi)
 {
-	unsigned int len;
-	unsigned int packets = 0;
-	unsigned int bytes = 0;
-	void *ptr;
-
-	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (likely(!is_xdp_frame(ptr))) {
-			struct sk_buff *skb = ptr;
-
-			pr_debug("Sent skb %p\n", skb);
+	struct virtnet_sq_stats stats = {};
 
-			bytes += skb->len;
-			napi_consume_skb(skb, in_napi);
-		} else {
-			struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-			bytes += frame->len;
-			xdp_return_frame(frame);
-		}
-		packets++;
-	}
+	__free_old_xmit(sq, in_napi, &stats);
 
 	/* Avoid overhead when no packets have been processed
 	 * happens when called speculatively from start_xmit.
 	 */
-	if (!packets)
+	if (!stats.packets)
 		return;
 
 	u64_stats_update_begin(&sq->stats.syncp);
-	sq->stats.bytes += bytes;
-	sq->stats.packets += packets;
+	sq->stats.bytes += stats.bytes;
+	sq->stats.packets += stats.packets;
 	u64_stats_update_end(&sq->stats.syncp);
 }
 
@@ -1463,7 +1453,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
 		return;
 
 	if (__netif_tx_trylock(txq)) {
-		free_old_xmit_skbs(sq, true);
+		free_old_xmit(sq, true);
 		__netif_tx_unlock(txq);
 	}
 
@@ -1548,7 +1538,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 
 	txq = netdev_get_tx_queue(vi->dev, index);
 	__netif_tx_lock(txq, raw_smp_processor_id());
-	free_old_xmit_skbs(sq, true);
+	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
 	virtqueue_napi_complete(napi, sq->vq, 0);
@@ -1617,7 +1607,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool use_napi = sq->napi.weight;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(sq, false);
+	free_old_xmit(sq, false);
 
 	if (use_napi && kick)
 		virtqueue_enable_cb_delayed(sq->vq);
@@ -1661,7 +1651,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (!use_napi &&
 		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq, false);
+			free_old_xmit(sq, false);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 07/10] virtio-net: virtnet_poll_tx support budget check
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (5 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 06/10] virtio-net: unify the code for recycling the xmit ptr Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  3:34   ` Jason Wang
  2021-04-13  3:15 ` [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup Xuan Zhuo
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

Make virtnet_poll_tx() check the work done against the budget like other
network drivers.

When work < budget, napi_poll() in dev.c exits directly and
virtqueue_napi_complete() is called to complete napi. If completing napi
fails, or there is still data to be processed, virtqueue_napi_complete()
reschedules napi, which does not conflict with the napi_poll() logic.

When work == budget, virtnet_poll_tx() returns 'work', and napi_poll() in
dev.c re-adds napi to the poll list.

The purpose of this patch is to support xsk xmit in virtnet_poll_tx() in a
subsequent patch.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f3752b254965..f52a25091322 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1529,6 +1529,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	unsigned int index = vq2txq(sq->vq);
 	struct netdev_queue *txq;
+	int work_done = 0;
 
 	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
 		/* We don't need to enable cb for XDP */
@@ -1541,12 +1542,13 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
-	virtqueue_napi_complete(napi, sq->vq, 0);
+	if (work_done < budget)
+		virtqueue_napi_complete(napi, sq->vq, 0);
 
 	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
 		netif_tx_wake_queue(txq);
 
-	return 0;
+	return work_done;
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (6 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 07/10] virtio-net: virtnet_poll_tx support budget check Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  4:01   ` Jason Wang
  2021-04-14  7:36   ` Magnus Karlsson
  2021-04-13  3:15 ` [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit Xuan Zhuo
  2021-04-13  3:15 ` [PATCH net-next v4 10/10] virtio-net: xsk zero copy xmit kick by threshold Xuan Zhuo
  9 siblings, 2 replies; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

xsk is a high-performance packet receive and transmit technology.

This patch implements binding and unbinding an xsk pool to a virtio-net queue
for xsk zero copy xmit.

xsk zero copy xmit depends on tx napi, so if tx napi is not enabled an error
is returned. The entire operation is protected by rtnl_lock.

While xsk is active, ethtool is prevented from modifying tx napi.
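
For context, a hedged userspace sketch (not part of this patch; it uses the
libbpf AF_XDP helpers current at the time, and assumes 'umem' and the 'tx'
ring have already been set up): binding a zero copy TX socket to queue 0 of
the device is what ultimately reaches virtnet_xsk_pool_enable() through
XDP_SETUP_XSK_POOL.

    /* Headers: <bpf/xsk.h>, <linux/if_xdp.h> */
    struct xsk_socket_config cfg = {
    	.tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
    	.bind_flags = XDP_ZEROCOPY,	/* fails if the driver lacks ZC support */
    };
    struct xsk_socket *xsk = NULL;

    if (xsk_socket__create(&xsk, "eth0", 0, umem, NULL /* no rx */, &tx, &cfg))
    	/* fall back to copy mode or report the error */;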

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 78 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f52a25091322..8242a9e9f17d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -22,6 +22,7 @@
 #include <net/route.h>
 #include <net/xdp.h>
 #include <net/net_failover.h>
+#include <net/xdp_sock_drv.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -133,6 +134,11 @@ struct send_queue {
 	struct virtnet_sq_stats stats;
 
 	struct napi_struct napi;
+
+	struct {
+		/* xsk pool */
+		struct xsk_buff_pool __rcu *pool;
+	} xsk;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -2249,8 +2255,19 @@ static int virtnet_set_coalesce(struct net_device *dev,
 	if (napi_weight ^ vi->sq[0].napi.weight) {
 		if (dev->flags & IFF_UP)
 			return -EBUSY;
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			/* xsk xmit depend on the tx napi. So if xsk is active,
+			 * prevent modifications to tx napi.
+			 */
+			rcu_read_lock();
+			if (rcu_dereference(vi->sq[i].xsk.pool)) {
+				rcu_read_unlock();
+				continue;
+			}
+			rcu_read_unlock();
+
 			vi->sq[i].napi.weight = napi_weight;
+		}
 	}
 
 	return 0;
@@ -2518,11 +2535,70 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return err;
 }
 
+static int virtnet_xsk_pool_enable(struct net_device *dev,
+				   struct xsk_buff_pool *pool,
+				   u16 qid)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct send_queue *sq;
+
+	if (qid >= vi->curr_queue_pairs)
+		return -EINVAL;
+
+	sq = &vi->sq[qid];
+
+	/* xsk zerocopy depend on the tx napi.
+	 *
+	 * xsk zerocopy xmit is driven by the tx interrupt. When the device is
+	 * not busy, napi will be called continuously to send data. When the
+	 * device is busy, wait for the notification interrupt after the
+	 * hardware has finished processing the data, and continue to send data
+	 * in napi.
+	 */
+	if (!sq->napi.weight)
+		return -EPERM;
+
+	rcu_read_lock();
+	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
+	 * safe.
+	 */
+	rcu_assign_pointer(sq->xsk.pool, pool);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct send_queue *sq;
+
+	if (qid >= vi->curr_queue_pairs)
+		return -EINVAL;
+
+	sq = &vi->sq[qid];
+
+	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
+	 * safe.
+	 */
+	rcu_assign_pointer(sq->xsk.pool, NULL);
+
+	synchronize_net(); /* Sync with the XSK wakeup and with NAPI. */
+
+	return 0;
+}
+
 static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
 		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_SETUP_XSK_POOL:
+		if (xdp->xsk.pool)
+			return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
+						       xdp->xsk.queue_id);
+		else
+			return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (7 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  5:46   ` Jason Wang
  2021-04-13  3:15 ` [PATCH net-next v4 10/10] virtio-net: xsk zero copy xmit kick by threshold Xuan Zhuo
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

This patch implements the core part of xsk zero copy xmit.

When the user calls sendto() to consume the data in the xsk tx queue,
virtnet_xsk_wakeup() is called (see the userspace sketch below).

In wakeup, it tries to send part of the data directly. This serves two
purposes:

1. Send part of the data quickly, to reduce the transmission delay of the
   first packet.
2. Trigger the tx interrupt, which starts napi to consume the xsk tx data.
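
A hedged sketch of the userspace side that triggers this wakeup (not part of
the patch; it uses the libbpf AF_XDP ring helpers, and 'xsk', 'tx',
'frame_addr' and 'frame_len' are assumed to exist already):

    __u32 idx;

    /* Produce one descriptor to the TX ring ... */
    if (xsk_ring_prod__reserve(&tx, 1, &idx) == 1) {
    	xsk_ring_prod__tx_desc(&tx, idx)->addr = frame_addr;
    	xsk_ring_prod__tx_desc(&tx, idx)->len  = frame_len;
    	xsk_ring_prod__submit(&tx, 1);
    }

    /* ... then wake the driver; this ends up in ndo_xsk_wakeup,
     * i.e. virtnet_xsk_wakeup().
     */
    sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);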

All sent xsk packets share the same virtio-net header, xsk_hdr. If xsk needs
to support csum and other features later, consider allocating a separate hdr
for each sent packet.

free_old_xmit() now has to handle three cases: skb, xdp frame and xsk desc.
They are distinguished by the last two bits of the ptr returned by
virtqueue_get_buf():
    00 is an skb (the default)
    01 is a packet sent by xdp
    10 is a packet sent by xsk

If the xsk xmit work has not been completed but the ring is full, napi must
exit and wait for ring space to become available, so need_wakeup is set. If
free_old_xmit() is then called first by start_xmit(), we can quickly wake up
napi to continue the xsk xmit work.

When recycling, we need to count the number of bytes sent, so the xsk
desc->len is stored in the ptr itself, since for xsk the ptr does not point
to a meaningful object.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 296 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 292 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8242a9e9f17d..c441d6bf1510 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -46,6 +46,11 @@ module_param(napi_tx, bool, 0644);
 #define VIRTIO_XDP_REDIR	BIT(1)
 
 #define VIRTIO_XDP_FLAG	BIT(0)
+#define VIRTIO_XSK_FLAG	BIT(1)
+
+#define VIRTIO_XSK_PTR_SHIFT       4
+
+static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
 
 /* RX packet size EWMA. The average packet size is used to determine the packet
  * buffer size when refilling RX rings. As the entire RX ring may be refilled
@@ -138,6 +143,12 @@ struct send_queue {
 	struct {
 		/* xsk pool */
 		struct xsk_buff_pool __rcu *pool;
+
+		/* save the desc for next xmit, when xmit fail. */
+		struct xdp_desc last_desc;
+
+		/* xsk wait for tx inter or softirq */
+		bool need_wakeup;
 	} xsk;
 };
 
@@ -255,6 +266,15 @@ struct padded_vnet_hdr {
 	char padding[4];
 };
 
+static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
+			   int budget, bool in_napi);
+static void virtnet_xsk_complete(struct send_queue *sq, u32 num);
+
+static bool is_skb_ptr(void *ptr)
+{
+	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
+}
+
 static bool is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
@@ -265,6 +285,19 @@ static void *xdp_to_ptr(struct xdp_frame *ptr)
 	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
 }
 
+static void *xsk_to_ptr(struct xdp_desc *desc)
+{
+	/* save the desc len to ptr */
+	u64 p = desc->len << VIRTIO_XSK_PTR_SHIFT;
+
+	return (void *)((unsigned long)p | VIRTIO_XSK_FLAG);
+}
+
+static void ptr_to_xsk(void *ptr, struct xdp_desc *desc)
+{
+	desc->len = ((unsigned long)ptr) >> VIRTIO_XSK_PTR_SHIFT;
+}
+
 static struct xdp_frame *ptr_to_xdp(void *ptr)
 {
 	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
@@ -273,25 +306,35 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
 static void __free_old_xmit(struct send_queue *sq, bool in_napi,
 			    struct virtnet_sq_stats *stats)
 {
+	unsigned int xsknum = 0;
 	unsigned int len;
 	void *ptr;
 
 	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (likely(!is_xdp_frame(ptr))) {
+		if (is_skb_ptr(ptr)) {
 			struct sk_buff *skb = ptr;
 
 			pr_debug("Sent skb %p\n", skb);
 
 			stats->bytes += skb->len;
 			napi_consume_skb(skb, in_napi);
-		} else {
+		} else if (is_xdp_frame(ptr)) {
 			struct xdp_frame *frame = ptr_to_xdp(ptr);
 
 			stats->bytes += frame->len;
 			xdp_return_frame(frame);
+		} else {
+			struct xdp_desc desc;
+
+			ptr_to_xsk(ptr, &desc);
+			stats->bytes += desc.len;
+			++xsknum;
 		}
 		stats->packets++;
 	}
+
+	if (xsknum)
+		virtnet_xsk_complete(sq, xsknum);
 }
 
 /* Converting between virtqueue no. and kernel tx/rx queue no.
@@ -1529,6 +1572,19 @@ static int virtnet_open(struct net_device *dev)
 	return 0;
 }
 
+static int virtnet_poll_xsk(struct send_queue *sq, int budget)
+{
+	struct xsk_buff_pool *pool;
+	int work_done = 0;
+
+	rcu_read_lock();
+	pool = rcu_dereference(sq->xsk.pool);
+	if (pool)
+		work_done = virtnet_xsk_run(sq, pool, budget, true);
+	rcu_read_unlock();
+	return work_done;
+}
+
 static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 {
 	struct send_queue *sq = container_of(napi, struct send_queue, napi);
@@ -1545,6 +1601,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 
 	txq = netdev_get_tx_queue(vi->dev, index);
 	__netif_tx_lock(txq, raw_smp_processor_id());
+	work_done += virtnet_poll_xsk(sq, budget);
 	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
@@ -2535,6 +2592,234 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return err;
 }
 
+static void virtnet_xsk_check_queue(struct send_queue *sq)
+{
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct net_device *dev = vi->dev;
+	int qnum = sq - vi->sq;
+
+	/* If this sq is not the exclusive queue of the current cpu,
+	 * then it may be called by start_xmit, so check it running out
+	 * of space.
+	 *
+	 * And if it is a raw buffer queue, it does not check whether the status
+	 * of the queue is stopped when sending. So there is no need to check
+	 * the situation of the raw buffer queue.
+	 */
+	if (is_xdp_raw_buffer_queue(vi, qnum))
+		return;
+
+	/* Stop the queue to avoid getting packets that we are
+	 * then unable to transmit. Then wait the tx interrupt.
+	 */
+	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
+		netif_stop_subqueue(dev, qnum);
+}
+
+static void virtnet_xsk_complete(struct send_queue *sq, u32 num)
+{
+	struct xsk_buff_pool *pool;
+
+	rcu_read_lock();
+
+	pool = rcu_dereference(sq->xsk.pool);
+	if (!pool) {
+		rcu_read_unlock();
+		return;
+	}
+	xsk_tx_completed(pool, num);
+	rcu_read_unlock();
+
+	if (sq->xsk.need_wakeup) {
+		sq->xsk.need_wakeup = false;
+		virtqueue_napi_schedule(&sq->napi, sq->vq);
+	}
+}
+
+static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
+			    struct xdp_desc *desc)
+{
+	struct virtnet_info *vi;
+	u32 offset, n, len;
+	struct page *page;
+	void *data;
+	u64 addr;
+	int err;
+
+	vi = sq->vq->vdev->priv;
+	addr = desc->addr;
+
+	data = xsk_buff_raw_get_data(pool, addr);
+	offset = offset_in_page(data);
+
+	/* xsk unaligned mode, desc may use two pages */
+	if (desc->len > PAGE_SIZE - offset)
+		n = 3;
+	else
+		n = 2;
+
+	sg_init_table(sq->sg, n);
+	sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
+
+	/* handle for xsk first page */
+	len = min_t(int, desc->len, PAGE_SIZE - offset);
+	page = xsk_buff_xdp_get_page(pool, addr);
+	sg_set_page(sq->sg + 1, page, len, offset);
+
+	/* xsk unaligned mode, handle for the second page */
+	if (len < desc->len) {
+		page = xsk_buff_xdp_get_page(pool, addr + len);
+		len = min_t(int, desc->len - len, PAGE_SIZE);
+		sg_set_page(sq->sg + 2, page, len, 0);
+	}
+
+	err = virtqueue_add_outbuf(sq->vq, sq->sg, n, xsk_to_ptr(desc),
+				   GFP_ATOMIC);
+	if (unlikely(err))
+		sq->xsk.last_desc = *desc;
+
+	return err;
+}
+
+static int virtnet_xsk_xmit_batch(struct send_queue *sq,
+				  struct xsk_buff_pool *pool,
+				  unsigned int budget,
+				  bool in_napi, int *done,
+				  struct virtnet_sq_stats *stats)
+{
+	struct xdp_desc desc;
+	int err, packet = 0;
+	int ret = -EAGAIN;
+
+	if (sq->xsk.last_desc.addr) {
+		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
+			return -EBUSY;
+
+		err = virtnet_xsk_xmit(sq, pool, &sq->xsk.last_desc);
+		if (unlikely(err))
+			return -EBUSY;
+
+		++packet;
+		--budget;
+		sq->xsk.last_desc.addr = 0;
+	}
+
+	while (budget-- > 0) {
+		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
+			ret = -EBUSY;
+			break;
+		}
+
+		if (!xsk_tx_peek_desc(pool, &desc)) {
+			/* done */
+			ret = 0;
+			break;
+		}
+
+		err = virtnet_xsk_xmit(sq, pool, &desc);
+		if (unlikely(err)) {
+			ret = -EBUSY;
+			break;
+		}
+
+		++packet;
+	}
+
+	if (packet) {
+		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
+			++stats->kicks;
+
+		*done = packet;
+		stats->xdp_tx += packet;
+
+		xsk_tx_release(pool);
+	}
+
+	return ret;
+}
+
+static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
+			   int budget, bool in_napi)
+{
+	struct virtnet_sq_stats stats = {};
+	int done = 0;
+	int err;
+
+	sq->xsk.need_wakeup = false;
+	__free_old_xmit(sq, in_napi, &stats);
+
+	/* return err:
+	 * -EAGAIN: done == budget
+	 * -EBUSY:  done < budget
+	 *  0    :  done < budget
+	 */
+	err = virtnet_xsk_xmit_batch(sq, pool, budget, in_napi, &done, &stats);
+	if (err == -EBUSY) {
+		__free_old_xmit(sq, in_napi, &stats);
+
+		/* If the space is enough, let napi run again. */
+		if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+			done = budget;
+		else
+			sq->xsk.need_wakeup = true;
+	}
+
+	virtnet_xsk_check_queue(sq);
+
+	u64_stats_update_begin(&sq->stats.syncp);
+	sq->stats.packets += stats.packets;
+	sq->stats.bytes += stats.bytes;
+	sq->stats.kicks += stats.kicks;
+	sq->stats.xdp_tx += stats.xdp_tx;
+	u64_stats_update_end(&sq->stats.syncp);
+
+	return done;
+}
+
+static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct xsk_buff_pool *pool;
+	struct netdev_queue *txq;
+	struct send_queue *sq;
+
+	if (!netif_running(dev))
+		return -ENETDOWN;
+
+	if (qid >= vi->curr_queue_pairs)
+		return -EINVAL;
+
+	sq = &vi->sq[qid];
+
+	rcu_read_lock();
+
+	pool = rcu_dereference(sq->xsk.pool);
+	if (!pool)
+		goto end;
+
+	if (napi_if_scheduled_mark_missed(&sq->napi))
+		goto end;
+
+	txq = netdev_get_tx_queue(dev, qid);
+
+	__netif_tx_lock_bh(txq);
+
+	/* Send part of the packet directly to reduce the delay in sending the
+	 * packet, and this can actively trigger the tx interrupts.
+	 *
+	 * If no packet is sent out, the ring of the device is full. In this
+	 * case, we will still get a tx interrupt response. Then we will deal
+	 * with the subsequent packet sending work.
+	 */
+	virtnet_xsk_run(sq, pool, napi_weight, false);
+
+	__netif_tx_unlock_bh(txq);
+
+end:
+	rcu_read_unlock();
+	return 0;
+}
+
 static int virtnet_xsk_pool_enable(struct net_device *dev,
 				   struct xsk_buff_pool *pool,
 				   u16 qid)
@@ -2559,6 +2844,8 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 		return -EPERM;
 
 	rcu_read_lock();
+	memset(&sq->xsk, 0, sizeof(sq->xsk));
+
 	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
 	 * safe.
 	 */
@@ -2658,6 +2945,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
 	.ndo_bpf		= virtnet_xdp,
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
+	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 	.ndo_set_features	= virtnet_set_features,
@@ -2761,9 +3049,9 @@ static void free_unused_bufs(struct virtnet_info *vi)
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		struct virtqueue *vq = vi->sq[i].vq;
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (!is_xdp_frame(buf))
+			if (is_skb_ptr(buf))
 				dev_kfree_skb(buf);
-			else
+			else if (is_xdp_frame(buf))
 				xdp_return_frame(ptr_to_xdp(buf));
 		}
 	}
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net-next v4 10/10] virtio-net: xsk zero copy xmit kick by threshold
  2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
                   ` (8 preceding siblings ...)
  2021-04-13  3:15 ` [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit Xuan Zhuo
@ 2021-04-13  3:15 ` Xuan Zhuo
  2021-04-14  5:51   ` Jason Wang
  9 siblings, 1 reply; 21+ messages in thread
From: Xuan Zhuo @ 2021-04-13  3:15 UTC (permalink / raw)
  To: netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

Testing shows that kicking after every packet gives unstable performance, and
kicking only once after all packets have been queued does not perform well
either. So add a module parameter that specifies after how many queued
packets a kick is issued.

8 is a relatively stable value with the best performance.

Here is the pps measured for xsk_kick_thr values from 1 to 64:

thr  PPS             thr PPS             thr PPS
1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
21   2845655.55789 | 43  2752951.93264 |
22   3450389.84365 | 44  2753107.27164 |

When the value of xsk_kick_thr is relatively small the performance is poor,
and when it is greater than 13 the performance becomes irregular and
unstable. The results from 3 to 13 look similar, so I chose 8 as the default
value.
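
Since the parameter is created with mode 0644, it should also be tunable at
runtime through the usual module parameter path (an assumption based on
standard module_param sysfs behaviour, not something stated in the patch),
e.g.:

    echo 16 > /sys/module/virtio_net/parameters/xsk_kick_thr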

The test environment is qemu + vhost-net. I modified vhost-net to drop the
packets sent by the VM directly, so that the VM's CPU can run higher; by
default the CPU usage of the processes in the VM and of softirqd is too low,
and there is no obvious difference in the test data.

During the test, the softirq CPU reached 100%. Each xsk_kick_thr value was
run for 300s, the pps of every second was recorded, and the average pps was
taken. The vhost process CPU on the host also reached 100%.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c441d6bf1510..4e360bfc2cf0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -28,9 +28,11 @@ static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
 static bool csum = true, gso = true, napi_tx = true;
+static int xsk_kick_thr = 8;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(xsk_kick_thr, int, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -2690,6 +2692,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 	struct xdp_desc desc;
 	int err, packet = 0;
 	int ret = -EAGAIN;
+	int need_kick = 0;
 
 	if (sq->xsk.last_desc.addr) {
 		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
@@ -2700,6 +2703,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 			return -EBUSY;
 
 		++packet;
+		++need_kick;
 		--budget;
 		sq->xsk.last_desc.addr = 0;
 	}
@@ -2723,11 +2727,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
 		}
 
 		++packet;
+		++need_kick;
+		if (need_kick > xsk_kick_thr) {
+			if (virtqueue_kick_prepare(sq->vq) &&
+			    virtqueue_notify(sq->vq))
+				++stats->kicks;
+
+			need_kick = 0;
+		}
 	}
 
 	if (packet) {
-		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
-			++stats->kicks;
+		if (need_kick) {
+			if (virtqueue_kick_prepare(sq->vq) &&
+			    virtqueue_notify(sq->vq))
+				++stats->kicks;
+		}
 
 		*done = packet;
 		stats->xdp_tx += packet;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit
  2021-04-13  3:15 ` [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit Xuan Zhuo
@ 2021-04-13 18:26   ` Jakub Kicinski
  0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2021-04-13 18:26 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: netdev, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li

On Tue, 13 Apr 2021 11:15:14 +0800 Xuan Zhuo wrote:
> @@ -1598,72 +1600,75 @@ struct net_device_ops {
>   *	skb_headlen(skb) == 0 (data starts from frag0)
>   */
>  enum netdev_priv_flags {
> -	IFF_802_1Q_VLAN			= 1<<0,
> +	IFF_802_1Q_VLAN_BIT,

This breaks kdoc; the kdoc now documents values which don't exist
in the enum.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 06/10] virtio-net: unify the code for recycling the xmit ptr
  2021-04-13  3:15 ` [PATCH net-next v4 06/10] virtio-net: unify the code for recycling the xmit ptr Xuan Zhuo
@ 2021-04-14  3:32   ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-04-14  3:32 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Michael S. Tsirkin, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li


On 2021/4/13 11:15 AM, Xuan Zhuo wrote:
> Now there are two types of "skb" and "xdp frame" during recycling old
> xmit.
>
> There are two completely similar and independent implementations. This
> is inconvenient for the subsequent addition of new types. So extract a
> function from this piece of code and call this function uniformly to
> recover old xmit ptr.
>
> Rename free_old_xmit_skbs() to free_old_xmit().
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   drivers/net/virtio_net.c | 86 ++++++++++++++++++----------------------
>   1 file changed, 38 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 52653e234a20..f3752b254965 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -264,6 +264,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>   }
>   
> +static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> +			    struct virtnet_sq_stats *stats)
> +{
> +	unsigned int len;
> +	void *ptr;
> +
> +	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		if (likely(!is_xdp_frame(ptr))) {


So I think we need to drop the likely here because this function can also be
called from the skb path.

Otherwise looks good.

Thanks


> +			struct sk_buff *skb = ptr;
> +
> +			pr_debug("Sent skb %p\n", skb);
> +
> +			stats->bytes += skb->len;
> +			napi_consume_skb(skb, in_napi);
> +		} else {
> +			struct xdp_frame *frame = ptr_to_xdp(ptr);
> +
> +			stats->bytes += frame->len;
> +			xdp_return_frame(frame);
> +		}
> +		stats->packets++;
> +	}
> +}
> +
>   /* Converting between virtqueue no. and kernel tx/rx queue no.
>    * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>    */
> @@ -525,15 +549,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   			    int n, struct xdp_frame **frames, u32 flags)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
> +	struct virtnet_sq_stats stats = {};
>   	struct receive_queue *rq = vi->rq;
>   	struct bpf_prog *xdp_prog;
>   	struct send_queue *sq;
> -	unsigned int len;
> -	int packets = 0;
> -	int bytes = 0;
>   	int nxmit = 0;
>   	int kicks = 0;
> -	void *ptr;
>   	int ret;
>   	int i;
>   
> @@ -552,20 +573,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   	}
>   
>   	/* Free up any pending old buffers before queueing new ones. */
> -	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (likely(is_xdp_frame(ptr))) {
> -			struct xdp_frame *frame = ptr_to_xdp(ptr);
> -
> -			bytes += frame->len;
> -			xdp_return_frame(frame);
> -		} else {
> -			struct sk_buff *skb = ptr;
> -
> -			bytes += skb->len;
> -			napi_consume_skb(skb, false);
> -		}
> -		packets++;
> -	}
> +	__free_old_xmit(sq, false, &stats);
>   
>   	for (i = 0; i < n; i++) {
>   		struct xdp_frame *xdpf = frames[i];
> @@ -582,8 +590,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   	}
>   out:
>   	u64_stats_update_begin(&sq->stats.syncp);
> -	sq->stats.bytes += bytes;
> -	sq->stats.packets += packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.packets += stats.packets;
>   	sq->stats.xdp_tx += n;
>   	sq->stats.xdp_tx_drops += n - nxmit;
>   	sq->stats.kicks += kicks;
> @@ -1406,39 +1414,21 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>   	return stats.packets;
>   }
>   
> -static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
> +static void free_old_xmit(struct send_queue *sq, bool in_napi)
>   {
> -	unsigned int len;
> -	unsigned int packets = 0;
> -	unsigned int bytes = 0;
> -	void *ptr;
> -
> -	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (likely(!is_xdp_frame(ptr))) {
> -			struct sk_buff *skb = ptr;
> -
> -			pr_debug("Sent skb %p\n", skb);
> +	struct virtnet_sq_stats stats = {};
>   
> -			bytes += skb->len;
> -			napi_consume_skb(skb, in_napi);
> -		} else {
> -			struct xdp_frame *frame = ptr_to_xdp(ptr);
> -
> -			bytes += frame->len;
> -			xdp_return_frame(frame);
> -		}
> -		packets++;
> -	}
> +	__free_old_xmit(sq, in_napi, &stats);
>   
>   	/* Avoid overhead when no packets have been processed
>   	 * happens when called speculatively from start_xmit.
>   	 */
> -	if (!packets)
> +	if (!stats.packets)
>   		return;
>   
>   	u64_stats_update_begin(&sq->stats.syncp);
> -	sq->stats.bytes += bytes;
> -	sq->stats.packets += packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.packets += stats.packets;
>   	u64_stats_update_end(&sq->stats.syncp);
>   }
>   
> @@ -1463,7 +1453,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>   		return;
>   
>   	if (__netif_tx_trylock(txq)) {
> -		free_old_xmit_skbs(sq, true);
> +		free_old_xmit(sq, true);
>   		__netif_tx_unlock(txq);
>   	}
>   
> @@ -1548,7 +1538,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   
>   	txq = netdev_get_tx_queue(vi->dev, index);
>   	__netif_tx_lock(txq, raw_smp_processor_id());
> -	free_old_xmit_skbs(sq, true);
> +	free_old_xmit(sq, true);
>   	__netif_tx_unlock(txq);
>   
>   	virtqueue_napi_complete(napi, sq->vq, 0);
> @@ -1617,7 +1607,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>   	bool use_napi = sq->napi.weight;
>   
>   	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(sq, false);
> +	free_old_xmit(sq, false);
>   
>   	if (use_napi && kick)
>   		virtqueue_enable_cb_delayed(sq->vq);
> @@ -1661,7 +1651,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>   		if (!use_napi &&
>   		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>   			/* More just got used, free them then recheck. */
> -			free_old_xmit_skbs(sq, false);
> +			free_old_xmit(sq, false);
>   			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>   				netif_start_subqueue(dev, qnum);
>   				virtqueue_disable_cb(sq->vq);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 07/10] virtio-net: virtnet_poll_tx support budget check
  2021-04-13  3:15 ` [PATCH net-next v4 07/10] virtio-net: virtnet_poll_tx support budget check Xuan Zhuo
@ 2021-04-14  3:34   ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-04-14  3:34 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Michael S. Tsirkin, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li


On 2021/4/13 11:15 AM, Xuan Zhuo wrote:
> virtnet_poll_tx() now checks the work done against the budget, like other
> network card drivers do.
>
> When work < budget, napi_poll() in dev.c will exit directly, and
> virtqueue_napi_complete() will be called to close napi. If closing napi
> fails or there is still data to be processed, virtqueue_napi_complete()
> will make napi schedule again, so there is no conflict with the logic of
> napi_poll().
>
> When work == budget, virtnet_poll_tx() will return 'work', and
> napi_poll() in dev.c will re-add napi to the queue.
>
> The purpose of this patch is to support xsk xmit in virtnet_poll_tx in a
> subsequent patch.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/net/virtio_net.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f3752b254965..f52a25091322 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1529,6 +1529,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   	struct virtnet_info *vi = sq->vq->vdev->priv;
>   	unsigned int index = vq2txq(sq->vq);
>   	struct netdev_queue *txq;
> +	int work_done = 0;
>   
>   	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
>   		/* We don't need to enable cb for XDP */
> @@ -1541,12 +1542,13 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   	free_old_xmit(sq, true);
>   	__netif_tx_unlock(txq);
>   
> -	virtqueue_napi_complete(napi, sq->vq, 0);
> +	if (work_done < budget)
> +		virtqueue_napi_complete(napi, sq->vq, 0);
>   
>   	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>   		netif_tx_wake_queue(txq);
>   
> -	return 0;
> +	return work_done;
>   }
>   
>   static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup
  2021-04-13  3:15 ` [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup Xuan Zhuo
@ 2021-04-14  4:01   ` Jason Wang
  2021-04-14  7:36   ` Magnus Karlsson
  1 sibling, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-04-14  4:01 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Michael S. Tsirkin, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li


On 2021/4/13 11:15 AM, Xuan Zhuo wrote:
> xsk is a high-performance packet receiving and sending technology.
>
> This patch implements the binding and unbinding operations of xsk and
> the virtio-net queue for xsk zero copy xmit.
>
> The xsk zero copy xmit depends on tx napi.


It's better to describe why zero copy depends on tx napi.


>   So if tx napi is not enabled,
> an error will be reported. And the entire operation is under the
> protection of rtnl_lock.
>
> If xsk is active, it will prevent ethtool from modifying tx napi.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio_net.c | 78 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 77 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f52a25091322..8242a9e9f17d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -22,6 +22,7 @@
>   #include <net/route.h>
>   #include <net/xdp.h>
>   #include <net/net_failover.h>
> +#include <net/xdp_sock_drv.h>
>   
>   static int napi_weight = NAPI_POLL_WEIGHT;
>   module_param(napi_weight, int, 0444);
> @@ -133,6 +134,11 @@ struct send_queue {
>   	struct virtnet_sq_stats stats;
>   
>   	struct napi_struct napi;
> +
> +	struct {
> +		/* xsk pool */


This comment is unnecessary since the code explains itself.


> +		struct xsk_buff_pool __rcu *pool;
> +	} xsk;
>   };
>   
>   /* Internal representation of a receive virtqueue */
> @@ -2249,8 +2255,19 @@ static int virtnet_set_coalesce(struct net_device *dev,
>   	if (napi_weight ^ vi->sq[0].napi.weight) {
>   		if (dev->flags & IFF_UP)
>   			return -EBUSY;
> -		for (i = 0; i < vi->max_queue_pairs; i++)
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			/* xsk xmit depend on the tx napi. So if xsk is active,
> +			 * prevent modifications to tx napi.
> +			 */
> +			rcu_read_lock();
> +			if (rcu_dereference(vi->sq[i].xsk.pool)) {


Let's use rtnl_dereference(), then the rcu_read_lock()/unlock() is not
needed.
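
Something along these lines (just an untested sketch; it assumes this path
always runs under rtnl_lock, as the ethtool set_coalesce path does):

	for (i = 0; i < vi->max_queue_pairs; i++) {
		/* xsk xmit depends on tx napi, so skip queues that
		 * currently have an xsk pool attached.
		 */
		if (rtnl_dereference(vi->sq[i].xsk.pool))
			continue;

		vi->sq[i].napi.weight = napi_weight;
	}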


> +				rcu_read_unlock();
> +				continue;
> +			}
> +			rcu_read_unlock();
> +
>   			vi->sq[i].napi.weight = napi_weight;
> +		}
>   	}
>   
>   	return 0;
> @@ -2518,11 +2535,70 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>   	return err;
>   }
>   
> +static int virtnet_xsk_pool_enable(struct net_device *dev,
> +				   struct xsk_buff_pool *pool,
> +				   u16 qid)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct send_queue *sq;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	/* xsk zerocopy depend on the tx napi.
> +	 *
> +	 * xsk zerocopy xmit is driven by the tx interrupt. When the device is
> +	 * not busy, napi will be called continuously to send data. When the
> +	 * device is busy, wait for the notification interrupt after the
> +	 * hardware has finished processing the data, and continue to send data
> +	 * in napi.
> +	 */
> +	if (!sq->napi.weight)
> +		return -EPERM;
> +
> +	rcu_read_lock();
> +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +	 * safe.
> +	 */
> +	rcu_assign_pointer(sq->xsk.pool, pool);
> +	rcu_read_unlock();


Any reason for the rcu lock here? And don't we need to synchronize rcu here?


> +
> +	return 0;
> +}
> +
> +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct send_queue *sq;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +	 * safe.
> +	 */
> +	rcu_assign_pointer(sq->xsk.pool, NULL);
> +
> +	synchronize_net(); /* Sync with the XSK wakeup and with NAPI. */


Let's move the comment above the code.

Thanks


> +
> +	return 0;
> +}
> +
>   static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   {
>   	switch (xdp->command) {
>   	case XDP_SETUP_PROG:
>   		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
> +	case XDP_SETUP_XSK_POOL:
> +		if (xdp->xsk.pool)
> +			return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
> +						       xdp->xsk.queue_id);
> +		else
> +			return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
>   	default:
>   		return -EINVAL;
>   	}


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit
  2021-04-13  3:15 ` [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit Xuan Zhuo
@ 2021-04-14  5:46   ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-04-14  5:46 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Michael S. Tsirkin, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li


On 2021/4/13 11:15 AM, Xuan Zhuo wrote:
> This patch implements the core part of xsk zerocopy xmit.
>
> When the user calls sendto to consume the data in the xsk tx queue,
> virtnet_xsk_wakeup() will be called.
>
> In wakeup, it will try to send part of the data directly. There are
> two purposes for doing this:
>
> 1. Send part of the data quickly to reduce the transmission delay of the
>     first packet.
> 2. Trigger the tx interrupt and start napi to consume the xsk tx data.
>
> All xsk packets sent share the same virtio-net header, xsk_hdr. If xsk
> needs to support csum and other features later, consider assigning a
> separate xsk hdr for each sent packet.
>
> There are now three situations in free_old_xmit(): skb, xdp frame, xsk
> desc. Based on the last two bits of the ptr returned by virtqueue_get_buf():
>      00 is skb by default.
>      01 represents the packet sent by xdp
>      10 is the packet sent by xsk
>
> If the xsk xmit work has not been completed but the ring is full, napi
> must first exit and wait for the ring to become available, so
> need_wakeup is set. If free_old_xmit() is called by start_xmit() first,
> we can quickly wake up napi to execute the xsk xmit task.
>
> When recycling, we need to count the number of bytes sent, so we put xsk
> desc->len into the ptr pointer, because for xsk the ptr does not point to
> a meaningful object.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio_net.c | 296 ++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 292 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 8242a9e9f17d..c441d6bf1510 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -46,6 +46,11 @@ module_param(napi_tx, bool, 0644);
>   #define VIRTIO_XDP_REDIR	BIT(1)
>   
>   #define VIRTIO_XDP_FLAG	BIT(0)
> +#define VIRTIO_XSK_FLAG	BIT(1)
> +
> +#define VIRTIO_XSK_PTR_SHIFT       4
> +
> +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
>   
>   /* RX packet size EWMA. The average packet size is used to determine the packet
>    * buffer size when refilling RX rings. As the entire RX ring may be refilled
> @@ -138,6 +143,12 @@ struct send_queue {
>   	struct {
>   		/* xsk pool */
>   		struct xsk_buff_pool __rcu *pool;
> +
> +		/* save the desc for next xmit, when xmit fail. */
> +		struct xdp_desc last_desc;


As replied in the previous version, this looks tricky. I think we need to
make sure to reserve some slots, as the skb path does.

This looks exactly like what stmmac does, which also shares XDP and skb
for the same ring.


> +
> +		/* xsk wait for tx inter or softirq */
> +		bool need_wakeup;
>   	} xsk;
>   };
>   
> @@ -255,6 +266,15 @@ struct padded_vnet_hdr {
>   	char padding[4];
>   };
>   
> +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			   int budget, bool in_napi);
> +static void virtnet_xsk_complete(struct send_queue *sq, u32 num);
> +
> +static bool is_skb_ptr(void *ptr)
> +{
> +	return !((unsigned long)ptr & (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG));
> +}
> +
>   static bool is_xdp_frame(void *ptr)
>   {
>   	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> @@ -265,6 +285,19 @@ static void *xdp_to_ptr(struct xdp_frame *ptr)
>   	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
>   }
>   
> +static void *xsk_to_ptr(struct xdp_desc *desc)
> +{
> +	/* save the desc len to ptr */
> +	u64 p = desc->len << VIRTIO_XSK_PTR_SHIFT;
> +
> +	return (void *)((unsigned long)p | VIRTIO_XSK_FLAG);
> +}
> +
> +static void ptr_to_xsk(void *ptr, struct xdp_desc *desc)
> +{
> +	desc->len = ((unsigned long)ptr) >> VIRTIO_XSK_PTR_SHIFT;
> +}
> +
>   static struct xdp_frame *ptr_to_xdp(void *ptr)
>   {
>   	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> @@ -273,25 +306,35 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   static void __free_old_xmit(struct send_queue *sq, bool in_napi,
>   			    struct virtnet_sq_stats *stats)
>   {
> +	unsigned int xsknum = 0;
>   	unsigned int len;
>   	void *ptr;
>   
>   	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (likely(!is_xdp_frame(ptr))) {
> +		if (is_skb_ptr(ptr)) {
>   			struct sk_buff *skb = ptr;
>   
>   			pr_debug("Sent skb %p\n", skb);
>   
>   			stats->bytes += skb->len;
>   			napi_consume_skb(skb, in_napi);
> -		} else {
> +		} else if (is_xdp_frame(ptr)) {
>   			struct xdp_frame *frame = ptr_to_xdp(ptr);
>   
>   			stats->bytes += frame->len;
>   			xdp_return_frame(frame);
> +		} else {
> +			struct xdp_desc desc;
> +
> +			ptr_to_xsk(ptr, &desc);
> +			stats->bytes += desc.len;
> +			++xsknum;
>   		}
>   		stats->packets++;
>   	}
> +
> +	if (xsknum)
> +		virtnet_xsk_complete(sq, xsknum);
>   }
>   
>   /* Converting between virtqueue no. and kernel tx/rx queue no.
> @@ -1529,6 +1572,19 @@ static int virtnet_open(struct net_device *dev)
>   	return 0;
>   }
>   
> +static int virtnet_poll_xsk(struct send_queue *sq, int budget)
> +{
> +	struct xsk_buff_pool *pool;
> +	int work_done = 0;
> +
> +	rcu_read_lock();
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (pool)
> +		work_done = virtnet_xsk_run(sq, pool, budget, true);
> +	rcu_read_unlock();
> +	return work_done;
> +}
> +
>   static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   {
>   	struct send_queue *sq = container_of(napi, struct send_queue, napi);
> @@ -1545,6 +1601,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>   
>   	txq = netdev_get_tx_queue(vi->dev, index);
>   	__netif_tx_lock(txq, raw_smp_processor_id());
> +	work_done += virtnet_poll_xsk(sq, budget);
>   	free_old_xmit(sq, true);
>   	__netif_tx_unlock(txq);
>   
> @@ -2535,6 +2592,234 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>   	return err;
>   }
>   
> +static void virtnet_xsk_check_queue(struct send_queue *sq)
> +{
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct net_device *dev = vi->dev;
> +	int qnum = sq - vi->sq;
> +
> +	/* If this sq is not the exclusive queue of the current cpu,
> +	 * then it may be called by start_xmit, so check it running out
> +	 * of space.
> +	 *


I think it's better to move this check after is_xdp_raw_buffer_queue().


> +	 * And if it is a raw buffer queue, it does not check whether the status
> +	 * of the queue is stopped when sending. So there is no need to check
> +	 * the situation of the raw buffer queue.
> +	 */
> +	if (is_xdp_raw_buffer_queue(vi, qnum))
> +		return;
> +
> +	/* Stop the queue to avoid getting packets that we are
> +	 * then unable to transmit. Then wait the tx interrupt.
> +	 */
> +	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> +		netif_stop_subqueue(dev, qnum);


Is there any way to stop xsk TX here?


> +}
> +
> +static void virtnet_xsk_complete(struct send_queue *sq, u32 num)
> +{
> +	struct xsk_buff_pool *pool;
> +
> +	rcu_read_lock();
> +
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (!pool) {
> +		rcu_read_unlock();
> +		return;
> +	}
> +	xsk_tx_completed(pool, num);
> +	rcu_read_unlock();
> +
> +	if (sq->xsk.need_wakeup) {
> +		sq->xsk.need_wakeup = false;
> +		virtqueue_napi_schedule(&sq->napi, sq->vq);
> +	}
> +}
> +
> +static int virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			    struct xdp_desc *desc)
> +{
> +	struct virtnet_info *vi;
> +	u32 offset, n, len;
> +	struct page *page;
> +	void *data;
> +	u64 addr;
> +	int err;
> +
> +	vi = sq->vq->vdev->priv;
> +	addr = desc->addr;
> +
> +	data = xsk_buff_raw_get_data(pool, addr);
> +	offset = offset_in_page(data);
> +
> +	/* xsk unaligned mode, desc may use two pages */
> +	if (desc->len > PAGE_SIZE - offset)
> +		n = 3;
> +	else
> +		n = 2;
> +
> +	sg_init_table(sq->sg, n);
> +	sg_set_buf(sq->sg, &xsk_hdr, vi->hdr_len);
> +
> +	/* handle for xsk first page */
> +	len = min_t(int, desc->len, PAGE_SIZE - offset);
> +	page = xsk_buff_xdp_get_page(pool, addr);
> +	sg_set_page(sq->sg + 1, page, len, offset);
> +
> +	/* xsk unaligned mode, handle for the second page */
> +	if (len < desc->len) {
> +		page = xsk_buff_xdp_get_page(pool, addr + len);
> +		len = min_t(int, desc->len - len, PAGE_SIZE);
> +		sg_set_page(sq->sg + 2, page, len, 0);
> +	}
> +
> +	err = virtqueue_add_outbuf(sq->vq, sq->sg, n, xsk_to_ptr(desc),
> +				   GFP_ATOMIC);
> +	if (unlikely(err))
> +		sq->xsk.last_desc = *desc;
> +
> +	return err;
> +}
> +
> +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> +				  struct xsk_buff_pool *pool,
> +				  unsigned int budget,
> +				  bool in_napi, int *done,
> +				  struct virtnet_sq_stats *stats)
> +{
> +	struct xdp_desc desc;
> +	int err, packet = 0;
> +	int ret = -EAGAIN;
> +
> +	if (sq->xsk.last_desc.addr) {
> +		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> +			return -EBUSY;
> +
> +		err = virtnet_xsk_xmit(sq, pool, &sq->xsk.last_desc);
> +		if (unlikely(err))
> +			return -EBUSY;
> +
> +		++packet;
> +		--budget;
> +		sq->xsk.last_desc.addr = 0;


So I think we don't need to do this: since we always try to reserve 2 +
MAX_SKB_FRAGS, a failure here means we got -EIO/-ENOMEM, which basically
indicates a broken device or dma map.


> +	}
> +
> +	while (budget-- > 0) {
> +		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
> +			ret = -EBUSY;
> +			break;
> +		}
> +
> +		if (!xsk_tx_peek_desc(pool, &desc)) {
> +			/* done */
> +			ret = 0;
> +			break;
> +		}
> +
> +		err = virtnet_xsk_xmit(sq, pool, &desc);
> +		if (unlikely(err)) {
> +			ret = -EBUSY;


Since the function will be called by NAPI, I think we need to report the
number of packets that are transmitted as well.


> +			break;
> +		}
> +
> +		++packet;
> +	}
> +
> +	if (packet) {
> +		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> +			++stats->kicks;
> +
> +		*done = packet;
> +		stats->xdp_tx += packet;
> +
> +		xsk_tx_release(pool);
> +	}
> +
> +	return ret;
> +}
> +
> +static int virtnet_xsk_run(struct send_queue *sq, struct xsk_buff_pool *pool,
> +			   int budget, bool in_napi)
> +{
> +	struct virtnet_sq_stats stats = {};
> +	int done = 0;
> +	int err;
> +
> +	sq->xsk.need_wakeup = false;
> +	__free_old_xmit(sq, in_napi, &stats);
> +
> +	/* return err:
> +	 * -EAGAIN: done == budget
> +	 * -EBUSY:  done < budget
> +	 *  0    :  done < budget
> +	 */
> +	err = virtnet_xsk_xmit_batch(sq, pool, budget, in_napi, &done, &stats);
> +	if (err == -EBUSY) {
> +		__free_old_xmit(sq, in_napi, &stats);
> +
> +		/* If the space is enough, let napi run again. */
> +		if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
> +			done = budget;


Why do you need to run NAPI instead of netif_tx_wake()?


> +		else
> +			sq->xsk.need_wakeup = true;


So done is 0, is this intended?


> +	}
> +
> +	virtnet_xsk_check_queue(sq);
> +
> +	u64_stats_update_begin(&sq->stats.syncp);
> +	sq->stats.packets += stats.packets;
> +	sq->stats.bytes += stats.bytes;
> +	sq->stats.kicks += stats.kicks;
> +	sq->stats.xdp_tx += stats.xdp_tx;
> +	u64_stats_update_end(&sq->stats.syncp);
> +
> +	return done;
> +}
> +
> +static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct xsk_buff_pool *pool;
> +	struct netdev_queue *txq;
> +	struct send_queue *sq;
> +
> +	if (!netif_running(dev))
> +		return -ENETDOWN;
> +
> +	if (qid >= vi->curr_queue_pairs)
> +		return -EINVAL;
> +
> +	sq = &vi->sq[qid];
> +
> +	rcu_read_lock();
> +
> +	pool = rcu_dereference(sq->xsk.pool);
> +	if (!pool)
> +		goto end;
> +
> +	if (napi_if_scheduled_mark_missed(&sq->napi))
> +		goto end;
> +
> +	txq = netdev_get_tx_queue(dev, qid);
> +
> +	__netif_tx_lock_bh(txq);
> +
> +	/* Send part of the packet directly to reduce the delay in sending the
> +	 * packet, and this can actively trigger the tx interrupts.
> +	 *
> +	 * If no packet is sent out, the ring of the device is full. In this
> +	 * case, we will still get a tx interrupt response. Then we will deal
> +	 * with the subsequent packet sending work.


So stmmac schedules NAPI here; do you have perf numbers for this improvement?

Thanks


> +	 */
> +	virtnet_xsk_run(sq, pool, napi_weight, false);
> +
> +	__netif_tx_unlock_bh(txq);
> +
> +end:
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
>   static int virtnet_xsk_pool_enable(struct net_device *dev,
>   				   struct xsk_buff_pool *pool,
>   				   u16 qid)
> @@ -2559,6 +2844,8 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
>   		return -EPERM;
>   
>   	rcu_read_lock();
> +	memset(&sq->xsk, 0, sizeof(sq->xsk));
> +
>   	/* Here is already protected by rtnl_lock, so rcu_assign_pointer is
>   	 * safe.
>   	 */
> @@ -2658,6 +2945,7 @@ static const struct net_device_ops virtnet_netdev = {
>   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
>   	.ndo_bpf		= virtnet_xdp,
>   	.ndo_xdp_xmit		= virtnet_xdp_xmit,
> +	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
>   	.ndo_features_check	= passthru_features_check,
>   	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
>   	.ndo_set_features	= virtnet_set_features,
> @@ -2761,9 +3049,9 @@ static void free_unused_bufs(struct virtnet_info *vi)
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		struct virtqueue *vq = vi->sq[i].vq;
>   		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (!is_xdp_frame(buf))
> +			if (is_skb_ptr(buf))
>   				dev_kfree_skb(buf);
> -			else
> +			else if (is_xdp_frame(buf))
>   				xdp_return_frame(ptr_to_xdp(buf));
>   		}
>   	}


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 10/10] virtio-net: xsk zero copy xmit kick by threshold
  2021-04-13  3:15 ` [PATCH net-next v4 10/10] virtio-net: xsk zero copy xmit kick by threshold Xuan Zhuo
@ 2021-04-14  5:51   ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-04-14  5:51 UTC (permalink / raw)
  To: Xuan Zhuo, netdev
  Cc: Michael S. Tsirkin, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, virtualization, bpf, dust . li


On 2021/4/13 11:15 AM, Xuan Zhuo wrote:
> Testing shows that the performance of calling kick for every packet is
> not stable. And kicking only once after all the packets have been sent
> does not perform well either. So add a module parameter to specify how
> many packets to send before calling a kick.
>
> 8 is a relatively stable value with the best performance.
>
> Here is the pps measured with xsk_kick_thr set to different values (from
> 1 to 64).
>
> thr  PPS             thr PPS             thr PPS
> 1    2924116.74247 | 23  3683263.04348 | 45  2777907.22963
> 2    3441010.57191 | 24  3078880.13043 | 46  2781376.21739
> 3    3636728.72378 | 25  2859219.57656 | 47  2777271.91304
> 4    3637518.61468 | 26  2851557.9593  | 48  2800320.56575
> 5    3651738.16251 | 27  2834783.54408 | 49  2813039.87599
> 6    3652176.69231 | 28  2847012.41472 | 50  3445143.01839
> 7    3665415.80602 | 29  2860633.91304 | 51  3666918.01281
> 8    3665045.16555 | 30  2857903.5786  | 52  3059929.2709
> 9    3671023.2401  | 31  2835589.98963 | 53  2831515.21739
> 10   3669532.23274 | 32  2862827.88706 | 54  3451804.07204
> 11   3666160.37749 | 33  2871855.96696 | 55  3654975.92385
> 12   3674951.44813 | 34  3434456.44816 | 56  3676198.3188
> 13   3667447.57331 | 35  3656918.54177 | 57  3684740.85619
> 14   3018846.0503  | 36  3596921.16722 | 58  3060958.8594
> 15   2792773.84505 | 37  3603460.63903 | 59  2828874.57191
> 16   3430596.3602  | 38  3595410.87666 | 60  3459926.11027
> 17   3660525.85806 | 39  3604250.17819 | 61  3685444.47599
> 18   3045627.69054 | 40  3596542.28428 | 62  3049959.0809
> 19   2841542.94177 | 41  3600705.16054 | 63  2806280.04013
> 20   2830475.97348 | 42  3019833.71191 | 64  3448494.3913
> 21   2845655.55789 | 43  2752951.93264 |
> 22   3450389.84365 | 44  2753107.27164 |
>
> The results show that when the value of xsk_kick_thr is relatively small,
> the performance is not good, and when its value is greater than 13, the
> performance becomes more irregular and unstable. The results look similar
> from 3 to 13, so I chose 8 as the default value.
>
> The test environment is qemu + vhost-net. I modified vhost-net to directly
> drop the packets sent by the vm, so that the cpu of the vm can run higher.
> Otherwise, the cpu usage of the processes in the vm and of softirqd is too
> low, and there is no obvious difference in the test data.
>
> During the test, the softirq cpu reached 100%. Each xsk_kick_thr value was
> run for 300s, the pps of every second was recorded, and the average pps
> was taken. The vhost process cpu on the host also reached 100%.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>   drivers/net/virtio_net.c | 19 +++++++++++++++++--
>   1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c441d6bf1510..4e360bfc2cf0 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -28,9 +28,11 @@ static int napi_weight = NAPI_POLL_WEIGHT;
>   module_param(napi_weight, int, 0444);
>   
>   static bool csum = true, gso = true, napi_tx = true;
> +static int xsk_kick_thr = 8;
>   module_param(csum, bool, 0444);
>   module_param(gso, bool, 0444);
>   module_param(napi_tx, bool, 0644);
> +module_param(xsk_kick_thr, int, 0644);
>   
>   /* FIXME: MTU in config. */
>   #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> @@ -2690,6 +2692,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   	struct xdp_desc desc;
>   	int err, packet = 0;
>   	int ret = -EAGAIN;
> +	int need_kick = 0;
>   
>   	if (sq->xsk.last_desc.addr) {
>   		if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> @@ -2700,6 +2703,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   			return -EBUSY;
>   
>   		++packet;
> +		++need_kick;
>   		--budget;
>   		sq->xsk.last_desc.addr = 0;
>   	}
> @@ -2723,11 +2727,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq,
>   		}
>   
>   		++packet;
> +		++need_kick;
> +		if (need_kick > xsk_kick_thr) {
> +			if (virtqueue_kick_prepare(sq->vq) &&
> +			    virtqueue_notify(sq->vq))


I wonder whether it's time to introduce a helper in the virtio core to do
this.
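
Something like the below, perhaps (a rough, untested sketch; neither this
helper nor its name exists in the virtio core today):

	/* Kick the device and report whether a notification was actually
	 * sent, so callers such as virtio-net can keep kick statistics.
	 */
	static inline bool virtqueue_kick_and_report(struct virtqueue *vq)
	{
		return virtqueue_kick_prepare(vq) && virtqueue_notify(vq);
	}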

Thanks


> +				++stats->kicks;
> +
> +			need_kick = 0;
> +		}
>   	}
>   
>   	if (packet) {
> -		if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> -			++stats->kicks;
> +		if (need_kick) {
> +			if (virtqueue_kick_prepare(sq->vq) &&
> +			    virtqueue_notify(sq->vq))
> +				++stats->kicks;
> +		}
>   
>   		*done = packet;
>   		stats->xdp_tx += packet;


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 04/10] xsk: support get page by addr
  2021-04-13  3:15 ` [PATCH net-next v4 04/10] xsk: support get page by addr Xuan Zhuo
@ 2021-04-14  6:59   ` Magnus Karlsson
  0 siblings, 0 replies; 21+ messages in thread
From: Magnus Karlsson @ 2021-04-14  6:59 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Network Development, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	virtualization, bpf, dust . li

On Tue, Apr 13, 2021 at 9:58 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> xsk adds an interface that returns the page corresponding to the data
> addr. virtio-net does not initialize dma, so it needs the page to
> construct a scatterlist to pass to the vring.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> ---
>  include/net/xdp_sock_drv.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 4e295541e396..1d08b5d8d15f 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -72,6 +72,12 @@ static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp)
>         return xp_get_frame_dma(xskb);
>  }
>
> +static inline struct page *xsk_buff_xdp_get_page(struct xsk_buff_pool *pool, u64 addr)
> +{
> +       addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
> +       return pool->umem->pgs[addr >> PAGE_SHIFT];
> +}
> +
>  static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool)
>  {
>         return xp_alloc(pool);
> @@ -207,6 +213,11 @@ static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp)
>         return 0;
>  }
>
> +static inline struct page *xsk_buff_xdp_get_page(struct xsk_buff_pool *pool, u64 addr)
> +{
> +       return NULL;
> +}
> +
>  static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool)
>  {
>         return NULL;
> --
> 2.31.0
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 05/10] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
  2021-04-13  3:15 ` [PATCH net-next v4 05/10] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR Xuan Zhuo
@ 2021-04-14  7:00   ` Magnus Karlsson
  0 siblings, 0 replies; 21+ messages in thread
From: Magnus Karlsson @ 2021-04-14  7:00 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Network Development, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	virtualization, bpf, dust . li

On Tue, Apr 13, 2021 at 9:58 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> Some devices, such as virtio-net, do not directly use the dma addr. These
> devices do not initialize dma as part of the xsk setup, so the dma check
> is skipped here.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> ---
>  net/xdp/xsk_buff_pool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 8de01aaac4a0..a7e434de0308 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -171,7 +171,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>         if (err)
>                 goto err_unreg_pool;
>
> -       if (!pool->dma_pages) {
> +       if (!(netdev->priv_flags & IFF_NOT_USE_DMA_ADDR) && !pool->dma_pages) {
>                 WARN(1, "Driver did not DMA map zero-copy buffers");
>                 err = -EINVAL;
>                 goto err_unreg_xsk;
> --
> 2.31.0
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup
  2021-04-13  3:15 ` [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup Xuan Zhuo
  2021-04-14  4:01   ` Jason Wang
@ 2021-04-14  7:36   ` Magnus Karlsson
  2021-04-14  7:51     ` Jason Wang
  1 sibling, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2021-04-14  7:36 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Network Development, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	virtualization, bpf, dust . li

On Tue, Apr 13, 2021 at 9:58 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> xsk is a high-performance packet receiving and sending technology.
>
> This patch implements the binding and unbinding operations of xsk and
> the virtio-net queue for xsk zero copy xmit.
>
> The xsk zero copy xmit depends on tx napi. So if tx napi is not enabled,
> an error will be reported. And the entire operation is under the
> protection of rtnl_lock.
>
> If xsk is active, it will prevent ethtool from modifying tx napi.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>  drivers/net/virtio_net.c | 78 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 77 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f52a25091322..8242a9e9f17d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -22,6 +22,7 @@
>  #include <net/route.h>
>  #include <net/xdp.h>
>  #include <net/net_failover.h>
> +#include <net/xdp_sock_drv.h>
>
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -133,6 +134,11 @@ struct send_queue {
>         struct virtnet_sq_stats stats;
>
>         struct napi_struct napi;
> +
> +       struct {
> +               /* xsk pool */
> +               struct xsk_buff_pool __rcu *pool;
> +       } xsk;
>  };
>
>  /* Internal representation of a receive virtqueue */
> @@ -2249,8 +2255,19 @@ static int virtnet_set_coalesce(struct net_device *dev,
>         if (napi_weight ^ vi->sq[0].napi.weight) {
>                 if (dev->flags & IFF_UP)
>                         return -EBUSY;
> -               for (i = 0; i < vi->max_queue_pairs; i++)
> +               for (i = 0; i < vi->max_queue_pairs; i++) {
> +                       /* xsk xmit depend on the tx napi. So if xsk is active,
> +                        * prevent modifications to tx napi.
> +                        */
> +                       rcu_read_lock();
> +                       if (rcu_dereference(vi->sq[i].xsk.pool)) {
> +                               rcu_read_unlock();
> +                               continue;
> +                       }
> +                       rcu_read_unlock();
> +
>                         vi->sq[i].napi.weight = napi_weight;
> +               }
>         }
>
>         return 0;
> @@ -2518,11 +2535,70 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>         return err;
>  }
>
> +static int virtnet_xsk_pool_enable(struct net_device *dev,
> +                                  struct xsk_buff_pool *pool,
> +                                  u16 qid)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       struct send_queue *sq;
> +
> +       if (qid >= vi->curr_queue_pairs)
> +               return -EINVAL;

Your implementation is the first implementation that only supports
zerocopy for one out of Rx and Tx, and this will currently confuse the
control plane in some situations since it assumes that both Rx and Tx
are enabled by a call to this NDO. For example: user space creates an
xsk socket with both an Rx and a Tx ring, then calls bind with the
XDP_ZEROCOPY flag. In this case, the call should fail if the device is
virtio-net since it only supports zerocopy for Tx. But it should
succeed if the user only created a Tx ring since that makes it a
Tx-only socket which can be supported.

So you need to introduce a new interface in xdp_sock_drv.h that can be
used to ask if this socket has Rx enabled and if so fail the call (at
least one of them has to be enabled, otherwise the bind call would
fail before this ndo is called). Then the logic above will act on that
and try to fall back to copy mode (if allowed). Such an interface
(with an added "is_tx_enabled") might in the future be useful for
physical NIC drivers too if they would like to save on resources for
Tx-only and Rx-only sockets. Currently, they all just assume every
socket is Rx and Tx.
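
For illustration, the driver side could then do something like the below
(only a sketch; xsk_pool_rx_enabled() is a hypothetical helper that would
first have to be added to xdp_sock_drv.h along the lines described above):

	/* Only Tx zerocopy is implemented, so refuse pools whose socket
	 * also has an Rx ring and let the core fall back to copy mode
	 * where that is allowed.
	 */
	if (xsk_pool_rx_enabled(pool))
		return -EOPNOTSUPP;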

Thanks: Magnus

> +
> +       sq = &vi->sq[qid];
> +
> +       /* xsk zerocopy depend on the tx napi.
> +        *
> +        * xsk zerocopy xmit is driven by the tx interrupt. When the device is
> +        * not busy, napi will be called continuously to send data. When the
> +        * device is busy, wait for the notification interrupt after the
> +        * hardware has finished processing the data, and continue to send data
> +        * in napi.
> +        */
> +       if (!sq->napi.weight)
> +               return -EPERM;
> +
> +       rcu_read_lock();
> +       /* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +        * safe.
> +        */
> +       rcu_assign_pointer(sq->xsk.pool, pool);
> +       rcu_read_unlock();
> +
> +       return 0;
> +}
> +
> +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       struct send_queue *sq;
> +
> +       if (qid >= vi->curr_queue_pairs)
> +               return -EINVAL;
> +
> +       sq = &vi->sq[qid];
> +
> +       /* Here is already protected by rtnl_lock, so rcu_assign_pointer is
> +        * safe.
> +        */
> +       rcu_assign_pointer(sq->xsk.pool, NULL);
> +
> +       synchronize_net(); /* Sync with the XSK wakeup and with NAPI. */
> +
> +       return 0;
> +}
> +
>  static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  {
>         switch (xdp->command) {
>         case XDP_SETUP_PROG:
>                 return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
> +       case XDP_SETUP_XSK_POOL:
> +               if (xdp->xsk.pool)
> +                       return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
> +                                                      xdp->xsk.queue_id);
> +               else
> +                       return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
>         default:
>                 return -EINVAL;
>         }
> --
> 2.31.0
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup
  2021-04-14  7:36   ` Magnus Karlsson
@ 2021-04-14  7:51     ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-04-14  7:51 UTC (permalink / raw)
  To: Magnus Karlsson, Xuan Zhuo
  Cc: Network Development, Michael S. Tsirkin, David S. Miller,
	Jakub Kicinski, Björn Töpel, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf,
	dust . li


On 2021/4/14 3:36 PM, Magnus Karlsson wrote:
> On Tue, Apr 13, 2021 at 9:58 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>> xsk is a high-performance packet receiving and sending technology.
>>
>> This patch implements the binding and unbinding operations of xsk and
>> the virtio-net queue for xsk zero copy xmit.
>>
>> The xsk zero copy xmit depends on tx napi. So if tx napi is not enabled,
>> an error will be reported. And the entire operation is under the
>> protection of rtnl_lock.
>>
>> If xsk is active, it will prevent ethtool from modifying tx napi.
>>
>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
>> ---
>>   drivers/net/virtio_net.c | 78 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 77 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index f52a25091322..8242a9e9f17d 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -22,6 +22,7 @@
>>   #include <net/route.h>
>>   #include <net/xdp.h>
>>   #include <net/net_failover.h>
>> +#include <net/xdp_sock_drv.h>
>>
>>   static int napi_weight = NAPI_POLL_WEIGHT;
>>   module_param(napi_weight, int, 0444);
>> @@ -133,6 +134,11 @@ struct send_queue {
>>          struct virtnet_sq_stats stats;
>>
>>          struct napi_struct napi;
>> +
>> +       struct {
>> +               /* xsk pool */
>> +               struct xsk_buff_pool __rcu *pool;
>> +       } xsk;
>>   };
>>
>>   /* Internal representation of a receive virtqueue */
>> @@ -2249,8 +2255,19 @@ static int virtnet_set_coalesce(struct net_device *dev,
>>          if (napi_weight ^ vi->sq[0].napi.weight) {
>>                  if (dev->flags & IFF_UP)
>>                          return -EBUSY;
>> -               for (i = 0; i < vi->max_queue_pairs; i++)
>> +               for (i = 0; i < vi->max_queue_pairs; i++) {
>> +                       /* xsk xmit depend on the tx napi. So if xsk is active,
>> +                        * prevent modifications to tx napi.
>> +                        */
>> +                       rcu_read_lock();
>> +                       if (rcu_dereference(vi->sq[i].xsk.pool)) {
>> +                               rcu_read_unlock();
>> +                               continue;
>> +                       }
>> +                       rcu_read_unlock();
>> +
>>                          vi->sq[i].napi.weight = napi_weight;
>> +               }
>>          }
>>
>>          return 0;
>> @@ -2518,11 +2535,70 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>>          return err;
>>   }
>>
>> +static int virtnet_xsk_pool_enable(struct net_device *dev,
>> +                                  struct xsk_buff_pool *pool,
>> +                                  u16 qid)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       struct send_queue *sq;
>> +
>> +       if (qid >= vi->curr_queue_pairs)
>> +               return -EINVAL;
> Your implementation is the first implementation that only supports
> zerocopy for one out of Rx and Tx, and this will currently confuse the
> control plane in some situations since it assumes that both Rx and Tx
> are enabled by a call to this NDO. For example: user space creates an
> xsk socket with both an Rx and a Tx ring, then calls bind with the
> XDP_ZEROCOPY flag. In this case, the call should fail if the device is
> virtio-net since it only supports zerocopy for Tx. But it should
> succeed if the user only created a Tx ring since that makes it a
> Tx-only socket which can be supported.
>
> So you need to introduce a new interface in xdp_sock_drv.h that can be
> used to ask if this socket has Rx enabled and if so fail the call (at
> least one of them has to be enabled, otherwise the bind call would
> fail before this ndo is called). Then the logic above will act on that
> and try to fall back to copy mode (if allowed). Such an interface
> (with an added "is_tx_enabled") might in the future be useful for
> physical NIC drivers too if they would like to save on resources for
> Tx-only and Rx-only sockets. Currently, they all just assume every
> socket is Rx and Tx.


So if there's no blocker for implementing zerocopy RX, I think we'd
better implement it in this series without introducing new APIs for
the upper layer.

Thanks


>
> Thanks: Magnus
>
>> +
>> +       sq = &vi->sq[qid];
>> +
>> +       /* xsk zerocopy depend on the tx napi.
>> +        *
>> +        * xsk zerocopy xmit is driven by the tx interrupt. When the device is
>> +        * not busy, napi will be called continuously to send data. When the
>> +        * device is busy, wait for the notification interrupt after the
>> +        * hardware has finished processing the data, and continue to send data
>> +        * in napi.
>> +        */
>> +       if (!sq->napi.weight)
>> +               return -EPERM;
>> +
>> +       rcu_read_lock();
>> +       /* Here is already protected by rtnl_lock, so rcu_assign_pointer is
>> +        * safe.
>> +        */
>> +       rcu_assign_pointer(sq->xsk.pool, pool);
>> +       rcu_read_unlock();
>> +
>> +       return 0;
>> +}
>> +
>> +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       struct send_queue *sq;
>> +
>> +       if (qid >= vi->curr_queue_pairs)
>> +               return -EINVAL;
>> +
>> +       sq = &vi->sq[qid];
>> +
>> +       /* Here is already protected by rtnl_lock, so rcu_assign_pointer is
>> +        * safe.
>> +        */
>> +       rcu_assign_pointer(sq->xsk.pool, NULL);
>> +
>> +       synchronize_net(); /* Sync with the XSK wakeup and with NAPI. */
>> +
>> +       return 0;
>> +}
>> +
>>   static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>   {
>>          switch (xdp->command) {
>>          case XDP_SETUP_PROG:
>>                  return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
>> +       case XDP_SETUP_XSK_POOL:
>> +               if (xdp->xsk.pool)
>> +                       return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
>> +                                                      xdp->xsk.queue_id);
>> +               else
>> +                       return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
>>          default:
>>                  return -EINVAL;
>>          }
>> --
>> 2.31.0
>>


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-04-14  7:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13  3:15 [PATCH net-next v4 00/10] virtio-net support xdp socket zero copy xmit Xuan Zhuo
2021-04-13  3:15 ` [PATCH net-next v4 01/10] netdevice: priv_flags extend to 64bit Xuan Zhuo
2021-04-13 18:26   ` Jakub Kicinski
2021-04-13  3:15 ` [PATCH net-next v4 02/10] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR Xuan Zhuo
2021-04-13  3:15 ` [PATCH net-next v4 03/10] virtio-net: " Xuan Zhuo
2021-04-13  3:15 ` [PATCH net-next v4 04/10] xsk: support get page by addr Xuan Zhuo
2021-04-14  6:59   ` Magnus Karlsson
2021-04-13  3:15 ` [PATCH net-next v4 05/10] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR Xuan Zhuo
2021-04-14  7:00   ` Magnus Karlsson
2021-04-13  3:15 ` [PATCH net-next v4 06/10] virtio-net: unify the code for recycling the xmit ptr Xuan Zhuo
2021-04-14  3:32   ` Jason Wang
2021-04-13  3:15 ` [PATCH net-next v4 07/10] virtio-net: virtnet_poll_tx support budget check Xuan Zhuo
2021-04-14  3:34   ` Jason Wang
2021-04-13  3:15 ` [PATCH net-next v4 08/10] virtio-net: xsk zero copy xmit setup Xuan Zhuo
2021-04-14  4:01   ` Jason Wang
2021-04-14  7:36   ` Magnus Karlsson
2021-04-14  7:51     ` Jason Wang
2021-04-13  3:15 ` [PATCH net-next v4 09/10] virtio-net: xsk zero copy xmit implement wakeup and xmit Xuan Zhuo
2021-04-14  5:46   ` Jason Wang
2021-04-13  3:15 ` [PATCH net-next v4 10/10] virtio-net: xsk zero copy xmit kick by threshold Xuan Zhuo
2021-04-14  5:51   ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).