netdev.vger.kernel.org archive mirror
* [RFC PATCH net-next v13] vmxnet3: Add XDP support.
@ 2023-01-12 14:07 William Tu
  2023-01-13 15:43 ` Alexander Duyck
  2023-01-18 14:17 ` Alexander Lobakin
  0 siblings, 2 replies; 6+ messages in thread
From: William Tu @ 2023-01-12 14:07 UTC (permalink / raw)
  To: netdev
  Cc: jsankararama, gyang, doshir, alexander.duyck, gerhard,
	alexandr.lobakin, bang, tuc

The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.

Background:
The vmxnet3 rx path consists of three rings: ring0, ring1, and dataring.
For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and DMA
mapped to the ring's descriptor. If LRO is enabled and the packet size is
larger than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to map the rest
of the buffer beyond VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
allocated using alloc_page. So for LRO packets, the payload will be in one
buffer from r0 and multiple from r1; for non-LRO packets, only one
descriptor in r0 is used for packet sizes less than 3K.

When receiving a packet, the first descriptor will have the sop (start of
packet) bit set, and the last descriptor will have the eop (end of packet)
bit set. Non-LRO packets will have only one descriptor with both sop and
eop set.

Other than r0 and r1, the vmxnet3 dataring is specifically designed for
handling small packets, usually up to 128 bytes as defined in
VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
driver in ESXi to the ring's memory region in the front-end vmxnet3 driver,
in order to avoid memory mapping/unmapping overhead. In summary, packet size:
    A. < 128B: use dataring
    B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
    C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
As a result, the patch adds XDP support for packets using the dataring
and r0 (cases A and B), but not for the large packets seen when LRO is
enabled (case C).

XDP Implementation:
When a user loads an XDP prog, the vmxnet3 driver checks configurations such
as mtu and lro, and re-allocates the rx buffers to reserve the extra
headroom, XDP_PACKET_HEADROOM, for the XDP frame. The XDP prog will then be
associated with every rx queue of the device. Note that when using the
dataring for small packets, vmxnet3 (the front-end driver) doesn't control
the buffer allocation; as a result we allocate a new page and copy the
packet from the dataring into the XDP frame.

The receive side of XDP is implemented for cases A and B by invoking the
bpf program in vmxnet3_rq_rx_complete() and handling its returned action.
The vmxnet3_process_xdp() and vmxnet3_process_xdp_small() functions handle
the ring0 and dataring cases respectively, and decide what happens to the
packet afterward.
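
As a rough sketch, the per-packet action handling (condensed from
vmxnet3_run_xdp() in this patch, with stats and error handling omitted)
looks like:

    act = bpf_prog_run_xdp(prog, &xdp);
    switch (act) {
    case XDP_PASS:      /* caller builds an skb and continues normal rx */
        break;
    case XDP_REDIRECT:  /* hand the buffer over to the redirect target */
        xdp_do_redirect(netdev, &xdp, prog);
        break;
    case XDP_TX:        /* convert to an xdp_frame and queue it on a tx ring */
        vmxnet3_xdp_xmit_back(adapter, xdp_convert_buff_to_frame(&xdp));
        break;
    default:            /* XDP_ABORTED / XDP_DROP: recycle to the page pool */
        page_pool_recycle_direct(rq->page_pool,
                                 virt_to_head_page(xdp.data_hard_start));
        break;
    }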

For TX, vmxnet3 has a split-header design. Outgoing packets are parsed
first and the protocol headers (L2/L3/L4) are copied to the backend. The
rest of the payload is DMA mapped. Since XDP_TX does not parse the
packet protocol, the entire XDP frame is DMA mapped for transmission
and transmitted in a batch. Later on, the frame is freed and recycled
back to the memory pool.
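
Roughly, transmitting an XDP frame (condensed from vmxnet3_xdp_xmit_frame()
in this patch; descriptor bookkeeping and the doorbell write are omitted)
boils down to:

    if (dma_map)    /* ndo_xdp_xmit: frame redirected from another device */
        tbi->dma_addr = dma_map_single(&adapter->pdev->dev, xdpf->data,
                                       xdpf->len, DMA_TO_DEVICE);
    else            /* XDP_TX: frame is already backed by our page pool */
        tbi->dma_addr = page_pool_get_dma_addr(page) + XDP_PACKET_HEADROOM;

    /* one SOP+EOP descriptor covers the whole frame */
    gdesc->txd.addr = cpu_to_le64(tbi->dma_addr);
    gdesc->dword[3] = cpu_to_le32(VMXNET3_TXD_CQ | VMXNET3_TXD_EOP);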

Performance:
Tested using two VMs inside one ESXi vSphere 7.0 machine, using a single
core on each vmxnet3 device; the sender uses DPDK testpmd in tx-only mode
attached to a vmxnet3 device, sending 64B or 512B UDP packets.

VM1 txgen:
$ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
--forward-mode=txonly --eth-peer=0,<mac addr of vm2>
option: add "--txonly-multi-flow"
option: use --txpkts=512 or --txpkts=64

VM2 running XDP:
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
options: XDP_DROP, XDP_PASS, XDP_TX

To test REDIRECT to cpu 0, use
$ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop

Single core performance comparison with skb-mode.
64B:      skb-mode -> native-mode
XDP_DROP: 1.6Mpps -> 2.4Mpps
XDP_PASS: 338Kpps -> 367Kpps
XDP_TX:   1.1Mpps -> 2.3Mpps
REDIRECT-drop: 1.3Mpps -> 2.3Mpps

512B:     skb-mode -> native-mode
XDP_DROP: 863Kpps -> 1.3Mpps
XDP_PASS: 275Kpps -> 376Kpps
XDP_TX:   554Kpps -> 1.2Mpps
REDIRECT-drop: 659Kpps -> 1.2Mpps

Limitations:
a. LRO will be disabled when a user loads an XDP program
b. MTU will be checked and limited to
   VMXNET3_MAX_SKB_BUF_SIZE(3K) - XDP_PACKET_HEADROOM(256) -
   SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
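
For a rough sense of the resulting limit: assuming a typical x86_64
configuration where SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) comes
to about 320 bytes, this works out to roughly 3072 - 256 - 320 = 2496
bytes of MTU.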

Signed-off-by: William Tu <u9012063@gmail.com>
---
v12 -> v13:
- feedback from Guolin:
  Instead of returning -ENOTSUPP, disable LRO in
  netdev->features and print an error msg
- add Sankararaman Jayaraman into CC list
- v12..v13
https://github.com/williamtu/net-next/compare/v12..v13

v11 -> v12:
work on feedback from Alexander Duyck
- fix issues and refactor the vmxnet3_unmap_tx_buf and
  unmap_pkt

v10 -> v11:
work on feedback from Alexander Duyck
internal feedback from Guolin and Ronak
- fix the issue of xdp_return_frame_bulk, move to up level
  of vmxnet3_unmap_tx_buf and some refactoring
- refactor and simplify vmxnet3_tq_cleanup
- disable XDP when LRO is enabled, suggested by Ronak

v9 -> v10:
- Mark as RFC as we're waiting for internal review
Feedback from Alexander Duyck
- fix dma mapping leak of ndo_xdp_xmit case
- remove unused MAP_INVALID and adjust bitfield

v8 -> v9:
new
- add rxdataring support (need extra copy but still fast)
- update performance number (much better than v8!)
  https://youtu.be/4lm1CSCi78Q

- work on feedback from Alexander Duyck and Alexander Lobakin
Alexander Lobakin
- use xdp_return_frame_bulk and see some performance improvement
- use xdp_do_flush not xdp_do_flush_map
- fix several alignment issues, formatting issues, minor code
  restructure, remove 2 dead functions, unrelated add/del of
  new lines, add braces when logical ops nearby, endianness
  conversion
- remove several oneliner goto label
- anonymous union of skb and xdpf
- remove xdp_enabled and use xdp prog directly to check
- use bitfields macro --> I decided to do it later as
  there are many unrelated places that would need to change.

Alexander Duyck
- use bitfield for tbi map type
- anonymous union of skb and xdpf
- remove use of modulus ops, cpu % tq_number

others
- fix issue reported by the kernel test robot using sparse

v7 -> v8:
- work on feedback from Gerhard Engleder and Alexander
- change memory model to use page pool API, rewrite some of the
  XDP processing code
- attach bpf xdp prog to adapter, not per rx queue
- I referenced some of the XDP implementation from
  drivers/net/ethernet/mediatek and
  drivers/net/ethernet/stmicro/stmmac/
- drop support for rxdataring for this version
- redo performance evaluation and demo here
  https://youtu.be/T7_0yrCXCe0
- check using /sys/kernel/debug/kmemleak

v6 -> v7:
work on feedback from Alexander Duyck on XDP_TX and XDP_REDIRECT
- fix memleak of xdp frame when doing ndo_xdp_xmit (vmxnet3_xdp_xmit)
- at vmxnet3_xdp_xmit_frame, fix return values, -ENOSPC and -ENOMEM,
  and free the skb when dma mapping fails
- add xdp_buff_clean_frags_flag since we don't support frag
- update experiments of XDP_TX and XDP_REDIRECT
- for XDP_REDIRECT, I assume calling xdp_do_redirect will take
the buffer and free it, so I need to allocate a new buffer to
refill the rx ring. However, I hit some OOM when testing using
./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e <drop or pass>
- I couldn't find the reason, so I marked this patch as RFC

I tested the patch using the below script:
while [ true ]; do
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX
timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e pass
done
---
 drivers/net/Kconfig                   |   1 +
 drivers/net/vmxnet3/Makefile          |   2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c     | 225 ++++++++++++--
 drivers/net/vmxnet3/vmxnet3_ethtool.c |  14 +
 drivers/net/vmxnet3/vmxnet3_int.h     |  44 ++-
 drivers/net/vmxnet3/vmxnet3_xdp.c     | 425 ++++++++++++++++++++++++++
 drivers/net/vmxnet3/vmxnet3_xdp.h     |  41 +++
 7 files changed, 708 insertions(+), 44 deletions(-)
 create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.c
 create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.h

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9e63b8c43f3e..a4419d661bdd 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -571,6 +571,7 @@ config VMXNET3
 	tristate "VMware VMXNET3 ethernet driver"
 	depends on PCI && INET
 	depends on PAGE_SIZE_LESS_THAN_64KB
+	select PAGE_POOL
 	help
 	  This driver supports VMware's vmxnet3 virtual ethernet NIC.
 	  To compile this driver as a module, choose M here: the
diff --git a/drivers/net/vmxnet3/Makefile b/drivers/net/vmxnet3/Makefile
index a666a88ac1ff..f82870c10205 100644
--- a/drivers/net/vmxnet3/Makefile
+++ b/drivers/net/vmxnet3/Makefile
@@ -32,4 +32,4 @@
 
 obj-$(CONFIG_VMXNET3) += vmxnet3.o
 
-vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o
+vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o vmxnet3_xdp.o
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index d3e7b27eb933..0c3b8f6bae22 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -28,6 +28,7 @@
 #include <net/ip6_checksum.h>
 
 #include "vmxnet3_int.h"
+#include "vmxnet3_xdp.h"
 
 char vmxnet3_driver_name[] = "vmxnet3";
 #define VMXNET3_DRIVER_DESC "VMware vmxnet3 virtual NIC driver"
@@ -323,17 +324,18 @@ static u32 get_bitfield32(const __le32 *bitfield, u32 pos, u32 size)
 
 
 static void
-vmxnet3_unmap_tx_buf(struct vmxnet3_tx_buf_info *tbi,
-		     struct pci_dev *pdev)
+vmxnet3_unmap_tx_buf(struct vmxnet3_tx_buf_info *tbi, struct pci_dev *pdev)
 {
-	if (tbi->map_type == VMXNET3_MAP_SINGLE)
+	u32 map_type = tbi->map_type;
+
+	if (map_type & VMXNET3_MAP_SINGLE)
 		dma_unmap_single(&pdev->dev, tbi->dma_addr, tbi->len,
 				 DMA_TO_DEVICE);
-	else if (tbi->map_type == VMXNET3_MAP_PAGE)
+	else if (map_type & VMXNET3_MAP_PAGE)
 		dma_unmap_page(&pdev->dev, tbi->dma_addr, tbi->len,
 			       DMA_TO_DEVICE);
 	else
-		BUG_ON(tbi->map_type != VMXNET3_MAP_NONE);
+		BUG_ON((map_type & ~VMXNET3_MAP_XDP));
 
 	tbi->map_type = VMXNET3_MAP_NONE; /* to help debugging */
 }
@@ -341,25 +343,25 @@ vmxnet3_unmap_tx_buf(struct vmxnet3_tx_buf_info *tbi,
 
 static int
 vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
-		  struct pci_dev *pdev,	struct vmxnet3_adapter *adapter)
+		  struct pci_dev *pdev,	struct vmxnet3_adapter *adapter,
+		  struct xdp_frame_bulk *bq)
 {
-	struct sk_buff *skb;
+	struct vmxnet3_tx_buf_info *tbi;
 	int entries = 0;
+	u32 map_type;
 
 	/* no out of order completion */
 	BUG_ON(tq->buf_info[eop_idx].sop_idx != tq->tx_ring.next2comp);
 	BUG_ON(VMXNET3_TXDESC_GET_EOP(&(tq->tx_ring.base[eop_idx].txd)) != 1);
 
-	skb = tq->buf_info[eop_idx].skb;
-	BUG_ON(skb == NULL);
-	tq->buf_info[eop_idx].skb = NULL;
-
+	tbi = &tq->buf_info[eop_idx];
+	BUG_ON(tbi->skb == NULL);
+	map_type = tbi->map_type;
 	VMXNET3_INC_RING_IDX_ONLY(eop_idx, tq->tx_ring.size);
 
 	while (tq->tx_ring.next2comp != eop_idx) {
 		vmxnet3_unmap_tx_buf(tq->buf_info + tq->tx_ring.next2comp,
 				     pdev);
-
 		/* update next2comp w/o tx_lock. Since we are marking more,
 		 * instead of less, tx ring entries avail, the worst case is
 		 * that the tx routine incorrectly re-queues a pkt due to
@@ -369,7 +371,14 @@ vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
 		entries++;
 	}
 
-	dev_kfree_skb_any(skb);
+	if (map_type & VMXNET3_MAP_XDP)
+		xdp_return_frame_bulk(tbi->xdpf, bq);
+	else
+		dev_kfree_skb_any(tbi->skb);
+
+	/* xdpf and skb are in an anonymous union. */
+	tbi->skb = NULL;
+
 	return entries;
 }
 
@@ -379,8 +388,10 @@ vmxnet3_tq_tx_complete(struct vmxnet3_tx_queue *tq,
 			struct vmxnet3_adapter *adapter)
 {
 	int completed = 0;
+	struct xdp_frame_bulk bq;
 	union Vmxnet3_GenericDesc *gdesc;
 
+	xdp_frame_bulk_init(&bq);
 	gdesc = tq->comp_ring.base + tq->comp_ring.next2proc;
 	while (VMXNET3_TCD_GET_GEN(&gdesc->tcd) == tq->comp_ring.gen) {
 		/* Prevent any &gdesc->tcd field from being (speculatively)
@@ -390,11 +401,12 @@ vmxnet3_tq_tx_complete(struct vmxnet3_tx_queue *tq,
 
 		completed += vmxnet3_unmap_pkt(VMXNET3_TCD_GET_TXIDX(
 					       &gdesc->tcd), tq, adapter->pdev,
-					       adapter);
+					       adapter, &bq);
 
 		vmxnet3_comp_ring_adv_next2proc(&tq->comp_ring);
 		gdesc = tq->comp_ring.base + tq->comp_ring.next2proc;
 	}
+	xdp_flush_frame_bulk(&bq);
 
 	if (completed) {
 		spin_lock(&tq->tx_lock);
@@ -414,26 +426,34 @@ static void
 vmxnet3_tq_cleanup(struct vmxnet3_tx_queue *tq,
 		   struct vmxnet3_adapter *adapter)
 {
+	struct xdp_frame_bulk bq;
+	u32 map_type;
 	int i;
 
+	xdp_frame_bulk_init(&bq);
+
 	while (tq->tx_ring.next2comp != tq->tx_ring.next2fill) {
 		struct vmxnet3_tx_buf_info *tbi;
 
 		tbi = tq->buf_info + tq->tx_ring.next2comp;
+		map_type = tbi->map_type;
 
 		vmxnet3_unmap_tx_buf(tbi, adapter->pdev);
 		if (tbi->skb) {
-			dev_kfree_skb_any(tbi->skb);
+			if (map_type & VMXNET3_MAP_XDP)
+				xdp_return_frame_bulk(tbi->xdpf, &bq);
+			else
+				dev_kfree_skb_any(tbi->skb);
 			tbi->skb = NULL;
 		}
 		vmxnet3_cmd_ring_adv_next2comp(&tq->tx_ring);
 	}
 
-	/* sanity check, verify all buffers are indeed unmapped and freed */
-	for (i = 0; i < tq->tx_ring.size; i++) {
-		BUG_ON(tq->buf_info[i].skb != NULL ||
-		       tq->buf_info[i].map_type != VMXNET3_MAP_NONE);
-	}
+	xdp_flush_frame_bulk(&bq);
+
+	/* sanity check, verify all buffers are indeed unmapped */
+	for (i = 0; i < tq->tx_ring.size; i++)
+		BUG_ON(tq->buf_info[i].map_type != VMXNET3_MAP_NONE);
 
 	tq->tx_ring.gen = VMXNET3_INIT_GEN;
 	tq->tx_ring.next2fill = tq->tx_ring.next2comp = 0;
@@ -587,7 +607,17 @@ vmxnet3_rq_alloc_rx_buf(struct vmxnet3_rx_queue *rq, u32 ring_idx,
 		gd = ring->base + ring->next2fill;
 		rbi->comp_state = VMXNET3_RXD_COMP_PENDING;
 
-		if (rbi->buf_type == VMXNET3_RX_BUF_SKB) {
+		if (rbi->buf_type == VMXNET3_RX_BUF_XDP) {
+			void *data = vmxnet3_pp_get_buff(rq->page_pool,
+							 &rbi->dma_addr,
+							 GFP_KERNEL);
+			if (!data) {
+				rq->stats.rx_buf_alloc_failure++;
+				break;
+			}
+			rbi->pp_page = virt_to_head_page(data);
+			val = VMXNET3_RXD_BTYPE_HEAD << VMXNET3_RXD_BTYPE_SHIFT;
+		} else if (rbi->buf_type == VMXNET3_RX_BUF_SKB) {
 			if (rbi->skb == NULL) {
 				rbi->skb = __netdev_alloc_skb_ip_align(adapter->netdev,
 								       rbi->len,
@@ -1253,6 +1283,60 @@ vmxnet3_tq_xmit(struct sk_buff *skb, struct vmxnet3_tx_queue *tq,
 	return NETDEV_TX_OK;
 }
 
+static int
+vmxnet3_create_pp(struct vmxnet3_adapter *adapter,
+		  struct vmxnet3_rx_queue *rq, int size)
+{
+	const struct page_pool_params pp_params = {
+		.order = 0,
+		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+		.pool_size = size,
+		.nid = NUMA_NO_NODE,
+		.dev = &adapter->pdev->dev,
+		.offset = XDP_PACKET_HEADROOM,
+		.max_len = VMXNET3_XDP_MAX_MTU,
+		.dma_dir = DMA_BIDIRECTIONAL,
+	};
+	struct page_pool *pp;
+	int err;
+
+	pp = page_pool_create(&pp_params);
+	if (IS_ERR(pp))
+		return PTR_ERR(pp);
+
+	err = xdp_rxq_info_reg(&rq->xdp_rxq, adapter->netdev, rq->qid,
+			       rq->napi.napi_id);
+	if (err < 0)
+		goto err_free_pp;
+
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_POOL, pp);
+	if (err)
+		goto err_unregister_rxq;
+
+	rq->page_pool = pp;
+	return 0;
+
+err_unregister_rxq:
+	xdp_rxq_info_unreg(&rq->xdp_rxq);
+err_free_pp:
+	page_pool_destroy(pp);
+
+	return err;
+}
+
+void *
+vmxnet3_pp_get_buff(struct page_pool *pp, dma_addr_t *dma_addr,
+		    gfp_t gfp_mask)
+{
+	struct page *page;
+
+	page = page_pool_alloc_pages(pp, gfp_mask | __GFP_NOWARN);
+	if (!page)
+		return NULL;
+
+	*dma_addr = page_pool_get_dma_addr(page) + XDP_PACKET_HEADROOM;
+	return page_address(page);
+}
 
 static netdev_tx_t
 vmxnet3_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
@@ -1404,6 +1488,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 	struct Vmxnet3_RxDesc rxCmdDesc;
 	struct Vmxnet3_RxCompDesc rxComp;
 #endif
+	bool need_flush = 0;
+
 	vmxnet3_getRxComp(rcd, &rq->comp_ring.base[rq->comp_ring.next2proc].rcd,
 			  &rxComp);
 	while (rcd->gen == rq->comp_ring.gen) {
@@ -1444,6 +1530,31 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 			goto rcd_done;
 		}
 
+		if (rcd->sop && rcd->eop && vmxnet3_xdp_enabled(adapter)) {
+			struct sk_buff *skb_xdp_pass;
+			int act;
+
+			if (VMXNET3_RX_DATA_RING(adapter, rcd->rqID)) {
+				ctx->skb = NULL;
+				goto skip_xdp; /* Handle it later. */
+			}
+
+			if (rbi->buf_type != VMXNET3_RX_BUF_XDP)
+				goto rcd_done;
+
+			act = vmxnet3_process_xdp(adapter, rq, rcd, rbi, rxd,
+						  &skb_xdp_pass);
+			if (act == XDP_PASS) {
+				ctx->skb = skb_xdp_pass;
+				goto sop_done;
+			}
+			ctx->skb = NULL;
+			if (act == XDP_REDIRECT)
+				need_flush = true;
+			goto rcd_done;
+		}
+skip_xdp:
+
 		if (rcd->sop) { /* first buf of the pkt */
 			bool rxDataRingUsed;
 			u16 len;
@@ -1452,7 +1563,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 			       (rcd->rqID != rq->qid &&
 				rcd->rqID != rq->dataRingQid));
 
-			BUG_ON(rbi->buf_type != VMXNET3_RX_BUF_SKB);
+			BUG_ON(rbi->buf_type != VMXNET3_RX_BUF_SKB &&
+			       rbi->buf_type != VMXNET3_RX_BUF_XDP);
 			BUG_ON(ctx->skb != NULL || rbi->skb == NULL);
 
 			if (unlikely(rcd->len == 0)) {
@@ -1470,6 +1582,26 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 			rxDataRingUsed =
 				VMXNET3_RX_DATA_RING(adapter, rcd->rqID);
 			len = rxDataRingUsed ? rcd->len : rbi->len;
+
+			if (rxDataRingUsed && vmxnet3_xdp_enabled(adapter)) {
+				struct sk_buff *skb_xdp_pass;
+				size_t sz;
+				int act;
+
+				sz = rcd->rxdIdx * rq->data_ring.desc_size;
+				act = vmxnet3_process_xdp_small(adapter, rq,
+								&rq->data_ring.base[sz],
+								rcd->len,
+								&skb_xdp_pass);
+				if (act == XDP_PASS) {
+					ctx->skb = skb_xdp_pass;
+					goto sop_done;
+				}
+				if (act == XDP_REDIRECT)
+					need_flush = true;
+
+				goto rcd_done;
+			}
 			new_skb = netdev_alloc_skb_ip_align(adapter->netdev,
 							    len);
 			if (new_skb == NULL) {
@@ -1622,6 +1754,7 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 		}
 
 
+sop_done:
 		skb = ctx->skb;
 		if (rcd->eop) {
 			u32 mtu = adapter->netdev->mtu;
@@ -1730,6 +1863,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 		vmxnet3_getRxComp(rcd,
 				  &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
 	}
+	if (need_flush)
+		xdp_do_flush();
 
 	return num_pkts;
 }
@@ -1755,13 +1890,20 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
 				&rq->rx_ring[ring_idx].base[i].rxd, &rxDesc);
 
 			if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
-					rq->buf_info[ring_idx][i].skb) {
+			    rq->buf_info[ring_idx][i].pp_page &&
+			    rq->buf_info[ring_idx][i].buf_type ==
+			    VMXNET3_RX_BUF_XDP) {
+				page_pool_recycle_direct(rq->page_pool,
+							 rq->buf_info[ring_idx][i].pp_page);
+				rq->buf_info[ring_idx][i].pp_page = NULL;
+			} else if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
+				   rq->buf_info[ring_idx][i].skb) {
 				dma_unmap_single(&adapter->pdev->dev, rxd->addr,
 						 rxd->len, DMA_FROM_DEVICE);
 				dev_kfree_skb(rq->buf_info[ring_idx][i].skb);
 				rq->buf_info[ring_idx][i].skb = NULL;
 			} else if (rxd->btype == VMXNET3_RXD_BTYPE_BODY &&
-					rq->buf_info[ring_idx][i].page) {
+				   rq->buf_info[ring_idx][i].page) {
 				dma_unmap_page(&adapter->pdev->dev, rxd->addr,
 					       rxd->len, DMA_FROM_DEVICE);
 				put_page(rq->buf_info[ring_idx][i].page);
@@ -1786,9 +1928,9 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
 
 	for (i = 0; i < adapter->num_rx_queues; i++)
 		vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
+	rcu_assign_pointer(adapter->xdp_bpf_prog, NULL);
 }
 
-
 static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
 			       struct vmxnet3_adapter *adapter)
 {
@@ -1815,6 +1957,13 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
 		}
 	}
 
+	if (rq->page_pool) {
+		if (xdp_rxq_info_is_reg(&rq->xdp_rxq))
+			xdp_rxq_info_unreg(&rq->xdp_rxq);
+		page_pool_destroy(rq->page_pool);
+		rq->page_pool = NULL;
+	}
+
 	if (rq->data_ring.base) {
 		dma_free_coherent(&adapter->pdev->dev,
 				  rq->rx_ring[0].size * rq->data_ring.desc_size,
@@ -1858,14 +2007,16 @@ static int
 vmxnet3_rq_init(struct vmxnet3_rx_queue *rq,
 		struct vmxnet3_adapter  *adapter)
 {
-	int i;
+	int i, err;
 
 	/* initialize buf_info */
 	for (i = 0; i < rq->rx_ring[0].size; i++) {
 
-		/* 1st buf for a pkt is skbuff */
+		/* 1st buf for a pkt is skbuff or xdp page */
 		if (i % adapter->rx_buf_per_pkt == 0) {
-			rq->buf_info[0][i].buf_type = VMXNET3_RX_BUF_SKB;
+			rq->buf_info[0][i].buf_type = vmxnet3_xdp_enabled(adapter) ?
+						      VMXNET3_RX_BUF_XDP :
+						      VMXNET3_RX_BUF_SKB;
 			rq->buf_info[0][i].len = adapter->skb_buf_size;
 		} else { /* subsequent bufs for a pkt is frag */
 			rq->buf_info[0][i].buf_type = VMXNET3_RX_BUF_PAGE;
@@ -1886,6 +2037,12 @@ vmxnet3_rq_init(struct vmxnet3_rx_queue *rq,
 		rq->rx_ring[i].gen = VMXNET3_INIT_GEN;
 		rq->rx_ring[i].isOutOfOrder = 0;
 	}
+
+	err = vmxnet3_create_pp(adapter, rq,
+				rq->rx_ring[0].size + rq->rx_ring[1].size);
+	if (err)
+		return err;
+
 	if (vmxnet3_rq_alloc_rx_buf(rq, 0, rq->rx_ring[0].size - 1,
 				    adapter) == 0) {
 		/* at least has 1 rx buffer for the 1st ring */
@@ -1989,7 +2146,7 @@ vmxnet3_rq_create(struct vmxnet3_rx_queue *rq, struct vmxnet3_adapter *adapter)
 }
 
 
-static int
+int
 vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
 {
 	int i, err = 0;
@@ -2585,7 +2742,7 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
 	if (adapter->netdev->features & NETIF_F_RXCSUM)
 		devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
 
-	if (adapter->netdev->features & NETIF_F_LRO) {
+	if ((adapter->netdev->features & NETIF_F_LRO)) {
 		devRead->misc.uptFeatures |= UPT1_F_LRO;
 		devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
 	}
@@ -3026,7 +3183,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
 }
 
 
-static void
+void
 vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 {
 	size_t sz, i, ring0_size, ring1_size, comp_size;
@@ -3035,7 +3192,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 		if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
 					    VMXNET3_MAX_ETH_HDR_SIZE) {
 			adapter->skb_buf_size = adapter->netdev->mtu +
-						VMXNET3_MAX_ETH_HDR_SIZE;
+						VMXNET3_MAX_ETH_HDR_SIZE +
+						vmxnet3_xdp_headroom(adapter);
 			if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
 				adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
 
@@ -3563,7 +3721,6 @@ vmxnet3_reset_work(struct work_struct *data)
 	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
 }
 
-
 static int
 vmxnet3_probe_device(struct pci_dev *pdev,
 		     const struct pci_device_id *id)
@@ -3585,6 +3742,8 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 		.ndo_poll_controller = vmxnet3_netpoll,
 #endif
+		.ndo_bpf = vmxnet3_xdp,
+		.ndo_xdp_xmit = vmxnet3_xdp_xmit,
 	};
 	int err;
 	u32 ver;
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 18cf7c723201..6f542236b26e 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -76,6 +76,10 @@ vmxnet3_tq_driver_stats[] = {
 					 copy_skb_header) },
 	{ "  giant hdr",	offsetof(struct vmxnet3_tq_driver_stats,
 					 oversized_hdr) },
+	{ "  xdp xmit",		offsetof(struct vmxnet3_tq_driver_stats,
+					 xdp_xmit) },
+	{ "  xdp xmit err",	offsetof(struct vmxnet3_tq_driver_stats,
+					 xdp_xmit_err) },
 };
 
 /* per rq stats maintained by the device */
@@ -106,6 +110,16 @@ vmxnet3_rq_driver_stats[] = {
 					 drop_fcs) },
 	{ "  rx buf alloc fail", offsetof(struct vmxnet3_rq_driver_stats,
 					  rx_buf_alloc_failure) },
+	{ "     xdp packets", offsetof(struct vmxnet3_rq_driver_stats,
+				       xdp_packets) },
+	{ "     xdp tx", offsetof(struct vmxnet3_rq_driver_stats,
+				  xdp_tx) },
+	{ "     xdp redirects", offsetof(struct vmxnet3_rq_driver_stats,
+					 xdp_redirects) },
+	{ "     xdp drops", offsetof(struct vmxnet3_rq_driver_stats,
+				     xdp_drops) },
+	{ "     xdp aborted", offsetof(struct vmxnet3_rq_driver_stats,
+				       xdp_aborted) },
 };
 
 /* global stats maintained by the driver */
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 3367db23aa13..e2fda61f3d01 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -56,6 +56,9 @@
 #include <linux/if_arp.h>
 #include <linux/inetdevice.h>
 #include <linux/log2.h>
+#include <linux/bpf.h>
+#include <linux/skbuff.h>
+#include <net/page_pool.h>
 
 #include "vmxnet3_defs.h"
 
@@ -188,19 +191,20 @@ struct vmxnet3_tx_data_ring {
 	dma_addr_t          basePA;
 };
 
-enum vmxnet3_buf_map_type {
-	VMXNET3_MAP_INVALID = 0,
-	VMXNET3_MAP_NONE,
-	VMXNET3_MAP_SINGLE,
-	VMXNET3_MAP_PAGE,
-};
+#define VMXNET3_MAP_NONE	0
+#define VMXNET3_MAP_SINGLE	BIT(0)
+#define VMXNET3_MAP_PAGE	BIT(1)
+#define VMXNET3_MAP_XDP		BIT(2)
 
 struct vmxnet3_tx_buf_info {
 	u32      map_type;
 	u16      len;
 	u16      sop_idx;
 	dma_addr_t  dma_addr;
-	struct sk_buff *skb;
+	union {
+		struct sk_buff *skb;
+		struct xdp_frame *xdpf;
+	};
 };
 
 struct vmxnet3_tq_driver_stats {
@@ -217,6 +221,9 @@ struct vmxnet3_tq_driver_stats {
 	u64 linearized;         /* # of pkts linearized */
 	u64 copy_skb_header;    /* # of times we have to copy skb header */
 	u64 oversized_hdr;
+
+	u64 xdp_xmit;
+	u64 xdp_xmit_err;
 };
 
 struct vmxnet3_tx_ctx {
@@ -253,12 +260,13 @@ struct vmxnet3_tx_queue {
 						    * stopped */
 	int				qid;
 	u16				txdata_desc_size;
-} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+} ____cacheline_aligned;
 
 enum vmxnet3_rx_buf_type {
 	VMXNET3_RX_BUF_NONE = 0,
 	VMXNET3_RX_BUF_SKB = 1,
-	VMXNET3_RX_BUF_PAGE = 2
+	VMXNET3_RX_BUF_PAGE = 2,
+	VMXNET3_RX_BUF_XDP = 3
 };
 
 #define VMXNET3_RXD_COMP_PENDING        0
@@ -271,6 +279,7 @@ struct vmxnet3_rx_buf_info {
 	union {
 		struct sk_buff *skb;
 		struct page    *page;
+		struct page    *pp_page; /* Page Pool for XDP frame */
 	};
 	dma_addr_t dma_addr;
 };
@@ -285,6 +294,12 @@ struct vmxnet3_rq_driver_stats {
 	u64 drop_err;
 	u64 drop_fcs;
 	u64 rx_buf_alloc_failure;
+
+	u64 xdp_packets;	/* Total packets processed by XDP. */
+	u64 xdp_tx;
+	u64 xdp_redirects;
+	u64 xdp_drops;
+	u64 xdp_aborted;
 };
 
 struct vmxnet3_rx_data_ring {
@@ -307,7 +322,9 @@ struct vmxnet3_rx_queue {
 	struct vmxnet3_rx_buf_info     *buf_info[2];
 	struct Vmxnet3_RxQueueCtrl            *shared;
 	struct vmxnet3_rq_driver_stats  stats;
-} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+	struct page_pool *page_pool;
+	struct xdp_rxq_info xdp_rxq;
+} ____cacheline_aligned;
 
 #define VMXNET3_DEVICE_MAX_TX_QUEUES 32
 #define VMXNET3_DEVICE_MAX_RX_QUEUES 32   /* Keep this value as a power of 2 */
@@ -415,6 +432,7 @@ struct vmxnet3_adapter {
 	u16    tx_prod_offset;
 	u16    rx_prod_offset;
 	u16    rx_prod2_offset;
+	struct bpf_prog __rcu *xdp_bpf_prog;
 };
 
 #define VMXNET3_WRITE_BAR0_REG(adapter, reg, val)  \
@@ -490,6 +508,12 @@ vmxnet3_tq_destroy_all(struct vmxnet3_adapter *adapter);
 void
 vmxnet3_rq_destroy_all(struct vmxnet3_adapter *adapter);
 
+int
+vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter);
+
+void
+vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter);
+
 netdev_features_t
 vmxnet3_fix_features(struct net_device *netdev, netdev_features_t features);
 
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
new file mode 100644
index 000000000000..94b7abf6fa04
--- /dev/null
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -0,0 +1,425 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Linux driver for VMware's vmxnet3 ethernet NIC.
+ * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
+ * Maintained by: pv-drivers@vmware.com
+ *
+ */
+
+#include "vmxnet3_int.h"
+#include "vmxnet3_xdp.h"
+
+static void
+vmxnet3_xdp_exchange_program(struct vmxnet3_adapter *adapter,
+			     struct bpf_prog *prog)
+{
+	rcu_assign_pointer(adapter->xdp_bpf_prog, prog);
+}
+
+static int
+vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
+		struct netlink_ext_ack *extack)
+{
+	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
+	struct bpf_prog *new_bpf_prog = bpf->prog;
+	struct bpf_prog *old_bpf_prog;
+	bool need_update;
+	bool running;
+	int err = 0;
+
+	if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
+		NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
+		return -EOPNOTSUPP;
+	}
+
+	if ((adapter->netdev->features & NETIF_F_LRO)) {
+		netdev_err(adapter->netdev, "LRO is not supported with XDP");
+		adapter->netdev->features &= ~NETIF_F_LRO;
+	}
+
+	old_bpf_prog = rcu_dereference(adapter->xdp_bpf_prog);
+	if (!new_bpf_prog && !old_bpf_prog)
+		return 0;
+
+	running = netif_running(netdev);
+	need_update = !!old_bpf_prog != !!new_bpf_prog;
+
+	if (running && need_update)
+		vmxnet3_quiesce_dev(adapter);
+
+	vmxnet3_xdp_exchange_program(adapter, new_bpf_prog);
+	if (old_bpf_prog)
+		bpf_prog_put(old_bpf_prog);
+
+	if (!running || !need_update)
+		return 0;
+
+	vmxnet3_reset_dev(adapter);
+	vmxnet3_rq_destroy_all(adapter);
+	vmxnet3_adjust_rx_ring_size(adapter);
+	err = vmxnet3_rq_create_all(adapter);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "failed to re-create rx queues for XDP.");
+		err = -EOPNOTSUPP;
+		return err;
+	}
+	err = vmxnet3_activate_dev(adapter);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "failed to activate device for XDP.");
+		err = -EOPNOTSUPP;
+		return err;
+	}
+	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
+	return err;
+}
+
+/* This is the main xdp call used by kernel to set/unset eBPF program. */
+int
+vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf)
+{
+	switch (bpf->command) {
+	case XDP_SETUP_PROG:
+		return vmxnet3_xdp_set(netdev, bpf, bpf->extack);
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+bool
+vmxnet3_xdp_enabled(struct vmxnet3_adapter *adapter)
+{
+	return !!rcu_access_pointer(adapter->xdp_bpf_prog);
+}
+
+int
+vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter)
+{
+	return vmxnet3_xdp_enabled(adapter) ? VMXNET3_XDP_PAD : 0;
+}
+
+static int
+vmxnet3_xdp_xmit_frame(struct vmxnet3_adapter *adapter,
+		       struct xdp_frame *xdpf,
+		       struct vmxnet3_tx_queue *tq, bool dma_map)
+{
+	struct vmxnet3_tx_buf_info *tbi = NULL;
+	union Vmxnet3_GenericDesc *gdesc;
+	struct vmxnet3_tx_ctx ctx;
+	int tx_num_deferred;
+	struct page *page;
+	u32 buf_size;
+	int ret = 0;
+	u32 dw2;
+
+	dw2 = (tq->tx_ring.gen ^ 0x1) << VMXNET3_TXD_GEN_SHIFT;
+	dw2 |= xdpf->len;
+	ctx.sop_txd = tq->tx_ring.base + tq->tx_ring.next2fill;
+	gdesc = ctx.sop_txd;
+
+	buf_size = xdpf->len;
+	tbi = tq->buf_info + tq->tx_ring.next2fill;
+
+	if (vmxnet3_cmd_ring_desc_avail(&tq->tx_ring) == 0) {
+		tq->stats.tx_ring_full++;
+		return -ENOSPC;
+	}
+
+	tbi->map_type = VMXNET3_MAP_XDP;
+	if (dma_map) { /* ndo_xdp_xmit */
+		tbi->dma_addr = dma_map_single(&adapter->pdev->dev,
+					       xdpf->data, buf_size,
+					       DMA_TO_DEVICE);
+		if (dma_mapping_error(&adapter->pdev->dev, tbi->dma_addr))
+			return -EFAULT;
+		tbi->map_type |= VMXNET3_MAP_SINGLE;
+	} else { /* XDP buffer from page pool */
+		page = virt_to_head_page(xdpf->data);
+		tbi->dma_addr = page_pool_get_dma_addr(page) +
+				XDP_PACKET_HEADROOM;
+		dma_sync_single_for_device(&adapter->pdev->dev,
+					   tbi->dma_addr, buf_size,
+					   DMA_BIDIRECTIONAL);
+	}
+	tbi->xdpf = xdpf;
+	tbi->len = buf_size;
+
+	gdesc = tq->tx_ring.base + tq->tx_ring.next2fill;
+	WARN_ON_ONCE(gdesc->txd.gen == tq->tx_ring.gen);
+
+	gdesc->txd.addr = cpu_to_le64(tbi->dma_addr);
+	gdesc->dword[2] = cpu_to_le32(dw2);
+
+	/* Setup the EOP desc */
+	gdesc->dword[3] = cpu_to_le32(VMXNET3_TXD_CQ | VMXNET3_TXD_EOP);
+
+	gdesc->txd.om = 0;
+	gdesc->txd.msscof = 0;
+	gdesc->txd.hlen = 0;
+	gdesc->txd.ti = 0;
+
+	tx_num_deferred = le32_to_cpu(tq->shared->txNumDeferred);
+	le32_add_cpu(&tq->shared->txNumDeferred, 1);
+	tx_num_deferred++;
+
+	vmxnet3_cmd_ring_adv_next2fill(&tq->tx_ring);
+
+	/* set the last buf_info for the pkt */
+	tbi->sop_idx = ctx.sop_txd - tq->tx_ring.base;
+
+	dma_wmb();
+	gdesc->dword[2] = cpu_to_le32(le32_to_cpu(gdesc->dword[2]) ^
+						  VMXNET3_TXD_GEN);
+
+	if (tx_num_deferred >= le32_to_cpu(tq->shared->txThreshold)) {
+		tq->shared->txNumDeferred = 0;
+		VMXNET3_WRITE_BAR0_REG(adapter,
+				       VMXNET3_REG_TXPROD + tq->qid * 8,
+				       tq->tx_ring.next2fill);
+	}
+	return ret;
+}
+
+static int
+vmxnet3_xdp_xmit_back(struct vmxnet3_adapter *adapter,
+		      struct xdp_frame *xdpf)
+{
+	struct vmxnet3_tx_queue *tq;
+	struct netdev_queue *nq;
+	int err = 0, cpu;
+	int tq_number;
+
+	tq_number = adapter->num_tx_queues;
+	cpu = smp_processor_id();
+	if (likely(cpu < tq_number))
+		tq = &adapter->tx_queue[cpu];
+	else
+		tq = &adapter->tx_queue[reciprocal_scale(cpu, tq_number)];
+	if (tq->stopped)
+		return -ENETDOWN;
+
+	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
+
+	__netif_tx_lock(nq, cpu);
+	err = vmxnet3_xdp_xmit_frame(adapter, xdpf, tq, false);
+	__netif_tx_unlock(nq);
+	return err;
+}
+
+/* ndo_xdp_xmit */
+int
+vmxnet3_xdp_xmit(struct net_device *dev,
+		 int n, struct xdp_frame **frames, u32 flags)
+{
+	struct vmxnet3_adapter *adapter;
+	struct vmxnet3_tx_queue *tq;
+	struct netdev_queue *nq;
+	int i, err, cpu;
+	int tq_number;
+
+	adapter = netdev_priv(dev);
+
+	if (unlikely(test_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)))
+		return -ENETDOWN;
+	if (unlikely(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state)))
+		return -EINVAL;
+
+	tq_number = adapter->num_tx_queues;
+	cpu = smp_processor_id();
+	if (likely(cpu < tq_number))
+		tq = &adapter->tx_queue[cpu];
+	else
+		tq = &adapter->tx_queue[reciprocal_scale(cpu, tq_number)];
+	if (tq->stopped)
+		return -ENETDOWN;
+
+	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
+
+	for (i = 0; i < n; i++) {
+		err = vmxnet3_xdp_xmit_frame(adapter, frames[i], tq, true);
+		if (err) {
+			tq->stats.xdp_xmit_err++;
+			break;
+		}
+	}
+	tq->stats.xdp_xmit += i;
+
+	return i;
+}
+
+static int
+vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp)
+{
+	struct xdp_frame *xdpf;
+	struct bpf_prog *prog;
+	int err;
+	u32 act;
+
+	prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
+	act = bpf_prog_run_xdp(prog, xdp);
+	rq->stats.xdp_packets++;
+
+	switch (act) {
+	case XDP_PASS:
+		return act;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rq->adapter->netdev, xdp, prog);
+		if (!err)
+			rq->stats.xdp_redirects++;
+		else
+			rq->stats.xdp_drops++;
+		return act;
+	case XDP_TX:
+		xdpf = xdp_convert_buff_to_frame(xdp);
+		if (!xdpf || vmxnet3_xdp_xmit_back(rq->adapter, xdpf)) {
+			rq->stats.xdp_drops++;
+			page_pool_recycle_direct(rq->page_pool,
+				 virt_to_head_page(xdp->data_hard_start));
+		} else {
+			rq->stats.xdp_tx++;
+		}
+		return act;
+	default:
+		bpf_warn_invalid_xdp_action(rq->adapter->netdev, prog, act);
+		fallthrough;
+	case XDP_ABORTED:
+		trace_xdp_exception(rq->adapter->netdev, prog, act);
+		rq->stats.xdp_aborted++;
+		break;
+	case XDP_DROP:
+		rq->stats.xdp_drops++;
+		break;
+	}
+
+	page_pool_recycle_direct(rq->page_pool,
+				 virt_to_head_page(xdp->data_hard_start));
+	return act;
+}
+
+static struct sk_buff *
+vmxnet3_build_skb(struct vmxnet3_rx_queue *rq, struct page *page,
+		  const struct xdp_buff *xdp)
+{
+	struct sk_buff *skb;
+
+	skb = build_skb(page_address(page), PAGE_SIZE);
+	if (unlikely(!skb)) {
+		page_pool_recycle_direct(rq->page_pool, page);
+		rq->stats.rx_buf_alloc_failure++;
+		return NULL;
+	}
+
+	/* bpf prog might change len and data position. */
+	skb_reserve(skb, xdp->data - xdp->data_hard_start);
+	skb_put(skb, xdp->data_end - xdp->data);
+	skb_mark_for_recycle(skb);
+
+	return skb;
+}
+
+/* Handle packets from DataRing. */
+int
+vmxnet3_process_xdp_small(struct vmxnet3_adapter *adapter,
+			  struct vmxnet3_rx_queue *rq,
+			  void *data, int len,
+			  struct sk_buff **skb_xdp_pass)
+{
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	struct page *page;
+	int act;
+
+	page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
+	if (!page) {
+		rq->stats.rx_buf_alloc_failure++;
+		return XDP_DROP;
+	}
+
+	xdp_init_buff(&xdp, PAGE_SIZE, &rq->xdp_rxq);
+	xdp_prepare_buff(&xdp, page_address(page), XDP_PACKET_HEADROOM,
+			 len, false);
+	xdp_buff_clear_frags_flag(&xdp);
+
+	/* Must copy the data because it's at dataring. */
+	memcpy(xdp.data, data, len);
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		page_pool_recycle_direct(rq->page_pool, page);
+		act = XDP_PASS;
+		goto out_skb;
+	}
+	act = vmxnet3_run_xdp(rq, &xdp);
+	rcu_read_unlock();
+
+out_skb:
+	if (act == XDP_PASS) {
+		*skb_xdp_pass = vmxnet3_build_skb(rq, page, &xdp);
+		if (!skb_xdp_pass)
+			return XDP_DROP;
+	}
+
+	/* No need to refill. */
+	return act;
+}
+
+int
+vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
+		    struct vmxnet3_rx_queue *rq,
+		    struct Vmxnet3_RxCompDesc *rcd,
+		    struct vmxnet3_rx_buf_info *rbi,
+		    struct Vmxnet3_RxDesc *rxd,
+		    struct sk_buff **skb_xdp_pass)
+{
+	struct bpf_prog *xdp_prog;
+	dma_addr_t new_dma_addr;
+	struct xdp_buff xdp;
+	struct page *page;
+	void *new_data;
+	int act;
+
+	page = rbi->pp_page;
+	dma_sync_single_for_cpu(&adapter->pdev->dev,
+				page_pool_get_dma_addr(page) +
+				XDP_PACKET_HEADROOM, rcd->len,
+				page_pool_get_dma_dir(rq->page_pool));
+
+	xdp_init_buff(&xdp, rbi->len, &rq->xdp_rxq);
+	xdp_prepare_buff(&xdp, page_address(page), XDP_PACKET_HEADROOM,
+			 rcd->len, false);
+	xdp_buff_clear_frags_flag(&xdp);
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		act = XDP_PASS;
+		goto refill;
+	}
+	act = vmxnet3_run_xdp(rq, &xdp);
+	rcu_read_unlock();
+
+	if (act == XDP_PASS) {
+		*skb_xdp_pass = vmxnet3_build_skb(rq, page, &xdp);
+		if (!skb_xdp_pass)
+			act = XDP_DROP;
+	}
+
+refill:
+	new_data = vmxnet3_pp_get_buff(rq->page_pool, &new_dma_addr,
+				       GFP_ATOMIC);
+	if (!new_data) {
+		rq->stats.rx_buf_alloc_failure++;
+		return XDP_DROP;
+	}
+	rbi->pp_page = virt_to_head_page(new_data);
+	rbi->dma_addr = new_dma_addr;
+	rxd->addr = cpu_to_le64(rbi->dma_addr);
+	rxd->len = rbi->len;
+
+	return act;
+}
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.h b/drivers/net/vmxnet3/vmxnet3_xdp.h
new file mode 100644
index 000000000000..c182f811cf5f
--- /dev/null
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * Linux driver for VMware's vmxnet3 ethernet NIC.
+ * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
+ * Maintained by: pv-drivers@vmware.com
+ *
+ */
+
+#ifndef _VMXNET3_XDP_H
+#define _VMXNET3_XDP_H
+
+#include <linux/filter.h>
+#include <linux/bpf_trace.h>
+#include <linux/netlink.h>
+#include <net/page_pool.h>
+#include <net/xdp.h>
+
+#include "vmxnet3_int.h"
+
+#define VMXNET3_XDP_PAD (SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + \
+			 XDP_PACKET_HEADROOM)
+#define VMXNET3_XDP_MAX_MTU (PAGE_SIZE - VMXNET3_XDP_PAD)
+
+int vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf);
+bool vmxnet3_xdp_enabled(struct vmxnet3_adapter *adapter);
+int vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter);
+int vmxnet3_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
+		     u32 flags);
+int vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
+			struct vmxnet3_rx_queue *rq,
+			struct Vmxnet3_RxCompDesc *rcd,
+			struct vmxnet3_rx_buf_info *rbi,
+			struct Vmxnet3_RxDesc *rxd,
+			struct sk_buff **skb_xdp_pass);
+int vmxnet3_process_xdp_small(struct vmxnet3_adapter *adapter,
+			      struct vmxnet3_rx_queue *rq,
+			      void *data, int len,
+			      struct sk_buff **skb_xdp_pass);
+void *vmxnet3_pp_get_buff(struct page_pool *pp, dma_addr_t *dma_addr,
+			  gfp_t gfp_mask);
+#endif
-- 
2.25.1



* Re: [RFC PATCH net-next v13] vmxnet3: Add XDP support.
  2023-01-12 14:07 [RFC PATCH net-next v13] vmxnet3: Add XDP support William Tu
@ 2023-01-13 15:43 ` Alexander Duyck
  2023-01-18 14:17 ` Alexander Lobakin
  1 sibling, 0 replies; 6+ messages in thread
From: Alexander Duyck @ 2023-01-13 15:43 UTC (permalink / raw)
  To: William Tu
  Cc: netdev, jsankararama, gyang, doshir, gerhard, alexandr.lobakin,
	bang, tuc

On Thu, Jan 12, 2023 at 6:07 AM William Tu <u9012063@gmail.com> wrote:
>
> The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
>
> Background:
> The vmxnet3 rx path consists of three rings: ring0, ring1, and dataring.
> For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and DMA
> mapped to the ring's descriptor. If LRO is enabled and the packet size is
> larger than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to map the rest
> of the buffer beyond VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
> allocated using alloc_page. So for LRO packets, the payload will be in one
> buffer from r0 and multiple from r1; for non-LRO packets, only one
> descriptor in r0 is used for packet sizes less than 3K.
>
> When receiving a packet, the first descriptor will have the sop (start of
> packet) bit set, and the last descriptor will have the eop (end of packet)
> bit set. Non-LRO packets will have only one descriptor with both sop and
> eop set.
>
> Other than r0 and r1, the vmxnet3 dataring is specifically designed for
> handling small packets, usually up to 128 bytes as defined in
> VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
> driver in ESXi to the ring's memory region in the front-end vmxnet3 driver,
> in order to avoid memory mapping/unmapping overhead. In summary, packet size:
>     A. < 128B: use dataring
>     B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
>     C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
> As a result, the patch adds XDP support for packets using the dataring
> and r0 (cases A and B), but not for the large packets seen when LRO is
> enabled (case C).
>
> XDP Implementation:
> When a user loads an XDP prog, the vmxnet3 driver checks configurations such
> as mtu and lro, and re-allocates the rx buffers to reserve the extra
> headroom, XDP_PACKET_HEADROOM, for the XDP frame. The XDP prog will then be
> associated with every rx queue of the device. Note that when using the
> dataring for small packets, vmxnet3 (the front-end driver) doesn't control
> the buffer allocation; as a result we allocate a new page and copy the
> packet from the dataring into the XDP frame.
>
> The receive side of XDP is implemented for cases A and B by invoking the
> bpf program in vmxnet3_rq_rx_complete() and handling its returned action.
> The vmxnet3_process_xdp() and vmxnet3_process_xdp_small() functions handle
> the ring0 and dataring cases respectively, and decide what happens to the
> packet afterward.
>
> For TX, vmxnet3 has a split-header design. Outgoing packets are parsed
> first and the protocol headers (L2/L3/L4) are copied to the backend. The
> rest of the payload is DMA mapped. Since XDP_TX does not parse the
> packet protocol, the entire XDP frame is DMA mapped for transmission
> and transmitted in a batch. Later on, the frame is freed and recycled
> back to the memory pool.
>
> Performance:
> Tested using two VMs inside one ESXi vSphere 7.0 machine, using a single
> core on each vmxnet3 device; the sender uses DPDK testpmd in tx-only mode
> attached to a vmxnet3 device, sending 64B or 512B UDP packets.
>
> VM1 txgen:
> $ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
> --forward-mode=txonly --eth-peer=0,<mac addr of vm2>
> option: add "--txonly-multi-flow"
> option: use --txpkts=512 or --txpkts=64
>
> VM2 running XDP:
> $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
> $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
> options: XDP_DROP, XDP_PASS, XDP_TX
>
> To test REDIRECT to cpu 0, use
> $ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
>
> Single core performance comparison with skb-mode.
> 64B:      skb-mode -> native-mode
> XDP_DROP: 1.6Mpps -> 2.4Mpps
> XDP_PASS: 338Kpps -> 367Kpps
> XDP_TX:   1.1Mpps -> 2.3Mpps
> REDIRECT-drop: 1.3Mpps -> 2.3Mpps
>
> 512B:     skb-mode -> native-mode
> XDP_DROP: 863Kpps -> 1.3Mpps
> XDP_PASS: 275Kpps -> 376Kpps
> XDP_TX:   554Kpps -> 1.2Mpps
> REDIRECT-drop: 659Kpps -> 1.2Mpps
>
> Limitations:
> a. LRO will be disabled when a user loads an XDP program
> b. MTU will be checked and limited to
>    VMXNET3_MAX_SKB_BUF_SIZE(3K) - XDP_PACKET_HEADROOM(256) -
>    SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>
> Signed-off-by: William Tu <u9012063@gmail.com>

I don't see anything else that jumps out at me as needing to be addressed.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>


* Re: [RFC PATCH net-next v13] vmxnet3: Add XDP support.
  2023-01-12 14:07 [RFC PATCH net-next v13] vmxnet3: Add XDP support William Tu
  2023-01-13 15:43 ` Alexander Duyck
@ 2023-01-18 14:17 ` Alexander Lobakin
  2023-01-21 18:29   ` William Tu
  1 sibling, 1 reply; 6+ messages in thread
From: Alexander Lobakin @ 2023-01-18 14:17 UTC (permalink / raw)
  To: William Tu
  Cc: netdev, jsankararama, gyang, doshir, alexander.duyck, gerhard, bang, tuc

From: William Tu <u9012063@gmail.com>
Date: Thu, 12 Jan 2023 06:07:43 -0800

First of all, sorry for the huge delay. I have a couple ten thousand
lines of code to review this week, not counting LKML submissions >_<

> The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.

[...]

> -	skb = tq->buf_info[eop_idx].skb;
> -	BUG_ON(skb == NULL);
> -	tq->buf_info[eop_idx].skb = NULL;
> -
> +	tbi = &tq->buf_info[eop_idx];
> +	BUG_ON(tbi->skb == NULL);
> +	map_type = tbi->map_type;
>  	VMXNET3_INC_RING_IDX_ONLY(eop_idx, tq->tx_ring.size);
>  
>  	while (tq->tx_ring.next2comp != eop_idx) {
>  		vmxnet3_unmap_tx_buf(tq->buf_info + tq->tx_ring.next2comp,
>  				     pdev);
> -

Nit: please try to avoid such unrelated changes. Moreover, I feel like
it's better for readability to have a newline here, so see no reason to
remove it.

>  		/* update next2comp w/o tx_lock. Since we are marking more,
>  		 * instead of less, tx ring entries avail, the worst case is
>  		 * that the tx routine incorrectly re-queues a pkt due to
> @@ -369,7 +371,14 @@ vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
>  		entries++;
>  	}
>  
> -	dev_kfree_skb_any(skb);
> +	if (map_type & VMXNET3_MAP_XDP)
> +		xdp_return_frame_bulk(tbi->xdpf, bq);
> +	else
> +		dev_kfree_skb_any(tbi->skb);

Not really related to XDP, but maybe for some future improvements: this
function is to be called inside the BH context only, so using
napi_consume_skb() would give you some nice perf improvement.

> +
> +	/* xdpf and skb are in an anonymous union. */
> +	tbi->skb = NULL;
> +
>  	return entries;
>  }
>  
> @@ -379,8 +388,10 @@ vmxnet3_tq_tx_complete(struct vmxnet3_tx_queue *tq,
>  			struct vmxnet3_adapter *adapter)
>  {
>  	int completed = 0;
> +	struct xdp_frame_bulk bq;
>  	union Vmxnet3_GenericDesc *gdesc;

RCT style of declarations?

>  
> +	xdp_frame_bulk_init(&bq);
>  	gdesc = tq->comp_ring.base + tq->comp_ring.next2proc;
>  	while (VMXNET3_TCD_GET_GEN(&gdesc->tcd) == tq->comp_ring.gen) {
>  		/* Prevent any &gdesc->tcd field from being (speculatively)

[...]

> @@ -1253,6 +1283,60 @@ vmxnet3_tq_xmit(struct sk_buff *skb, struct vmxnet3_tx_queue *tq,
>  	return NETDEV_TX_OK;
>  }
>  
> +static int
> +vmxnet3_create_pp(struct vmxnet3_adapter *adapter,
> +		  struct vmxnet3_rx_queue *rq, int size)
> +{
> +	const struct page_pool_params pp_params = {
> +		.order = 0,

Nit: it will be zeroed implicitly, so can be omitted. OTOH if you want
to explicitly say that you always use order-0 pages only, you can leave
it here.

> +		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> +		.pool_size = size,
> +		.nid = NUMA_NO_NODE,
> +		.dev = &adapter->pdev->dev,
> +		.offset = XDP_PACKET_HEADROOM,

Curious, on which architectures does this driver work in the real world?
Is it x86 only or maybe 64-bit systems only? Because not having
%NET_IP_ALIGN here will significantly slow down Rx on the systems where
it's defined as 2, not 0 (those systems can't stand unaligned access).

> +		.max_len = VMXNET3_XDP_MAX_MTU,
> +		.dma_dir = DMA_BIDIRECTIONAL,
> +	};
> +	struct page_pool *pp;
> +	int err;
> +
> +	pp = page_pool_create(&pp_params);
> +	if (IS_ERR(pp))
> +		return PTR_ERR(pp);
> +
> +	err = xdp_rxq_info_reg(&rq->xdp_rxq, adapter->netdev, rq->qid,
> +			       rq->napi.napi_id);
> +	if (err < 0)
> +		goto err_free_pp;
> +
> +	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_POOL, pp);
> +	if (err)
> +		goto err_unregister_rxq;
> +
> +	rq->page_pool = pp;

Nit: newline here?

> +	return 0;
> +
> +err_unregister_rxq:
> +	xdp_rxq_info_unreg(&rq->xdp_rxq);
> +err_free_pp:
> +	page_pool_destroy(pp);
> +
> +	return err;
> +}
> +
> +void *
> +vmxnet3_pp_get_buff(struct page_pool *pp, dma_addr_t *dma_addr,
> +		    gfp_t gfp_mask)
> +{
> +	struct page *page;
> +
> +	page = page_pool_alloc_pages(pp, gfp_mask | __GFP_NOWARN);
> +	if (!page)

unlikely()? It's error/exception path, you will never hit this branch
under normal conditions.

> +		return NULL;
> +
> +	*dma_addr = page_pool_get_dma_addr(page) + XDP_PACKET_HEADROOM;

Hmm, I'd rather say:

	*dma_addr = page_pool_get_dma_addr(page) + pp->p.offset;

Then you'd need to adujst offset only once in the function where you
create Page Pool if/when you need to change the Rx offset.
With the current code, it's easy to forget you need to change it in two
places.
Alternatively, you could define something like

#define VMXNET3_RX_OFFSET	XDP_PACKET_HEADROOM

and use it here and on Page Pool creation. Then, if you need to change
the Rx offset one day, you will adjust only that definition.

(also nit re newline before return?)

> +	return page_address(page);
> +}
>  
>  static netdev_tx_t
>  vmxnet3_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> @@ -1404,6 +1488,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
>  	struct Vmxnet3_RxDesc rxCmdDesc;
>  	struct Vmxnet3_RxCompDesc rxComp;
>  #endif
> +	bool need_flush = 0;

= false, it's boolean, not int.

> +
>  	vmxnet3_getRxComp(rcd, &rq->comp_ring.base[rq->comp_ring.next2proc].rcd,
>  			  &rxComp);
>  	while (rcd->gen == rq->comp_ring.gen) {

[...]

> @@ -1622,6 +1754,7 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
>  		}
>  
>  
> +sop_done:
>  		skb = ctx->skb;
>  		if (rcd->eop) {
>  			u32 mtu = adapter->netdev->mtu;
> @@ -1730,6 +1863,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
>  		vmxnet3_getRxComp(rcd,
>  				  &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
>  	}
> +	if (need_flush)
> +		xdp_do_flush();

What about %XDP_TX? On each %XDP_TX we usually only place the frame to a
Tx ring and hit the doorbell to kick Tx only here, before xdp_do_flush().

>  
>  	return num_pkts;
>  }
> @@ -1755,13 +1890,20 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
>  				&rq->rx_ring[ring_idx].base[i].rxd, &rxDesc);
>  
>  			if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
> -					rq->buf_info[ring_idx][i].skb) {
> +			    rq->buf_info[ring_idx][i].pp_page &&
> +			    rq->buf_info[ring_idx][i].buf_type ==
> +			    VMXNET3_RX_BUF_XDP) {
> +				page_pool_recycle_direct(rq->page_pool,
> +							 rq->buf_info[ring_idx][i].pp_page);
> +				rq->buf_info[ring_idx][i].pp_page = NULL;
> +			} else if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
> +				   rq->buf_info[ring_idx][i].skb) {
>  				dma_unmap_single(&adapter->pdev->dev, rxd->addr,
>  						 rxd->len, DMA_FROM_DEVICE);
>  				dev_kfree_skb(rq->buf_info[ring_idx][i].skb);
>  				rq->buf_info[ring_idx][i].skb = NULL;
>  			} else if (rxd->btype == VMXNET3_RXD_BTYPE_BODY &&
> -					rq->buf_info[ring_idx][i].page) {
> +				   rq->buf_info[ring_idx][i].page) {
>  				dma_unmap_page(&adapter->pdev->dev, rxd->addr,
>  					       rxd->len, DMA_FROM_DEVICE);
>  				put_page(rq->buf_info[ring_idx][i].page);
> @@ -1786,9 +1928,9 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
>  
>  	for (i = 0; i < adapter->num_rx_queues; i++)
>  		vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
> +	rcu_assign_pointer(adapter->xdp_bpf_prog, NULL);
>  }
>  
> -

(nit: also unrelated)

>  static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
>  			       struct vmxnet3_adapter *adapter)
>  {
> @@ -1815,6 +1957,13 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
>  		}
>  	}
>  
> +	if (rq->page_pool) {

Isn't it always true? You always create a Page Pool per each RQ IIUC?

> +		if (xdp_rxq_info_is_reg(&rq->xdp_rxq))
> +			xdp_rxq_info_unreg(&rq->xdp_rxq);
> +		page_pool_destroy(rq->page_pool);
> +		rq->page_pool = NULL;
> +	}
> +
>  	if (rq->data_ring.base) {
>  		dma_free_coherent(&adapter->pdev->dev,
>  				  rq->rx_ring[0].size * rq->data_ring.desc_size,

[...]

> -static int
> +int
>  vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
>  {
>  	int i, err = 0;
> @@ -2585,7 +2742,7 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
>  	if (adapter->netdev->features & NETIF_F_RXCSUM)
>  		devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
>  
> -	if (adapter->netdev->features & NETIF_F_LRO) {
> +	if ((adapter->netdev->features & NETIF_F_LRO)) {

Unneeded change (moreover, Clang sometimes triggers on such on W=1+)

>  		devRead->misc.uptFeatures |= UPT1_F_LRO;
>  		devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
>  	}
> @@ -3026,7 +3183,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
>  }
>  
>  
> -static void
> +void
>  vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
>  {
>  	size_t sz, i, ring0_size, ring1_size, comp_size;
> @@ -3035,7 +3192,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
>  		if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
>  					    VMXNET3_MAX_ETH_HDR_SIZE) {
>  			adapter->skb_buf_size = adapter->netdev->mtu +
> -						VMXNET3_MAX_ETH_HDR_SIZE;
> +						VMXNET3_MAX_ETH_HDR_SIZE +
> +						vmxnet3_xdp_headroom(adapter);
>  			if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
>  				adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
>  
> @@ -3563,7 +3721,6 @@ vmxnet3_reset_work(struct work_struct *data)
>  	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
>  }
>  
> -

(unrelated)

>  static int
>  vmxnet3_probe_device(struct pci_dev *pdev,
>  		     const struct pci_device_id *id)

[...]

>  enum vmxnet3_rx_buf_type {
>  	VMXNET3_RX_BUF_NONE = 0,
>  	VMXNET3_RX_BUF_SKB = 1,
> -	VMXNET3_RX_BUF_PAGE = 2
> +	VMXNET3_RX_BUF_PAGE = 2,
> +	VMXNET3_RX_BUF_XDP = 3

I'd always leave a ',' after the last entry. As you can see, if you
don't do that, you have to introduce 2 lines of changes instead of just
1 when you add a new entry.

>  };
>  
>  #define VMXNET3_RXD_COMP_PENDING        0
> @@ -271,6 +279,7 @@ struct vmxnet3_rx_buf_info {
>  	union {
>  		struct sk_buff *skb;
>  		struct page    *page;
> +		struct page    *pp_page; /* Page Pool for XDP frame */

Why not just use the already existing field if they're of the same type?

>  	};
>  	dma_addr_t dma_addr;
>  };

[...]

> +static int
> +vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
> +		struct netlink_ext_ack *extack)
> +{
> +	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +	struct bpf_prog *new_bpf_prog = bpf->prog;
> +	struct bpf_prog *old_bpf_prog;
> +	bool need_update;
> +	bool running;
> +	int err = 0;
> +
> +	if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {

Mismatch: as I said before, %VMXNET3_XDP_MAX_MTU is not MTU, rather max
frame len. At the same time, netdev->mtu is real MTU, which doesn't
include Eth, VLAN and FCS.

> +		NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");

Any plans to add XDP multi-buffer support?

> +		return -EOPNOTSUPP;
> +	}
> +
> +	if ((adapter->netdev->features & NETIF_F_LRO)) {

(redundant braces)

> +		netdev_err(adapter->netdev, "LRO is not supported with XDP");

Why is this error printed via netdev_err(), not NL_SET()?

> +		adapter->netdev->features &= ~NETIF_F_LRO;
> +	}
> +
> +	old_bpf_prog = rcu_dereference(adapter->xdp_bpf_prog);
> +	if (!new_bpf_prog && !old_bpf_prog)
> +		return 0;
> +
> +	running = netif_running(netdev);
> +	need_update = !!old_bpf_prog != !!new_bpf_prog;
> +
> +	if (running && need_update)
> +		vmxnet3_quiesce_dev(adapter);
> +
> +	vmxnet3_xdp_exchange_program(adapter, new_bpf_prog);
> +	if (old_bpf_prog)
> +		bpf_prog_put(old_bpf_prog);
> +
> +	if (!running || !need_update)
> +		return 0;
> +
> +	vmxnet3_reset_dev(adapter);
> +	vmxnet3_rq_destroy_all(adapter);
> +	vmxnet3_adjust_rx_ring_size(adapter);
> +	err = vmxnet3_rq_create_all(adapter);
> +	if (err) {
> +		NL_SET_ERR_MSG_MOD(extack,
> +				   "failed to re-create rx queues for XDP.");
> +		err = -EOPNOTSUPP;
> +		return err;

return -EOPNOTSUPP? Why do it in two steps?

> +	}
> +	err = vmxnet3_activate_dev(adapter);
> +	if (err) {
> +		NL_SET_ERR_MSG_MOD(extack,
> +				   "failed to activate device for XDP.");
> +		err = -EOPNOTSUPP;
> +		return err;

(same)

> +	}
> +	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);

(classic newline nit)

> +	return err;

@err will be 0 at this point, return it directly.

> +}
> +
> +/* This is the main xdp call used by kernel to set/unset eBPF program. */
> +int
> +vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf)
> +{
> +	switch (bpf->command) {
> +	case XDP_SETUP_PROG:
> +		return vmxnet3_xdp_set(netdev, bpf, bpf->extack);
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +bool
> +vmxnet3_xdp_enabled(struct vmxnet3_adapter *adapter)
> +{
> +	return !!rcu_access_pointer(adapter->xdp_bpf_prog);
> +}
> +
> +int
> +vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter)
> +{
> +	return vmxnet3_xdp_enabled(adapter) ? VMXNET3_XDP_PAD : 0;

Uhm, the function is called '_headroom', but in fact it returns the skb
overhead (headroom + tailroom).
Also, I don't feel like it's correct to return 0 as the skb overhead, as
you unconditionally set the PP offset to %XDP_PACKET_HEADROOM, plus the skb
tailroom is always `SKB_DATA_ALIGN(sizeof(struct skb_shared_info))` regardless
of XDP prog presence. So I'd rather always return _XDP_PAD (or just
embed this definition into the single call site).

> +}

Making these 2 functions global is overkill and doesn't affect
performance positively. They can easily be static inlines.
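Roughly something like this in the header should do (untested sketch):

static inline bool
vmxnet3_xdp_enabled(struct vmxnet3_adapter *adapter)
{
	return !!rcu_access_pointer(adapter->xdp_bpf_prog);
}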

> +
> +static int
> +vmxnet3_xdp_xmit_frame(struct vmxnet3_adapter *adapter,
> +		       struct xdp_frame *xdpf,
> +		       struct vmxnet3_tx_queue *tq, bool dma_map)
> +{
> +	struct vmxnet3_tx_buf_info *tbi = NULL;
> +	union Vmxnet3_GenericDesc *gdesc;
> +	struct vmxnet3_tx_ctx ctx;
> +	int tx_num_deferred;
> +	struct page *page;
> +	u32 buf_size;
> +	int ret = 0;
> +	u32 dw2;

[...]

> +	dma_wmb();
> +	gdesc->dword[2] = cpu_to_le32(le32_to_cpu(gdesc->dword[2]) ^
> +						  VMXNET3_TXD_GEN);
> +
> +	if (tx_num_deferred >= le32_to_cpu(tq->shared->txThreshold)) {
> +		tq->shared->txNumDeferred = 0;
> +		VMXNET3_WRITE_BAR0_REG(adapter,
> +				       VMXNET3_REG_TXPROD + tq->qid * 8,
> +				       tq->tx_ring.next2fill);
> +	}

(NL)

> +	return ret;
> +}
> +
> +static int
> +vmxnet3_xdp_xmit_back(struct vmxnet3_adapter *adapter,
> +		      struct xdp_frame *xdpf)
> +{
> +	struct vmxnet3_tx_queue *tq;
> +	struct netdev_queue *nq;
> +	int err = 0, cpu;
> +	int tq_number;
> +
> +	tq_number = adapter->num_tx_queues;
> +	cpu = smp_processor_id();
> +	if (likely(cpu < tq_number))
> +		tq = &adapter->tx_queue[cpu];
> +	else
> +		tq = &adapter->tx_queue[reciprocal_scale(cpu, tq_number)];

Interesting solution, it's the first time I've seen such an approach. Usually
we do just `smp_processor_id() % num_tx_queues`. I'm not saying yours is worse,
just a sidenote :)

> +	if (tq->stopped)
> +		return -ENETDOWN;
> +
> +	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
> +
> +	__netif_tx_lock(nq, cpu);
> +	err = vmxnet3_xdp_xmit_frame(adapter, xdpf, tq, false);
> +	__netif_tx_unlock(nq);
> +	return err;
> +}
> +
> +/* ndo_xdp_xmit */
> +int
> +vmxnet3_xdp_xmit(struct net_device *dev,
> +		 int n, struct xdp_frame **frames, u32 flags)
> +{
> +	struct vmxnet3_adapter *adapter;
> +	struct vmxnet3_tx_queue *tq;
> +	struct netdev_queue *nq;
> +	int i, err, cpu;
> +	int tq_number;
> +
> +	adapter = netdev_priv(dev);

Nit: embed into the declaration?

> +
> +	if (unlikely(test_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)))
> +		return -ENETDOWN;
> +	if (unlikely(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state)))
> +		return -EINVAL;
> +
> +	tq_number = adapter->num_tx_queues;
> +	cpu = smp_processor_id();
> +	if (likely(cpu < tq_number))
> +		tq = &adapter->tx_queue[cpu];
> +	else
> +		tq = &adapter->tx_queue[reciprocal_scale(cpu, tq_number)];
> +	if (tq->stopped)
> +		return -ENETDOWN;
> +
> +	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
> +
> +	for (i = 0; i < n; i++) {
> +		err = vmxnet3_xdp_xmit_frame(adapter, frames[i], tq, true);
> +		if (err) {
> +			tq->stats.xdp_xmit_err++;
> +			break;
> +		}
> +	}
> +	tq->stats.xdp_xmit += i;
> +
> +	return i;
> +}
> +
> +static int
> +vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp)
> +{
> +	struct xdp_frame *xdpf;
> +	struct bpf_prog *prog;
> +	int err;
> +	u32 act;
> +
> +	prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
> +	act = bpf_prog_run_xdp(prog, xdp);
> +	rq->stats.xdp_packets++;
> +
> +	switch (act) {
> +	case XDP_PASS:
> +		return act;
> +	case XDP_REDIRECT:
> +		err = xdp_do_redirect(rq->adapter->netdev, xdp, prog);
> +		if (!err)
> +			rq->stats.xdp_redirects++;
> +		else
> +			rq->stats.xdp_drops++;
> +		return act;
> +	case XDP_TX:
> +		xdpf = xdp_convert_buff_to_frame(xdp);
> +		if (!xdpf || vmxnet3_xdp_xmit_back(rq->adapter, xdpf)) {

I think this also could be unlikely()?

> +			rq->stats.xdp_drops++;
> +			page_pool_recycle_direct(rq->page_pool,
> +				 virt_to_head_page(xdp->data_hard_start));

Uff, I don't like this line break. Maybe grab the page into a local var
first and then pass it to the function?
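I mean something like (untested):

	struct page *page = virt_to_head_page(xdp->data_hard_start);

	rq->stats.xdp_drops++;
	page_pool_recycle_direct(rq->page_pool, page);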

> +		} else {
> +			rq->stats.xdp_tx++;
> +		}
> +		return act;
> +	default:
> +		bpf_warn_invalid_xdp_action(rq->adapter->netdev, prog, act);
> +		fallthrough;
> +	case XDP_ABORTED:
> +		trace_xdp_exception(rq->adapter->netdev, prog, act);
> +		rq->stats.xdp_aborted++;
> +		break;
> +	case XDP_DROP:
> +		rq->stats.xdp_drops++;
> +		break;
> +	}
> +
> +	page_pool_recycle_direct(rq->page_pool,
> +				 virt_to_head_page(xdp->data_hard_start));
> +	return act;
> +}
> +
> +static struct sk_buff *
> +vmxnet3_build_skb(struct vmxnet3_rx_queue *rq, struct page *page,
> +		  const struct xdp_buff *xdp)
> +{
> +	struct sk_buff *skb;
> +
> +	skb = build_skb(page_address(page), PAGE_SIZE);
> +	if (unlikely(!skb)) {
> +		page_pool_recycle_direct(rq->page_pool, page);
> +		rq->stats.rx_buf_alloc_failure++;
> +		return NULL;
> +	}
> +
> +	/* bpf prog might change len and data position. */
> +	skb_reserve(skb, xdp->data - xdp->data_hard_start);
> +	skb_put(skb, xdp->data_end - xdp->data);
> +	skb_mark_for_recycle(skb);
> +
> +	return skb;
> +}
> +
> +/* Handle packets from DataRing. */
> +int
> +vmxnet3_process_xdp_small(struct vmxnet3_adapter *adapter,
> +			  struct vmxnet3_rx_queue *rq,
> +			  void *data, int len,
> +			  struct sk_buff **skb_xdp_pass)
> +{
> +	struct bpf_prog *xdp_prog;
> +	struct xdp_buff xdp;
> +	struct page *page;
> +	int act;
> +
> +	page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
> +	if (!page) {

(unlikely nit)

> +		rq->stats.rx_buf_alloc_failure++;
> +		return XDP_DROP;
> +	}
> +
> +	xdp_init_buff(&xdp, PAGE_SIZE, &rq->xdp_rxq);
> +	xdp_prepare_buff(&xdp, page_address(page), XDP_PACKET_HEADROOM,
> +			 len, false);
> +	xdp_buff_clear_frags_flag(&xdp);
> +
> +	/* Must copy the data because it's at dataring. */
> +	memcpy(xdp.data, data, len);

Wanted to write "oh, too bad we have to copy the data" and only then
noticed your explanation that dataring is used for frames < 128 bytes
only :D

> +
> +	rcu_read_lock();
> +	xdp_prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
> +	if (!xdp_prog) {
> +		rcu_read_unlock();
> +		page_pool_recycle_direct(rq->page_pool, page);
> +		act = XDP_PASS;
> +		goto out_skb;
> +	}
> +	act = vmxnet3_run_xdp(rq, &xdp);
> +	rcu_read_unlock();
> +
> +out_skb:

Nit: you can move this label one line below and skip the always-true branch
when !xdp_prog.

> +	if (act == XDP_PASS) {
> +		*skb_xdp_pass = vmxnet3_build_skb(rq, page, &xdp);
> +		if (!skb_xdp_pass)
> +			return XDP_DROP;
> +	}

[...]

> +#include "vmxnet3_int.h"
> +
> +#define VMXNET3_XDP_PAD (SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + \
> +			 XDP_PACKET_HEADROOM)
> +#define VMXNET3_XDP_MAX_MTU (PAGE_SIZE - VMXNET3_XDP_PAD)

Uhm, I couldn't say it's valid to name it "MTU"; it's rather the max frame
size. MTU doesn't include the Ethernet/VLAN headers or the FCS.
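E.g. renaming it to something like VMXNET3_XDP_MAX_FRAME_SIZE (the name is just
a suggestion) would be less confusing:

#define VMXNET3_XDP_MAX_FRAME_SIZE	(PAGE_SIZE - VMXNET3_XDP_PAD)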

> +
> +int vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf);
> +bool vmxnet3_xdp_enabled(struct vmxnet3_adapter *adapter);
> +int vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter);
> +int vmxnet3_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> +		     u32 flags);
> +int vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
> +			struct vmxnet3_rx_queue *rq,
> +			struct Vmxnet3_RxCompDesc *rcd,
> +			struct vmxnet3_rx_buf_info *rbi,
> +			struct Vmxnet3_RxDesc *rxd,
> +			struct sk_buff **skb_xdp_pass);
> +int vmxnet3_process_xdp_small(struct vmxnet3_adapter *adapter,
> +			      struct vmxnet3_rx_queue *rq,
> +			      void *data, int len,
> +			      struct sk_buff **skb_xdp_pass);
> +void *vmxnet3_pp_get_buff(struct page_pool *pp, dma_addr_t *dma_addr,
> +			  gfp_t gfp_mask);
> +#endif
Thanks,
Olek

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH net-next v13] vmxnet3: Add XDP support.
  2023-01-18 14:17 ` Alexander Lobakin
@ 2023-01-21 18:29   ` William Tu
  2023-01-24 10:47     ` Alexander Lobakin
  0 siblings, 1 reply; 6+ messages in thread
From: William Tu @ 2023-01-21 18:29 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: netdev, jsankararama, gyang, doshir, alexander.duyck, gerhard, bang

On Wed, Jan 18, 2023 at 6:17 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: William Tu <u9012063@gmail.com>
> Date: Thu, 12 Jan 2023 06:07:43 -0800
>
> First of all, sorry for the huge delay. I have a couple ten thousand
> lines of code to review this week, not counting LKML submissions >_<
>
Hi Alexander,

I totally understand, and thanks again for taking another look at this patch.
Your feedback is very much appreciated.


> > The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
>
> [...]
>
> > -     skb = tq->buf_info[eop_idx].skb;
> > -     BUG_ON(skb == NULL);
> > -     tq->buf_info[eop_idx].skb = NULL;
> > -
> > +     tbi = &tq->buf_info[eop_idx];
> > +     BUG_ON(tbi->skb == NULL);
> > +     map_type = tbi->map_type;
> >       VMXNET3_INC_RING_IDX_ONLY(eop_idx, tq->tx_ring.size);
> >
> >       while (tq->tx_ring.next2comp != eop_idx) {
> >               vmxnet3_unmap_tx_buf(tq->buf_info + tq->tx_ring.next2comp,
> >                                    pdev);
> > -
>
> Nit: please try to avoid such unrelated changes. Moreover, I feel like
> it's better for readability to have a newline here, so see no reason to
> remove it.

Thanks! There are many mistakes like this below (newlines either missing or
unneeded, RCT style, adding unlikely(), unneeded changes);
I will fix them in the next version.

>
> >               /* update next2comp w/o tx_lock. Since we are marking more,
> >                * instead of less, tx ring entries avail, the worst case is
> >                * that the tx routine incorrectly re-queues a pkt due to
> > @@ -369,7 +371,14 @@ vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
> >               entries++;
> >       }
> >
> > -     dev_kfree_skb_any(skb);
> > +     if (map_type & VMXNET3_MAP_XDP)
> > +             xdp_return_frame_bulk(tbi->xdpf, bq);
> > +     else
> > +             dev_kfree_skb_any(tbi->skb);
>
> Not really related to XDP, but maybe for some future improvements: this
> function is to be called inside the BH context only, so using
> napi_consume_skb() would give you some nice perf improvement.

Sure, thanks for pointing this out. I will keep this on the future work list in
the commit message.

>
> > +
> > +     /* xdpf and skb are in an anonymous union. */
> > +     tbi->skb = NULL;
> > +
> >       return entries;
> >  }
> >
> > @@ -379,8 +388,10 @@ vmxnet3_tq_tx_complete(struct vmxnet3_tx_queue *tq,
> >                       struct vmxnet3_adapter *adapter)
> >  {
> >       int completed = 0;
> > +     struct xdp_frame_bulk bq;
> >       union Vmxnet3_GenericDesc *gdesc;
>
> RCT style of declarations?
>
> >
> > +     xdp_frame_bulk_init(&bq);
> >       gdesc = tq->comp_ring.base + tq->comp_ring.next2proc;
> >       while (VMXNET3_TCD_GET_GEN(&gdesc->tcd) == tq->comp_ring.gen) {
> >               /* Prevent any &gdesc->tcd field from being (speculatively)
>
> [...]
>
> > @@ -1253,6 +1283,60 @@ vmxnet3_tq_xmit(struct sk_buff *skb, struct vmxnet3_tx_queue *tq,
> >       return NETDEV_TX_OK;
> >  }
> >
> > +static int
> > +vmxnet3_create_pp(struct vmxnet3_adapter *adapter,
> > +               struct vmxnet3_rx_queue *rq, int size)
> > +{
> > +     const struct page_pool_params pp_params = {
> > +             .order = 0,
>
> Nit: it will be zeroed implicitly, so can be omitted. OTOH if you want
> to explicitly say that you always use order-0 pages only, you can leave
> it here.
I will leave it here as it's more clear.

>
> > +             .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> > +             .pool_size = size,
> > +             .nid = NUMA_NO_NODE,
> > +             .dev = &adapter->pdev->dev,
> > +             .offset = XDP_PACKET_HEADROOM,
>
> Curious, on which architectures does this driver work in the real world?
> Is it x86 only or maybe 64-bit systems only? Because not having
> %NET_IP_ALIGN here will significantly slow down Rx on the systems where
> it's defined as 2, not 0 (those systems can't stand unaligned access).

Interesting, I didn't know that until I researched it a little.
I tested only on x86 and 64-bit, but in reality vmxnet3 works on other
architectures as well.
I will define something like below:
   #define VMXNET3_XDP_HEADROOM       (XDP_PACKET_HEADROOM + NET_IP_ALIGN)

>
> > +             .max_len = VMXNET3_XDP_MAX_MTU,
> > +             .dma_dir = DMA_BIDIRECTIONAL,
> > +     };
> > +     struct page_pool *pp;
> > +     int err;
> > +
> > +     pp = page_pool_create(&pp_params);
> > +     if (IS_ERR(pp))
> > +             return PTR_ERR(pp);
> > +
> > +     err = xdp_rxq_info_reg(&rq->xdp_rxq, adapter->netdev, rq->qid,
> > +                            rq->napi.napi_id);
> > +     if (err < 0)
> > +             goto err_free_pp;
> > +
> > +     err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_POOL, pp);
> > +     if (err)
> > +             goto err_unregister_rxq;
> > +
> > +     rq->page_pool = pp;
>
> Nit: newline here?
>
> > +     return 0;
> > +
> > +err_unregister_rxq:
> > +     xdp_rxq_info_unreg(&rq->xdp_rxq);
> > +err_free_pp:
> > +     page_pool_destroy(pp);
> > +
> > +     return err;
> > +}
> > +
> > +void *
> > +vmxnet3_pp_get_buff(struct page_pool *pp, dma_addr_t *dma_addr,
> > +                 gfp_t gfp_mask)
> > +{
> > +     struct page *page;
> > +
> > +     page = page_pool_alloc_pages(pp, gfp_mask | __GFP_NOWARN);
> > +     if (!page)
>
> unlikely()? It's error/exception path, you will never hit this branch
> under normal conditions.
>
> > +             return NULL;
> > +
> > +     *dma_addr = page_pool_get_dma_addr(page) + XDP_PACKET_HEADROOM;
>
> Hmm, I'd rather say:
>
>         *dma_addr = page_pool_get_dma_addr(page) + pp->p.offset;
>
> Then you'd need to adujst offset only once in the function where you
> create Page Pool if/when you need to change the Rx offset.
> With the current code, it's easy to forget you need to change it in two
> places.
> Alternatively, you could define something like
>
> #define VMXNET3_RX_OFFSET       XDP_PACKET_HEADROOM
Got it, thanks.
Also, taking the NET_IP_ALIGN suggestion into account, I will do
#define VMXNET3_RX_OFFSET       (XDP_PACKET_HEADROOM + NET_IP_ALIGN)

>
> and use it here and on Page Pool creation. Then, if you need to change
> the Rx offset one day, you will adjust only that definition.
>
> (also nit re newline before return?)
>
> > +     return page_address(page);
> > +}
> >
> >  static netdev_tx_t
> >  vmxnet3_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> > @@ -1404,6 +1488,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> >       struct Vmxnet3_RxDesc rxCmdDesc;
> >       struct Vmxnet3_RxCompDesc rxComp;
> >  #endif
> > +     bool need_flush = 0;
>
> = false, it's boolean, not int.
>
> > +
> >       vmxnet3_getRxComp(rcd, &rq->comp_ring.base[rq->comp_ring.next2proc].rcd,
> >                         &rxComp);
> >       while (rcd->gen == rq->comp_ring.gen) {
>
> [...]
>
> > @@ -1622,6 +1754,7 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> >               }
> >
> >
> > +sop_done:
> >               skb = ctx->skb;
> >               if (rcd->eop) {
> >                       u32 mtu = adapter->netdev->mtu;
> > @@ -1730,6 +1863,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> >               vmxnet3_getRxComp(rcd,
> >                                 &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
> >       }
> > +     if (need_flush)
> > +             xdp_do_flush();
>
> What about %XDP_TX? On each %XDP_TX we usually only place the frame to a
> Tx ring and hit the doorbell to kick Tx only here, before xdp_do_flush().

I think it's ok here. For XDP_TX, we place the frame on the tx ring and wait
until a threshold (%tq->shared->txThreshold) is reached, then hit the doorbell.

>
> >
> >       return num_pkts;
> >  }
> > @@ -1755,13 +1890,20 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
> >                               &rq->rx_ring[ring_idx].base[i].rxd, &rxDesc);
> >
> >                       if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
> > -                                     rq->buf_info[ring_idx][i].skb) {
> > +                         rq->buf_info[ring_idx][i].pp_page &&
> > +                         rq->buf_info[ring_idx][i].buf_type ==
> > +                         VMXNET3_RX_BUF_XDP) {
> > +                             page_pool_recycle_direct(rq->page_pool,
> > +                                                      rq->buf_info[ring_idx][i].pp_page);
> > +                             rq->buf_info[ring_idx][i].pp_page = NULL;
> > +                     } else if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
> > +                                rq->buf_info[ring_idx][i].skb) {
> >                               dma_unmap_single(&adapter->pdev->dev, rxd->addr,
> >                                                rxd->len, DMA_FROM_DEVICE);
> >                               dev_kfree_skb(rq->buf_info[ring_idx][i].skb);
> >                               rq->buf_info[ring_idx][i].skb = NULL;
> >                       } else if (rxd->btype == VMXNET3_RXD_BTYPE_BODY &&
> > -                                     rq->buf_info[ring_idx][i].page) {
> > +                                rq->buf_info[ring_idx][i].page) {
> >                               dma_unmap_page(&adapter->pdev->dev, rxd->addr,
> >                                              rxd->len, DMA_FROM_DEVICE);
> >                               put_page(rq->buf_info[ring_idx][i].page);
> > @@ -1786,9 +1928,9 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
> >
> >       for (i = 0; i < adapter->num_rx_queues; i++)
> >               vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
> > +     rcu_assign_pointer(adapter->xdp_bpf_prog, NULL);
> >  }
> >
> > -
>
> (nit: also unrelated)
>
> >  static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> >                              struct vmxnet3_adapter *adapter)
> >  {
> > @@ -1815,6 +1957,13 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> >               }
> >       }
> >
> > +     if (rq->page_pool) {
>
> Isn't it always true? You always create a Page Pool per each RQ IIUC?
good catch, will remove the check.

>
> > +             if (xdp_rxq_info_is_reg(&rq->xdp_rxq))
> > +                     xdp_rxq_info_unreg(&rq->xdp_rxq);
> > +             page_pool_destroy(rq->page_pool);
> > +             rq->page_pool = NULL;
> > +     }
> > +
> >       if (rq->data_ring.base) {
> >               dma_free_coherent(&adapter->pdev->dev,
> >                                 rq->rx_ring[0].size * rq->data_ring.desc_size,
>
> [...]
>
> > -static int
> > +int
> >  vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
> >  {
> >       int i, err = 0;
> > @@ -2585,7 +2742,7 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
> >       if (adapter->netdev->features & NETIF_F_RXCSUM)
> >               devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
> >
> > -     if (adapter->netdev->features & NETIF_F_LRO) {
> > +     if ((adapter->netdev->features & NETIF_F_LRO)) {
>
> Unneeded change (moreover, Clang sometimes triggers on such on W=1+)
>
> >               devRead->misc.uptFeatures |= UPT1_F_LRO;
> >               devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
> >       }
> > @@ -3026,7 +3183,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
> >  }
> >
> >
> > -static void
> > +void
> >  vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
> >  {
> >       size_t sz, i, ring0_size, ring1_size, comp_size;
> > @@ -3035,7 +3192,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
> >               if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
> >                                           VMXNET3_MAX_ETH_HDR_SIZE) {
> >                       adapter->skb_buf_size = adapter->netdev->mtu +
> > -                                             VMXNET3_MAX_ETH_HDR_SIZE;
> > +                                             VMXNET3_MAX_ETH_HDR_SIZE +
> > +                                             vmxnet3_xdp_headroom(adapter);
> >                       if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
> >                               adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
> >
> > @@ -3563,7 +3721,6 @@ vmxnet3_reset_work(struct work_struct *data)
> >       clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
> >  }
> >
> > -
>
> (unrelated)
>
> >  static int
> >  vmxnet3_probe_device(struct pci_dev *pdev,
> >                    const struct pci_device_id *id)
>
> [...]
>
> >  enum vmxnet3_rx_buf_type {
> >       VMXNET3_RX_BUF_NONE = 0,
> >       VMXNET3_RX_BUF_SKB = 1,
> > -     VMXNET3_RX_BUF_PAGE = 2
> > +     VMXNET3_RX_BUF_PAGE = 2,
> > +     VMXNET3_RX_BUF_XDP = 3
>
> I'd always leave a ',' after the last entry. As you can see, if you
> don't do that, you have to introduce 2 lines of changes instead of just
> 1 when you add a new entry.
Thanks, that's a good point. Will do it!

>
> >  };
> >
> >  #define VMXNET3_RXD_COMP_PENDING        0
> > @@ -271,6 +279,7 @@ struct vmxnet3_rx_buf_info {
> >       union {
> >               struct sk_buff *skb;
> >               struct page    *page;
> > +             struct page    *pp_page; /* Page Pool for XDP frame */
>
> Why not just use the already existing field if they're of the same type?

I guess in the beginning I wanted to avoid mixing the two cases/rings.
I will use just %page as you suggest.

>
> >       };
> >       dma_addr_t dma_addr;
> >  };
>
> [...]
>
> > +static int
> > +vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
> > +             struct netlink_ext_ack *extack)
> > +{
> > +     struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> > +     struct bpf_prog *new_bpf_prog = bpf->prog;
> > +     struct bpf_prog *old_bpf_prog;
> > +     bool need_update;
> > +     bool running;
> > +     int err = 0;
> > +
> > +     if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
>
> Mismatch: as I said before, %VMXNET3_XDP_MAX_MTU is not MTU, rather max
> frame len. At the same time, netdev->mtu is real MTU, which doesn't
> include Eth, VLAN and FCS.

Thanks!
So I should include the hardware header length, and define something like
#define VMXNET3_RX_OFFSET      (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
#define VMXNET3_XDP_MAX_MTU    (PAGE_SIZE - VMXNET3_RX_OFFSET - \
                                dev->hard_header_len)

>
> > +             NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
>
> Any plans to add XDP multi-buffer support?
>
> > +             return -EOPNOTSUPP;
> > +     }
> > +
> > +     if ((adapter->netdev->features & NETIF_F_LRO)) {
>
> (redundant braces)
>
> > +             netdev_err(adapter->netdev, "LRO is not supported with XDP");
>
> Why is this error printed via netdev_err(), not NL_SET()?

I want to show the error message in dmesg, and I didn't see it
printed when using NL_SET.
Is it better to use NL_SET?

>
> > +             adapter->netdev->features &= ~NETIF_F_LRO;
> > +     }
> > +
> > +     old_bpf_prog = rcu_dereference(adapter->xdp_bpf_prog);
> > +     if (!new_bpf_prog && !old_bpf_prog)
> > +             return 0;
> > +
> > +     running = netif_running(netdev);
> > +     need_update = !!old_bpf_prog != !!new_bpf_prog;
> > +
> > +     if (running && need_update)
> > +             vmxnet3_quiesce_dev(adapter);
> > +
> > +     vmxnet3_xdp_exchange_program(adapter, new_bpf_prog);
> > +     if (old_bpf_prog)
> > +             bpf_prog_put(old_bpf_prog);
> > +
> > +     if (!running || !need_update)
> > +             return 0;
> > +
> > +     vmxnet3_reset_dev(adapter);
> > +     vmxnet3_rq_destroy_all(adapter);
> > +     vmxnet3_adjust_rx_ring_size(adapter);
> > +     err = vmxnet3_rq_create_all(adapter);
> > +     if (err) {
> > +             NL_SET_ERR_MSG_MOD(extack,
> > +                                "failed to re-create rx queues for XDP.");
> > +             err = -EOPNOTSUPP;
> > +             return err;
>
> return -OPNOTSUPP? Why doing it in two steps?
>
> > +     }
> > +     err = vmxnet3_activate_dev(adapter);
> > +     if (err) {
> > +             NL_SET_ERR_MSG_MOD(extack,
> > +                                "failed to activate device for XDP.");
> > +             err = -EOPNOTSUPP;
> > +             return err;
>
> (same)
>
> > +     }
> > +     clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
>
> (classic newline nit)
>
> > +     return err;
>
> @err will be 0 at this point, return it directly.
>
> > +}
> > +
> > +/* This is the main xdp call used by kernel to set/unset eBPF program. */
> > +int
> > +vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf)
> > +{
> > +     switch (bpf->command) {
> > +     case XDP_SETUP_PROG:
> > +             return vmxnet3_xdp_set(netdev, bpf, bpf->extack);
> > +     default:
> > +             return -EINVAL;
> > +     }
> > +     return 0;
> > +}
> > +
> > +bool
> > +vmxnet3_xdp_enabled(struct vmxnet3_adapter *adapter)
> > +{
> > +     return !!rcu_access_pointer(adapter->xdp_bpf_prog);
> > +}
> > +
> > +int
> > +vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter)
> > +{
> > +     return vmxnet3_xdp_enabled(adapter) ? VMXNET3_XDP_PAD : 0;
>
> Uhm, the function is called '_headroom', but in fact it returns skb
> overhead (headroom + tailroom).
> Also, I don't feel like it's incorrect to return 0 as skb overhead, as
> you unconditionally set PP offset to %XDP_PACKET_HEADROOM, plus skb
> tailroom is always `SKB_DATA_ALIGN(sizeof(skb_shared_info))` regardless
> of XDP prog presence. So I'd rather always return _XDP_PAD (or just
> embed this definition into the single call site).
>
> > +}
>
> Making these 2 functions global are overkill and doesn't affect
> performance positively. They can easily be static inlines.

Right, I realize vmxnet3_xdp_headroom() is used in only one place,
so I will embed it into the code,
and I will make vmxnet3_xdp_enabled() a static inline.
Thanks!

>
> > +
> > +static int
> > +vmxnet3_xdp_xmit_frame(struct vmxnet3_adapter *adapter,
> > +                    struct xdp_frame *xdpf,
> > +                    struct vmxnet3_tx_queue *tq, bool dma_map)
> > +{
> > +     struct vmxnet3_tx_buf_info *tbi = NULL;
> > +     union Vmxnet3_GenericDesc *gdesc;
> > +     struct vmxnet3_tx_ctx ctx;
> > +     int tx_num_deferred;
> > +     struct page *page;
> > +     u32 buf_size;
> > +     int ret = 0;
> > +     u32 dw2;
>
> [...]
>
> > +     dma_wmb();
> > +     gdesc->dword[2] = cpu_to_le32(le32_to_cpu(gdesc->dword[2]) ^
> > +                                               VMXNET3_TXD_GEN);
> > +
> > +     if (tx_num_deferred >= le32_to_cpu(tq->shared->txThreshold)) {
> > +             tq->shared->txNumDeferred = 0;
> > +             VMXNET3_WRITE_BAR0_REG(adapter,
> > +                                    VMXNET3_REG_TXPROD + tq->qid * 8,
> > +                                    tq->tx_ring.next2fill);
> > +     }
>
> (NL))
>
> > +     return ret;
> > +}
> > +
> > +static int
> > +vmxnet3_xdp_xmit_back(struct vmxnet3_adapter *adapter,
> > +                   struct xdp_frame *xdpf)
> > +{
> > +     struct vmxnet3_tx_queue *tq;
> > +     struct netdev_queue *nq;
> > +     int err = 0, cpu;
> > +     int tq_number;
> > +
> > +     tq_number = adapter->num_tx_queues;
> > +     cpu = smp_processor_id();
> > +     if (likely(cpu < tq_number))
> > +             tq = &adapter->tx_queue[cpu];
> > +     else
> > +             tq = &adapter->tx_queue[reciprocal_scale(cpu, tq_number)];
>
> Interesting solution, the first time I see such. Usually we do just
> `smp_processor_id() % num_tx_queues`. I don't say yours is worse, just a
> sidenote :)

Ha, yes, I learned this from Alexander Duyck's feedback.

>
> > +     if (tq->stopped)
> > +             return -ENETDOWN;
> > +
> > +     nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
> > +
> > +     __netif_tx_lock(nq, cpu);
> > +     err = vmxnet3_xdp_xmit_frame(adapter, xdpf, tq, false);
> > +     __netif_tx_unlock(nq);
> > +     return err;
> > +}
> > +
> > +/* ndo_xdp_xmit */
> > +int
> > +vmxnet3_xdp_xmit(struct net_device *dev,
> > +              int n, struct xdp_frame **frames, u32 flags)
> > +{
> > +     struct vmxnet3_adapter *adapter;
> > +     struct vmxnet3_tx_queue *tq;
> > +     struct netdev_queue *nq;
> > +     int i, err, cpu;
> > +     int tq_number;
> > +
> > +     adapter = netdev_priv(dev);
>
> Nit: embed into the declaration?
>
> > +
> > +     if (unlikely(test_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)))
> > +             return -ENETDOWN;
> > +     if (unlikely(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state)))
> > +             return -EINVAL;
> > +
> > +     tq_number = adapter->num_tx_queues;
> > +     cpu = smp_processor_id();
> > +     if (likely(cpu < tq_number))
> > +             tq = &adapter->tx_queue[cpu];
> > +     else
> > +             tq = &adapter->tx_queue[reciprocal_scale(cpu, tq_number)];
> > +     if (tq->stopped)
> > +             return -ENETDOWN;
> > +
> > +     nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
> > +
> > +     for (i = 0; i < n; i++) {
> > +             err = vmxnet3_xdp_xmit_frame(adapter, frames[i], tq, true);
> > +             if (err) {
> > +                     tq->stats.xdp_xmit_err++;
> > +                     break;
> > +             }
> > +     }
> > +     tq->stats.xdp_xmit += i;
> > +
> > +     return i;
> > +}
> > +
> > +static int
> > +vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp)
> > +{
> > +     struct xdp_frame *xdpf;
> > +     struct bpf_prog *prog;
> > +     int err;
> > +     u32 act;
> > +
> > +     prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
> > +     act = bpf_prog_run_xdp(prog, xdp);
> > +     rq->stats.xdp_packets++;
> > +
> > +     switch (act) {
> > +     case XDP_PASS:
> > +             return act;
> > +     case XDP_REDIRECT:
> > +             err = xdp_do_redirect(rq->adapter->netdev, xdp, prog);
> > +             if (!err)
> > +                     rq->stats.xdp_redirects++;
> > +             else
> > +                     rq->stats.xdp_drops++;
> > +             return act;
> > +     case XDP_TX:
> > +             xdpf = xdp_convert_buff_to_frame(xdp);
> > +             if (!xdpf || vmxnet3_xdp_xmit_back(rq->adapter, xdpf)) {
>
> I think this also could be unlikely()?
>
> > +                     rq->stats.xdp_drops++;
> > +                     page_pool_recycle_direct(rq->page_pool,
> > +                              virt_to_head_page(xdp->data_hard_start));
>
> Uff, I don't like this line break. Maybe grab the page into a local var
> at first and then pass it to the function?
OK!

>
> > +             } else {
> > +                     rq->stats.xdp_tx++;
> > +             }
> > +             return act;
> > +     default:
> > +             bpf_warn_invalid_xdp_action(rq->adapter->netdev, prog, act);
> > +             fallthrough;
> > +     case XDP_ABORTED:
> > +             trace_xdp_exception(rq->adapter->netdev, prog, act);
> > +             rq->stats.xdp_aborted++;
> > +             break;
> > +     case XDP_DROP:
> > +             rq->stats.xdp_drops++;
> > +             break;
> > +     }
> > +
> > +     page_pool_recycle_direct(rq->page_pool,
> > +                              virt_to_head_page(xdp->data_hard_start));
> > +     return act;
> > +}
> > +
> > +static struct sk_buff *
> > +vmxnet3_build_skb(struct vmxnet3_rx_queue *rq, struct page *page,
> > +               const struct xdp_buff *xdp)
> > +{
> > +     struct sk_buff *skb;
> > +
> > +     skb = build_skb(page_address(page), PAGE_SIZE);
> > +     if (unlikely(!skb)) {
> > +             page_pool_recycle_direct(rq->page_pool, page);
> > +             rq->stats.rx_buf_alloc_failure++;
> > +             return NULL;
> > +     }
> > +
> > +     /* bpf prog might change len and data position. */
> > +     skb_reserve(skb, xdp->data - xdp->data_hard_start);
> > +     skb_put(skb, xdp->data_end - xdp->data);
> > +     skb_mark_for_recycle(skb);
> > +
> > +     return skb;
> > +}
> > +
> > +/* Handle packets from DataRing. */
> > +int
> > +vmxnet3_process_xdp_small(struct vmxnet3_adapter *adapter,
> > +                       struct vmxnet3_rx_queue *rq,
> > +                       void *data, int len,
> > +                       struct sk_buff **skb_xdp_pass)
> > +{
> > +     struct bpf_prog *xdp_prog;
> > +     struct xdp_buff xdp;
> > +     struct page *page;
> > +     int act;
> > +
> > +     page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
> > +     if (!page) {
>
> (unlikely nit)
>
> > +             rq->stats.rx_buf_alloc_failure++;
> > +             return XDP_DROP;
> > +     }
> > +
> > +     xdp_init_buff(&xdp, PAGE_SIZE, &rq->xdp_rxq);
> > +     xdp_prepare_buff(&xdp, page_address(page), XDP_PACKET_HEADROOM,
> > +                      len, false);
> > +     xdp_buff_clear_frags_flag(&xdp);
> > +
> > +     /* Must copy the data because it's at dataring. */
> > +     memcpy(xdp.data, data, len);
>
> Wanted to write "oh, too bad we have to copy the data" and only then
> noticed your explanation that dataring is used for frames < 128 bytes
> only :D

:D

>
> > +
> > +     rcu_read_lock();
> > +     xdp_prog = rcu_dereference(rq->adapter->xdp_bpf_prog);
> > +     if (!xdp_prog) {
> > +             rcu_read_unlock();
> > +             page_pool_recycle_direct(rq->page_pool, page);
> > +             act = XDP_PASS;
> > +             goto out_skb;
> > +     }
> > +     act = vmxnet3_run_xdp(rq, &xdp);
> > +     rcu_read_unlock();
> > +
> > +out_skb:
>
> Nit: you can move this label one line below and skip always-true branch
> when !xdp_prog.
>
> > +     if (act == XDP_PASS) {
> > +             *skb_xdp_pass = vmxnet3_build_skb(rq, page, &xdp);
> > +             if (!skb_xdp_pass)
> > +                     return XDP_DROP;
> > +     }
>
> [...]
>
> > +#include "vmxnet3_int.h"
> > +
> > +#define VMXNET3_XDP_PAD (SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + \
> > +                      XDP_PACKET_HEADROOM)
> > +#define VMXNET3_XDP_MAX_MTU (PAGE_SIZE - VMXNET3_XDP_PAD)
>
> Uhm, couldn't say it's valid to name it as "MTU", it's rather max frame
> size. MTU doesn't include Ethernet, VLAN headers and FCS.
Got it, will rename it.

Working on next version now...
Thanks
William

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH net-next v13] vmxnet3: Add XDP support.
  2023-01-21 18:29   ` William Tu
@ 2023-01-24 10:47     ` Alexander Lobakin
  2023-01-27 13:14       ` William Tu
  0 siblings, 1 reply; 6+ messages in thread
From: Alexander Lobakin @ 2023-01-24 10:47 UTC (permalink / raw)
  To: William Tu
  Cc: netdev, jsankararama, gyang, doshir, alexander.duyck, gerhard, bang

From: William Tu <u9012063@gmail.com>
Date: Sat, 21 Jan 2023 10:29:28 -0800

> On Wed, Jan 18, 2023 at 6:17 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
>>
>> From: William Tu <u9012063@gmail.com>
>> Date: Thu, 12 Jan 2023 06:07:43 -0800

[...]

>>> +static int
>>> +vmxnet3_create_pp(struct vmxnet3_adapter *adapter,
>>> +               struct vmxnet3_rx_queue *rq, int size)
>>> +{
>>> +     const struct page_pool_params pp_params = {
>>> +             .order = 0,
>>
>> Nit: it will be zeroed implicitly, so can be omitted. OTOH if you want
>> to explicitly say that you always use order-0 pages only, you can leave
>> it here.
> I will leave it here as it's more clear.

+

> 
>>
>>> +             .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>> +             .pool_size = size,
>>> +             .nid = NUMA_NO_NODE,
>>> +             .dev = &adapter->pdev->dev,
>>> +             .offset = XDP_PACKET_HEADROOM,
>>
>> Curious, on which architectures does this driver work in the real world?
>> Is it x86 only or maybe 64-bit systems only? Because not having
>> %NET_IP_ALIGN here will significantly slow down Rx on the systems where
>> it's defined as 2, not 0 (those systems can't stand unaligned access).

[...]

>>> @@ -1730,6 +1863,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
>>>               vmxnet3_getRxComp(rcd,
>>>                                 &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
>>>       }
>>> +     if (need_flush)
>>> +             xdp_do_flush();
>>
>> What about %XDP_TX? On each %XDP_TX we usually only place the frame to a
>> Tx ring and hit the doorbell to kick Tx only here, before xdp_do_flush().
> 
> I think it's ok here. For XDP_TX, we place the frame to tx ring and wait until
> a threshold (%tq->shared->txThreshold), then hit the doorbell.

What if it didn't reach the threshold and the NAPI poll cycle is
finished? It will stay on the ring without hitting the doorbell?

> 
>>
>>>
>>>       return num_pkts;
>>>  }
>>> @@ -1755,13 +1890,20 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
>>>                               &rq->rx_ring[ring_idx].base[i].rxd, &rxDesc);
>>>
>>>                       if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
>>> -                                     rq->buf_info[ring_idx][i].skb) {
>>> +                         rq->buf_info[ring_idx][i].pp_page &&
>>> +                         rq->buf_info[ring_idx][i].buf_type ==
>>> +                         VMXNET3_RX_BUF_XDP) {
>>> +                             page_pool_recycle_direct(rq->page_pool,
>>> +                                                      rq->buf_info[ring_idx][i].pp_page);
>>> +                             rq->buf_info[ring_idx][i].pp_page = NULL;
>>> +                     } else if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
>>> +                                rq->buf_info[ring_idx][i].skb) {
>>>                               dma_unmap_single(&adapter->pdev->dev, rxd->addr,
>>>                                                rxd->len, DMA_FROM_DEVICE);
>>>                               dev_kfree_skb(rq->buf_info[ring_idx][i].skb);
>>>                               rq->buf_info[ring_idx][i].skb = NULL;
>>>                       } else if (rxd->btype == VMXNET3_RXD_BTYPE_BODY &&
>>> -                                     rq->buf_info[ring_idx][i].page) {
>>> +                                rq->buf_info[ring_idx][i].page) {
>>>                               dma_unmap_page(&adapter->pdev->dev, rxd->addr,
>>>                                              rxd->len, DMA_FROM_DEVICE);
>>>                               put_page(rq->buf_info[ring_idx][i].page);
>>> @@ -1786,9 +1928,9 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
>>>
>>>       for (i = 0; i < adapter->num_rx_queues; i++)
>>>               vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
>>> +     rcu_assign_pointer(adapter->xdp_bpf_prog, NULL);
>>>  }
>>>
>>> -
>>
>> (nit: also unrelated)
>>
>>>  static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
>>>                              struct vmxnet3_adapter *adapter)
>>>  {
>>> @@ -1815,6 +1957,13 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
>>>               }
>>>       }
>>>
>>> +     if (rq->page_pool) {
>>
>> Isn't it always true? You always create a Page Pool per each RQ IIUC?
> good catch, will remove the check.
> 
>>
>>> +             if (xdp_rxq_info_is_reg(&rq->xdp_rxq))
>>> +                     xdp_rxq_info_unreg(&rq->xdp_rxq);
>>> +             page_pool_destroy(rq->page_pool);
>>> +             rq->page_pool = NULL;
>>> +     }
>>> +
>>>       if (rq->data_ring.base) {
>>>               dma_free_coherent(&adapter->pdev->dev,
>>>                                 rq->rx_ring[0].size * rq->data_ring.desc_size,
>>
>> [...]
>>
>>> -static int
>>> +int
>>>  vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
>>>  {
>>>       int i, err = 0;
>>> @@ -2585,7 +2742,7 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
>>>       if (adapter->netdev->features & NETIF_F_RXCSUM)
>>>               devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
>>>
>>> -     if (adapter->netdev->features & NETIF_F_LRO) {
>>> +     if ((adapter->netdev->features & NETIF_F_LRO)) {
>>
>> Unneeded change (moreover, Clang sometimes triggers on such on W=1+)
>>
>>>               devRead->misc.uptFeatures |= UPT1_F_LRO;
>>>               devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
>>>       }
>>> @@ -3026,7 +3183,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
>>>  }
>>>
>>>
>>> -static void
>>> +void
>>>  vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
>>>  {
>>>       size_t sz, i, ring0_size, ring1_size, comp_size;
>>> @@ -3035,7 +3192,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
>>>               if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
>>>                                           VMXNET3_MAX_ETH_HDR_SIZE) {
>>>                       adapter->skb_buf_size = adapter->netdev->mtu +
>>> -                                             VMXNET3_MAX_ETH_HDR_SIZE;
>>> +                                             VMXNET3_MAX_ETH_HDR_SIZE +
>>> +                                             vmxnet3_xdp_headroom(adapter);
>>>                       if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
>>>                               adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
>>>
>>> @@ -3563,7 +3721,6 @@ vmxnet3_reset_work(struct work_struct *data)
>>>       clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
>>>  }
>>>
>>> -
>>
>> (unrelated)
>>
>>>  static int
>>>  vmxnet3_probe_device(struct pci_dev *pdev,
>>>                    const struct pci_device_id *id)
>>
>> [...]
>>
>>>  enum vmxnet3_rx_buf_type {
>>>       VMXNET3_RX_BUF_NONE = 0,
>>>       VMXNET3_RX_BUF_SKB = 1,
>>> -     VMXNET3_RX_BUF_PAGE = 2
>>> +     VMXNET3_RX_BUF_PAGE = 2,
>>> +     VMXNET3_RX_BUF_XDP = 3
>>
>> I'd always leave a ',' after the last entry. As you can see, if you
>> don't do that, you have to introduce 2 lines of changes instead of just
>> 1 when you add a new entry.
> thanks, that's good point. Will do it, thanks!
> 
>>
>>>  };
>>>
>>>  #define VMXNET3_RXD_COMP_PENDING        0
>>> @@ -271,6 +279,7 @@ struct vmxnet3_rx_buf_info {
>>>       union {
>>>               struct sk_buff *skb;
>>>               struct page    *page;
>>> +             struct page    *pp_page; /* Page Pool for XDP frame */
>>
>> Why not just use the already existing field if they're of the same type?
> 
> I guess in the beginning I want to avoid mixing the two cases/rings.
> I will use just %page as you suggest.
> 
>>
>>>       };
>>>       dma_addr_t dma_addr;
>>>  };
>>
>> [...]
>>
>>> +static int
>>> +vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
>>> +             struct netlink_ext_ack *extack)
>>> +{
>>> +     struct vmxnet3_adapter *adapter = netdev_priv(netdev);
>>> +     struct bpf_prog *new_bpf_prog = bpf->prog;
>>> +     struct bpf_prog *old_bpf_prog;
>>> +     bool need_update;
>>> +     bool running;
>>> +     int err = 0;
>>> +
>>> +     if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
>>
>> Mismatch: as I said before, %VMXNET3_XDP_MAX_MTU is not MTU, rather max
>> frame len. At the same time, netdev->mtu is real MTU, which doesn't
>> include Eth, VLAN and FCS.
> 
> Thanks!
> So I should include the hardware header length, define s.t like
> #define VMXNET3_RX_OFFSET       (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
> #define VMXNET3_XDP_MAX_MTU    PAGE_SIZE - VMXNET3_RX_OFFSET -
> dev->hard_header_len

Hmm, since your netdev is always Ethernet, you can hardcode %ETH_HLEN.
OTOH don't forget to include 2 VLANs, FCS and tailroom:

#define VMXNET3_RX_OFFSET	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
#define VMXNET3_RX_TAILROOM	SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
#define VMXNET3_RX_MAX_FRAME	(PAGE_SIZE - VMXNET3_RX_OFFSET - \
				 VMXNET3_RX_TAILROOM)
#define VMXNET3_RX_MAX_MTU	(VMXNET3_RX_MAX_FRAME - ETH_HLEN - \
				 2 * VLAN_HLEN - ETH_FCS_LEN)

Then it will be your max MTU :)
dev->hard_header_len is also ok, but it's always %ETH_HLEN for Ethernet
devices.

> 
>>
>>> +             NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
>>
>> Any plans to add XDP multi-buffer support?
>>
>>> +             return -EOPNOTSUPP;
>>> +     }
>>> +
>>> +     if ((adapter->netdev->features & NETIF_F_LRO)) {
>>
>> (redundant braces)
>>
>>> +             netdev_err(adapter->netdev, "LRO is not supported with XDP");
>>
>> Why is this error printed via netdev_err(), not NL_SET()?
> 
> I want to show the error message in dmesg, which I didn't see it
> printed when using NL_SET
> is it better to use NL_SET?

When &netlink_ext_ack is available, it's better to use it instead.
Alternatively, you can print to both Netlink *and* the kernel log.
Printing to NL allows you to pass a message directly to userspace and
then the program you use to configure devices will print it. Otherwise,
the user will need to open up the kernel log.
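I.e. something along these lines (sketch only):

	NL_SET_ERR_MSG_MOD(extack, "LRO is not supported with XDP");
	netdev_err(adapter->netdev, "LRO is not supported with XDP\n");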

> 
>>
>>> +             adapter->netdev->features &= ~NETIF_F_LRO;
>>> +     }
>>> +
>>> +     old_bpf_prog = rcu_dereference(adapter->xdp_bpf_prog);
>>> +     if (!new_bpf_prog && !old_bpf_prog)
>>> +             return 0;

> Working on next version now...
> Thanks
> William

Thanks,
Olek

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH net-next v13] vmxnet3: Add XDP support.
  2023-01-24 10:47     ` Alexander Lobakin
@ 2023-01-27 13:14       ` William Tu
  0 siblings, 0 replies; 6+ messages in thread
From: William Tu @ 2023-01-27 13:14 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: netdev, jsankararama, gyang, doshir, alexander.duyck, gerhard, bang

Hi Alexander,
Thanks for your feedback.

On Tue, Jan 24, 2023 at 2:47 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: William Tu <u9012063@gmail.com>
> Date: Sat, 21 Jan 2023 10:29:28 -0800
>
> > On Wed, Jan 18, 2023 at 6:17 AM Alexander Lobakin
> > <alexandr.lobakin@intel.com> wrote:
> >>
> >> From: William Tu <u9012063@gmail.com>
> >> Date: Thu, 12 Jan 2023 06:07:43 -0800
>
> [...]
>
> >>> +static int
> >>> +vmxnet3_create_pp(struct vmxnet3_adapter *adapter,
> >>> +               struct vmxnet3_rx_queue *rq, int size)
> >>> +{
> >>> +     const struct page_pool_params pp_params = {
> >>> +             .order = 0,
> >>
> >> Nit: it will be zeroed implicitly, so can be omitted. OTOH if you want
> >> to explicitly say that you always use order-0 pages only, you can leave
> >> it here.
> > I will leave it here as it's more clear.
>
> +
>
> >
> >>
> >>> +             .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> >>> +             .pool_size = size,
> >>> +             .nid = NUMA_NO_NODE,
> >>> +             .dev = &adapter->pdev->dev,
> >>> +             .offset = XDP_PACKET_HEADROOM,
> >>
> >> Curious, on which architectures does this driver work in the real world?
> >> Is it x86 only or maybe 64-bit systems only? Because not having
> >> %NET_IP_ALIGN here will significantly slow down Rx on the systems where
> >> it's defined as 2, not 0 (those systems can't stand unaligned access).
>
> [...]
>
> >>> @@ -1730,6 +1863,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> >>>               vmxnet3_getRxComp(rcd,
> >>>                                 &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
> >>>       }
> >>> +     if (need_flush)
> >>> +             xdp_do_flush();
> >>
> >> What about %XDP_TX? On each %XDP_TX we usually only place the frame to a
> >> Tx ring and hit the doorbell to kick Tx only here, before xdp_do_flush().
> >
> > I think it's ok here. For XDP_TX, we place the frame to tx ring and wait until
> > a threshold (%tq->shared->txThreshold), then hit the doorbell.
>
> What if it didn't reach the threshold and the NAPI poll cycle is
> finished? It will stay on the ring without hitting the doorbell?

Good catch... Actually we don't need to do anything here.
The backend driver on the hypervisor side will poll packets from the tx queues,
transmit these packets, and reset %tq->shared->txThreshold to zero.
I should add a comment about this in the source code.
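Something like this as the comment (wording is just a draft):

	/* No need to ring the tx doorbell here for XDP_TX: the backend
	 * driver on the hypervisor side polls the tx queue, transmits the
	 * packets, and resets tq->shared->txThreshold on its own.
	 */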

>
> >
> >>
> >>>
> >>>       return num_pkts;
> >>>  }
> >>> @@ -1755,13 +1890,20 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
> >>>                               &rq->rx_ring[ring_idx].base[i].rxd, &rxDesc);
> >>>
> >>>                       if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
> >>> -                                     rq->buf_info[ring_idx][i].skb) {
> >>> +                         rq->buf_info[ring_idx][i].pp_page &&
> >>> +                         rq->buf_info[ring_idx][i].buf_type ==
> >>> +                         VMXNET3_RX_BUF_XDP) {
> >>> +                             page_pool_recycle_direct(rq->page_pool,
> >>> +                                                      rq->buf_info[ring_idx][i].pp_page);
> >>> +                             rq->buf_info[ring_idx][i].pp_page = NULL;
> >>> +                     } else if (rxd->btype == VMXNET3_RXD_BTYPE_HEAD &&
> >>> +                                rq->buf_info[ring_idx][i].skb) {
> >>>                               dma_unmap_single(&adapter->pdev->dev, rxd->addr,
> >>>                                                rxd->len, DMA_FROM_DEVICE);
> >>>                               dev_kfree_skb(rq->buf_info[ring_idx][i].skb);
> >>>                               rq->buf_info[ring_idx][i].skb = NULL;
> >>>                       } else if (rxd->btype == VMXNET3_RXD_BTYPE_BODY &&
> >>> -                                     rq->buf_info[ring_idx][i].page) {
> >>> +                                rq->buf_info[ring_idx][i].page) {
> >>>                               dma_unmap_page(&adapter->pdev->dev, rxd->addr,
> >>>                                              rxd->len, DMA_FROM_DEVICE);
> >>>                               put_page(rq->buf_info[ring_idx][i].page);
> >>> @@ -1786,9 +1928,9 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
> >>>
> >>>       for (i = 0; i < adapter->num_rx_queues; i++)
> >>>               vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
> >>> +     rcu_assign_pointer(adapter->xdp_bpf_prog, NULL);
> >>>  }
> >>>
> >>> -
> >>
> >> (nit: also unrelated)
> >>
> >>>  static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> >>>                              struct vmxnet3_adapter *adapter)
> >>>  {
> >>> @@ -1815,6 +1957,13 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> >>>               }
> >>>       }
> >>>
> >>> +     if (rq->page_pool) {
> >>
> >> Isn't it always true? You always create a Page Pool per each RQ IIUC?
> > good catch, will remove the check.
> >
> >>
> >>> +             if (xdp_rxq_info_is_reg(&rq->xdp_rxq))
> >>> +                     xdp_rxq_info_unreg(&rq->xdp_rxq);
> >>> +             page_pool_destroy(rq->page_pool);
> >>> +             rq->page_pool = NULL;
> >>> +     }
> >>> +
> >>>       if (rq->data_ring.base) {
> >>>               dma_free_coherent(&adapter->pdev->dev,
> >>>                                 rq->rx_ring[0].size * rq->data_ring.desc_size,
> >>
> >> [...]
> >>
> >>> -static int
> >>> +int
> >>>  vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
> >>>  {
> >>>       int i, err = 0;
> >>> @@ -2585,7 +2742,7 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
> >>>       if (adapter->netdev->features & NETIF_F_RXCSUM)
> >>>               devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
> >>>
> >>> -     if (adapter->netdev->features & NETIF_F_LRO) {
> >>> +     if ((adapter->netdev->features & NETIF_F_LRO)) {
> >>
> >> Unneeded change (moreover, Clang sometimes triggers on such on W=1+)
> >>
> >>>               devRead->misc.uptFeatures |= UPT1_F_LRO;
> >>>               devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
> >>>       }
> >>> @@ -3026,7 +3183,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
> >>>  }
> >>>
> >>>
> >>> -static void
> >>> +void
> >>>  vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
> >>>  {
> >>>       size_t sz, i, ring0_size, ring1_size, comp_size;
> >>> @@ -3035,7 +3192,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
> >>>               if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
> >>>                                           VMXNET3_MAX_ETH_HDR_SIZE) {
> >>>                       adapter->skb_buf_size = adapter->netdev->mtu +
> >>> -                                             VMXNET3_MAX_ETH_HDR_SIZE;
> >>> +                                             VMXNET3_MAX_ETH_HDR_SIZE +
> >>> +                                             vmxnet3_xdp_headroom(adapter);
> >>>                       if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
> >>>                               adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
> >>>
> >>> @@ -3563,7 +3721,6 @@ vmxnet3_reset_work(struct work_struct *data)
> >>>       clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
> >>>  }
> >>>
> >>> -
> >>
> >> (unrelated)
> >>
> >>>  static int
> >>>  vmxnet3_probe_device(struct pci_dev *pdev,
> >>>                    const struct pci_device_id *id)
> >>
> >> [...]
> >>
> >>>  enum vmxnet3_rx_buf_type {
> >>>       VMXNET3_RX_BUF_NONE = 0,
> >>>       VMXNET3_RX_BUF_SKB = 1,
> >>> -     VMXNET3_RX_BUF_PAGE = 2
> >>> +     VMXNET3_RX_BUF_PAGE = 2,
> >>> +     VMXNET3_RX_BUF_XDP = 3
> >>
> >> I'd always leave a ',' after the last entry. As you can see, if you
> >> don't do that, you have to introduce 2 lines of changes instead of just
> >> 1 when you add a new entry.
> > thanks, that's good point. Will do it, thanks!
> >
> >>
> >>>  };
> >>>
> >>>  #define VMXNET3_RXD_COMP_PENDING        0
> >>> @@ -271,6 +279,7 @@ struct vmxnet3_rx_buf_info {
> >>>       union {
> >>>               struct sk_buff *skb;
> >>>               struct page    *page;
> >>> +             struct page    *pp_page; /* Page Pool for XDP frame */
> >>
> >> Why not just use the already existing field if they're of the same type?
> >
> > I guess in the beginning I want to avoid mixing the two cases/rings.
> > I will use just %page as you suggest.
> >
> >>
> >>>       };
> >>>       dma_addr_t dma_addr;
> >>>  };
> >>
> >> [...]
> >>
> >>> +static int
> >>> +vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
> >>> +             struct netlink_ext_ack *extack)
> >>> +{
> >>> +     struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> >>> +     struct bpf_prog *new_bpf_prog = bpf->prog;
> >>> +     struct bpf_prog *old_bpf_prog;
> >>> +     bool need_update;
> >>> +     bool running;
> >>> +     int err = 0;
> >>> +
> >>> +     if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
> >>
> >> Mismatch: as I said before, %VMXNET3_XDP_MAX_MTU is not MTU, rather max
> >> frame len. At the same time, netdev->mtu is real MTU, which doesn't
> >> include Eth, VLAN and FCS.
> >
> > Thanks!
> > So I should include the hardware header length, define s.t like
> > #define VMXNET3_RX_OFFSET       (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
> > #define VMXNET3_XDP_MAX_MTU    PAGE_SIZE - VMXNET3_RX_OFFSET -
> > dev->hard_header_len
>
> Hmm, since your netdev is always Ethernet, you can hardcode %ETH_HLEN.
> OTOH don't forget to include 2 VLANs, FCS and tailroom:
>
> #define VMXNET3_RX_OFFSET       (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
> #define VMXNET3_RX_TAILROOOM    SKB_DATA_ALIGN(sizeof(skb_shared_info))
> #define VMXNET3_RX_MAX_FRAME    (PAGE_SIZE - VMXNET3_RX_OFFSET - \
>                                  VMXNET3_RX_TAILROOM)
> #define VMXNET3_RX_MAX_MTU      (VMXNET3_RX_MAX_FRAME - ETH_HLEN - \
>                                  2 * VLAN_HLEN - ETH_FCS_LEN)
>
> Then it will be your max MTU :)
> dev->hard_header_len is also ok, but it's always %ETH_HLEN for Ethernet
> devices.

Thanks a lot :)

>
> >
> >>
> >>> +             NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
> >>
> >> Any plans to add XDP multi-buffer support?
> >>
> >>> +             return -EOPNOTSUPP;
> >>> +     }
> >>> +
> >>> +     if ((adapter->netdev->features & NETIF_F_LRO)) {
> >>
> >> (redundant braces)
> >>
> >>> +             netdev_err(adapter->netdev, "LRO is not supported with XDP");
> >>
> >> Why is this error printed via netdev_err(), not NL_SET()?
> >
> > I want to show the error message in dmesg, which I didn't see it
> > printed when using NL_SET
> > is it better to use NL_SET?
>
> When &netlink_ext_ack is available, it's better to use it instead.
> Alternatively, you can print to both Netlink *and* the kernel log.
> Printing to NL allows you to pass a message directly to userspace and
> then the program you use to configure devices will print it. Otherwise,
> the user will need to open up the kernel log.

Got it, thanks for explaining it.
Thanks
William

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-01-27 13:15 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-12 14:07 [RFC PATCH net-next v13] vmxnet3: Add XDP support William Tu
2023-01-13 15:43 ` Alexander Duyck
2023-01-18 14:17 ` Alexander Lobakin
2023-01-21 18:29   ` William Tu
2023-01-24 10:47     ` Alexander Lobakin
2023-01-27 13:14       ` William Tu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).