* [Intel-wired-lan] [PATCH v2 0/3] i40e: Support for XDP
@ 2016-12-09  8:22 Björn Töpel
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support " Björn Töpel
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Björn Töpel @ 2016-12-09  8:22 UTC (permalink / raw)
  To: intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This series adds XDP support for i40e-based NICs.

The first patch adds XDP_RX support, the second adds XDP_TX support,
and the last patch makes it possible to swap XDP programs without
rebuilding the rings.
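
For reference, the kind of program these patches host is plain
restricted C built for the BPF target. A minimal sketch that drops
every frame (not part of this series; the file and section names are
just loader conventions):

  /* xdp_drop.c - drop every frame.
   * Build, roughly: clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o
   */
  #include <linux/bpf.h>

  __attribute__((section("xdp"), used))
  int xdp_drop(struct xdp_md *ctx)
  {
  	return XDP_DROP;	/* every action ends up dropping in patch 1 */
  }

  char _license[] __attribute__((section("license"), used)) = "GPL";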

Thanks to Daniel for pointing out that checking for xdp_adjust_head
support has to be done in ndo_xdp.

v2:
  * Fixed kbuild error for PAGE_SIZE >= 8192.
  * Renamed i40e_try_flip_rx_page to i40e_can_reuse_rx_page, which is
    more in line with the other Intel Ethernet drivers (igb/fm10k).
  * Validate xdp_adjust_head support in ndo_xdp/XDP_SETUP_PROG.


Björn


Björn Töpel (3):
  i40e: Initial support for XDP
  i40e: Add XDP_TX support
  i40e: Don't reset/rebuild rings on XDP program swap

 drivers/net/ethernet/intel/i40e/i40e.h         |  18 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c    | 361 +++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 444 +++++++++++++++++++++----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   7 +
 5 files changed, 717 insertions(+), 116 deletions(-)

-- 
2.9.3



* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09  8:22 [Intel-wired-lan] [PATCH v2 0/3] i40e: Support for XDP Björn Töpel
@ 2016-12-09  8:22 ` Björn Töpel
  2016-12-09 11:56   ` kbuild test robot
  2016-12-09 17:15   ` Alexander Duyck
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 2/3] i40e: Add XDP_TX support Björn Töpel
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 3/3] i40e: Don't reset/rebuild rings on XDP program swap Björn Töpel
  2 siblings, 2 replies; 13+ messages in thread
From: Björn Töpel @ 2016-12-09  8:22 UTC (permalink / raw)
  To: intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This commit adds basic XDP support for i40e-derived NICs. At this
stage, all XDP actions result in the frame being dropped (XDP_DROP).

Only the default/main VSI has support for enabling XDP.
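
With this in place, a program is attached and detached via the usual
netlink XDP attribute, e.g. with a recent iproute2 (device, object
and section names are only examples; exact section handling varies
by iproute2 version):

  ip link set dev eth2 xdp obj xdp_drop.o sec xdp   # attach
  ip link set dev eth2 xdp off                      # detach

Both operations end up in the new ndo_xdp callback below.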

Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h         |  13 +++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c    |  77 +++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 146 ++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +
 5 files changed, 216 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index ba8d30984bee..05d805f439e6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -545,6 +545,8 @@ struct i40e_vsi {
 	struct i40e_ring **rx_rings;
 	struct i40e_ring **tx_rings;
 
+	struct bpf_prog *xdp_prog;
+
 	u32  active_filters;
 	u32  promisc_threshold;
 
@@ -904,4 +906,15 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
 i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
 i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
 void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
+
+/**
+ * i40e_enabled_xdp_vsi - Check if VSI has XDP enabled
+ * @vsi: pointer to a vsi
+ *
+ * Returns true if the VSI has XDP enabled.
+ **/
+static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
+{
+	return vsi->xdp_prog;
+}
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index cc1465aac2ef..831bbc208fc8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1254,6 +1254,9 @@ static int i40e_set_ringparam(struct net_device *netdev,
 	if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
 		return -EINVAL;
 
+	if (i40e_enabled_xdp_vsi(vsi))
+		return -EINVAL;
+
 	if (ring->tx_pending > I40E_MAX_NUM_DESCRIPTORS ||
 	    ring->tx_pending < I40E_MIN_NUM_DESCRIPTORS ||
 	    ring->rx_pending > I40E_MAX_NUM_DESCRIPTORS ||
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index da4cbe32eb86..b247c3231170 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -24,6 +24,7 @@
  *
  ******************************************************************************/
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/of_net.h>
 #include <linux/pci.h>
@@ -2431,6 +2432,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
 
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
+
+		if (max_frame > I40E_RXBUFFER_2048)
+			return -EINVAL;
+	}
+
 	netdev_info(netdev, "changing MTU from %d to %d\n",
 		    netdev->mtu, new_mtu);
 	netdev->mtu = new_mtu;
@@ -3085,6 +3093,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		struct bpf_prog *prog;
+
+		prog = bpf_prog_add(vsi->xdp_prog, 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+		ring->xdp_prog = prog;
+	}
+
 	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
@@ -9234,6 +9251,65 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
 	return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
 }
 
+/**
+ * i40e_xdp_setup - Add or remove an XDP program on a VSI
+ * @vsi: the VSI the program is added to
+ * @prog: the XDP program
+ **/
+static int i40e_xdp_setup(struct i40e_vsi *vsi,
+			  struct bpf_prog *prog)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct net_device *netdev = vsi->netdev;
+	int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
+
+	if (prog && prog->xdp_adjust_head)
+		return -EOPNOTSUPP;
+
+	if (frame_size > I40E_RXBUFFER_2048)
+		return -EINVAL;
+
+	if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
+		return -EINVAL;
+
+	if (!i40e_enabled_xdp_vsi(vsi) && !prog)
+		return 0;
+
+	i40e_prep_for_reset(pf);
+
+	if (vsi->xdp_prog)
+		bpf_prog_put(vsi->xdp_prog);
+	vsi->xdp_prog = prog;
+
+	i40e_reset_and_rebuild(pf, true);
+	return 0;
+}
+
+/**
+ * i40e_xdp - Implements ndo_xdp for XDP program setup/query
+ * @dev: the netdev
+ * @xdp: the XDP command
+ **/
+static int i40e_xdp(struct net_device *dev,
+		    struct netdev_xdp *xdp)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return i40e_xdp_setup(vsi, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_open		= i40e_open,
 	.ndo_stop		= i40e_close,
@@ -9270,6 +9346,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_features_check	= i40e_features_check,
 	.ndo_bridge_getlink	= i40e_ndo_bridge_getlink,
 	.ndo_bridge_setlink	= i40e_ndo_bridge_setlink,
+	.ndo_xdp                = i40e_xdp,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 352cf7cd2ef4..d835a51dafa6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -24,6 +24,7 @@
  *
  ******************************************************************************/
 
+#include <linux/bpf.h>
 #include <linux/prefetch.h>
 #include <net/busy_poll.h>
 #include "i40e.h"
@@ -1040,6 +1041,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	rx_ring->next_to_alloc = 0;
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
+
+	if (rx_ring->xdp_prog) {
+		bpf_prog_put(rx_ring->xdp_prog);
+		rx_ring->xdp_prog = NULL;
+	}
 }
 
 /**
@@ -1600,30 +1606,104 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
 }
 
 /**
+ * i40e_run_xdp - Runs an XDP program for an Rx ring
+ * @rx_ring: Rx ring used for XDP
+ * @rx_buffer: current Rx buffer
+ * @rx_desc: current Rx descriptor
+ * @xdp_prog: the XDP program to run
+ *
+ * Returns true if the XDP program consumed the incoming frame. False
+ * means pass the frame to the good old stack.
+ **/
+static bool i40e_run_xdp(struct i40e_ring *rx_ring,
+			 struct i40e_rx_buffer *rx_buffer,
+			 union i40e_rx_desc *rx_desc,
+			 struct bpf_prog *xdp_prog)
+{
+	u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+	unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+			    I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+	struct xdp_buff xdp;
+	u32 xdp_action;
+
+	WARN_ON(!i40e_test_staterr(rx_desc,
+				   BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
+
+	xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset;
+	xdp.data_end = xdp.data + size;
+	xdp_action = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	switch (xdp_action) {
+	case XDP_PASS:
+		return false;
+	default:
+		bpf_warn_invalid_xdp_action(xdp_action);
+	case XDP_ABORTED:
+	case XDP_TX:
+	case XDP_DROP:
+		if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
+			i40e_reuse_rx_page(rx_ring, rx_buffer);
+			rx_ring->rx_stats.page_reuse_count++;
+			break;
+		}
+
+		/* we are not reusing the buffer so unmap it */
+		dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
+			       DMA_FROM_DEVICE);
+		__free_pages(rx_buffer->page, 0);
+	}
+
+	/* clear contents of buffer_info */
+	rx_buffer->page = NULL;
+	return true; /* Swallowed by XDP */
+}
+
+/**
  * i40e_fetch_rx_buffer - Allocate skb and populate it
  * @rx_ring: rx descriptor ring to transact packets on
  * @rx_desc: descriptor containing info written by hardware
+ * @skb: The allocated skb, if any
  *
- * This function allocates an skb on the fly, and populates it with the page
- * data from the current receive descriptor, taking care to set up the skb
- * correctly, as well as handling calling the page recycle function if
- * necessary.
+ * Unless XDP is enabled, this function allocates an skb on the fly,
+ * and populates it with the page data from the current receive
+ * descriptor, taking care to set up the skb correctly, as well as
+ * handling calling the page recycle function if necessary.
+ *
+ * If the received frame was handled by XDP, true is
+ * returned. Otherwise, the skb is returned to the caller via the skb
+ * parameter.
  */
 static inline
-struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
-				     union i40e_rx_desc *rx_desc)
+bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
+			  union i40e_rx_desc *rx_desc,
+			  struct sk_buff **skb)
 {
 	struct i40e_rx_buffer *rx_buffer;
-	struct sk_buff *skb;
 	struct page *page;
 
 	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
 	page = rx_buffer->page;
 	prefetchw(page);
 
-	skb = rx_buffer->skb;
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma,
+				      rx_buffer->page_offset,
+				      I40E_RXBUFFER_2048,
+				      DMA_FROM_DEVICE);
+
+	if (rx_ring->xdp_prog) {
+		bool xdp_consumed;
+
+		xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
+					    rx_desc, rx_ring->xdp_prog);
+		if (xdp_consumed)
+			return true;
+	}
 
-	if (likely(!skb)) {
+	*skb = rx_buffer->skb;
+
+	if (likely(!*skb)) {
 		void *page_addr = page_address(page) + rx_buffer->page_offset;
 
 		/* prefetch first cache line of first page */
@@ -1633,32 +1713,25 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
 #endif
 
 		/* allocate a skb to store the frags */
-		skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
-				       I40E_RX_HDR_SIZE,
-				       GFP_ATOMIC | __GFP_NOWARN);
-		if (unlikely(!skb)) {
+		*skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
+					I40E_RX_HDR_SIZE,
+					GFP_ATOMIC | __GFP_NOWARN);
+		if (unlikely(!*skb)) {
 			rx_ring->rx_stats.alloc_buff_failed++;
-			return NULL;
+			return false;
 		}
 
 		/* we will be copying header into skb->data in
 		 * pskb_may_pull so it is in our interest to prefetch
 		 * it now to avoid a possible cache miss
 		 */
-		prefetchw(skb->data);
+		prefetchw((*skb)->data);
 	} else {
 		rx_buffer->skb = NULL;
 	}
 
-	/* we are reusing so sync this buffer for CPU use */
-	dma_sync_single_range_for_cpu(rx_ring->dev,
-				      rx_buffer->dma,
-				      rx_buffer->page_offset,
-				      I40E_RXBUFFER_2048,
-				      DMA_FROM_DEVICE);
-
 	/* pull page into skb */
-	if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
+	if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, *skb)) {
 		/* hand second half of page back to the ring */
 		i40e_reuse_rx_page(rx_ring, rx_buffer);
 		rx_ring->rx_stats.page_reuse_count++;
@@ -1671,7 +1744,7 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
 	/* clear contents of buffer_info */
 	rx_buffer->page = NULL;
 
-	return skb;
+	return false;
 }
 
 /**
@@ -1716,6 +1789,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 }
 
 /**
+ * i40e_update_rx_next_to_clean - Bumps the next-to-clean for an Rx ring
+ * @rx_ring: Rx ring to bump
+ **/
+static void i40e_update_rx_next_to_clean(struct i40e_ring *rx_ring)
+{
+	u32 ntc = rx_ring->next_to_clean + 1;
+
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+}
+
+/**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
  * @budget: Total limit on number of packets to process
@@ -1739,6 +1826,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		u16 vlan_tag;
 		u8 rx_ptype;
 		u64 qword;
+		bool xdp_consumed;
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
@@ -1764,7 +1852,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		 */
 		dma_rmb();
 
-		skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
+		xdp_consumed = i40e_fetch_rx_buffer(rx_ring, rx_desc, &skb);
+		if (xdp_consumed) {
+			cleaned_count++;
+
+			i40e_update_rx_next_to_clean(rx_ring);
+			total_rx_packets++;
+			continue;
+		}
+
 		if (!skb)
 			break;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index e065321ce8ed..957d856a82c4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -341,6 +341,8 @@ struct i40e_ring {
 
 	struct rcu_head rcu;		/* to avoid race on free */
 	u16 next_to_alloc;
+
+	struct bpf_prog *xdp_prog;
 } ____cacheline_internodealigned_in_smp;
 
 enum i40e_latency_range {
-- 
2.9.3



* [Intel-wired-lan] [PATCH v2 2/3] i40e: Add XDP_TX support
  2016-12-09  8:22 [Intel-wired-lan] [PATCH v2 0/3] i40e: Support for XDP Björn Töpel
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support " Björn Töpel
@ 2016-12-09  8:22 ` Björn Töpel
  2016-12-09 18:06   ` John Fastabend
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 3/3] i40e: Don't reset/rebuild rings on XDP program swap Björn Töpel
  2 siblings, 1 reply; 13+ messages in thread
From: Björn Töpel @ 2016-12-09  8:22 UTC (permalink / raw)
  To: intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This patch adds proper XDP_TX support.
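
For testing, the smallest program exercising this path simply
returns XDP_TX, which retransmits the frame out the port it arrived
on (a sketch only; a real program would normally rewrite headers
first, e.g. swap the Ethernet addresses):

  #include <linux/bpf.h>

  __attribute__((section("xdp"), used))
  int xdp_tx(struct xdp_md *ctx)
  {
  	return XDP_TX;	/* bounce the frame back out the same port */
  }

  char _license[] __attribute__((section("license"), used)) = "GPL";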

Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |   5 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 273 ++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 303 +++++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   5 +
 4 files changed, 490 insertions(+), 96 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 05d805f439e6..adc1f3f32729 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -545,6 +545,10 @@ struct i40e_vsi {
 	struct i40e_ring **rx_rings;
 	struct i40e_ring **tx_rings;
 
+	/* The XDP rings are Tx only, and follow the count of the
+	 * regular rings, i.e. alloc_queue_pairs/num_queue_pairs
+	 */
+	struct i40e_ring **xdp_rings;
 	struct bpf_prog *xdp_prog;
 
 	u32  active_filters;
@@ -622,6 +626,7 @@ struct i40e_q_vector {
 
 	struct i40e_ring_container rx;
 	struct i40e_ring_container tx;
+	struct i40e_ring_container xdp;
 
 	u8 num_ringpairs;	/* total number of ring pairs in vector */
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b247c3231170..37bb03914a64 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -107,6 +107,18 @@ MODULE_VERSION(DRV_VERSION);
 static struct workqueue_struct *i40e_wq;
 
 /**
+ * i40e_alloc_queue_pairs_xdp_vsi - required # of XDP queue pairs
+ * @vsi: pointer to a vsi
+ **/
+static u16 i40e_alloc_queue_pairs_xdp_vsi(const struct i40e_vsi *vsi)
+{
+	if (i40e_enabled_xdp_vsi(vsi))
+		return vsi->alloc_queue_pairs;
+
+	return 0;
+}
+
+/**
  * i40e_allocate_dma_mem_d - OS specific memory alloc for shared code
  * @hw:   pointer to the HW structure
  * @mem:  ptr to mem struct to fill out
@@ -2828,6 +2840,12 @@ static int i40e_vsi_setup_tx_resources(struct i40e_vsi *vsi)
 	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
 		err = i40e_setup_tx_descriptors(vsi->tx_rings[i]);
 
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return err;
+
+	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
+		err = i40e_setup_tx_descriptors(vsi->xdp_rings[i]);
+
 	return err;
 }
 
@@ -2841,12 +2859,17 @@ static void i40e_vsi_free_tx_resources(struct i40e_vsi *vsi)
 {
 	int i;
 
-	if (!vsi->tx_rings)
-		return;
+	if (vsi->tx_rings) {
+		for (i = 0; i < vsi->num_queue_pairs; i++)
+			if (vsi->tx_rings[i] && vsi->tx_rings[i]->desc)
+				i40e_free_tx_resources(vsi->tx_rings[i]);
+	}
 
-	for (i = 0; i < vsi->num_queue_pairs; i++)
-		if (vsi->tx_rings[i] && vsi->tx_rings[i]->desc)
-			i40e_free_tx_resources(vsi->tx_rings[i]);
+	if (vsi->xdp_rings) {
+		for (i = 0; i < vsi->num_queue_pairs; i++)
+			if (vsi->xdp_rings[i] && vsi->xdp_rings[i]->desc)
+				i40e_free_tx_resources(vsi->xdp_rings[i]);
+	}
 }
 
 /**
@@ -3121,6 +3144,12 @@ static int i40e_vsi_configure_tx(struct i40e_vsi *vsi)
 	for (i = 0; (i < vsi->num_queue_pairs) && !err; i++)
 		err = i40e_configure_tx_ring(vsi->tx_rings[i]);
 
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return err;
+
+	for (i = 0; (i < vsi->num_queue_pairs) && !err; i++)
+		err = i40e_configure_tx_ring(vsi->xdp_rings[i]);
+
 	return err;
 }
 
@@ -3269,7 +3298,7 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
 	struct i40e_hw *hw = &pf->hw;
 	u16 vector;
 	int i, q;
-	u32 qp;
+	u32 qp, qp_idx = 0;
 
 	/* The interrupt indexing is offset by 1 in the PFINT_ITRn
 	 * and PFINT_LNKLSTn registers, e.g.:
@@ -3296,16 +3325,33 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
 		wr32(hw, I40E_PFINT_LNKLSTN(vector - 1), qp);
 		for (q = 0; q < q_vector->num_ringpairs; q++) {
 			u32 val;
+			u32 nqp = qp;
+
+			if (i40e_enabled_xdp_vsi(vsi)) {
+				nqp = vsi->base_queue +
+				      vsi->xdp_rings[qp_idx]->queue_index;
+			}
 
 			val = I40E_QINT_RQCTL_CAUSE_ENA_MASK |
-			      (I40E_RX_ITR << I40E_QINT_RQCTL_ITR_INDX_SHIFT)  |
-			      (vector      << I40E_QINT_RQCTL_MSIX_INDX_SHIFT) |
-			      (qp          << I40E_QINT_RQCTL_NEXTQ_INDX_SHIFT)|
+			      (I40E_RX_ITR << I40E_QINT_RQCTL_ITR_INDX_SHIFT)   |
+			      (vector      << I40E_QINT_RQCTL_MSIX_INDX_SHIFT)  |
+			      (nqp         << I40E_QINT_RQCTL_NEXTQ_INDX_SHIFT) |
 			      (I40E_QUEUE_TYPE_TX
 				      << I40E_QINT_RQCTL_NEXTQ_TYPE_SHIFT);
 
 			wr32(hw, I40E_QINT_RQCTL(qp), val);
 
+			if (i40e_enabled_xdp_vsi(vsi)) {
+				val = I40E_QINT_TQCTL_CAUSE_ENA_MASK |
+				      (I40E_TX_ITR << I40E_QINT_TQCTL_ITR_INDX_SHIFT)   |
+				      (vector      << I40E_QINT_TQCTL_MSIX_INDX_SHIFT)  |
+				      (qp          << I40E_QINT_TQCTL_NEXTQ_INDX_SHIFT) |
+				      (I40E_QUEUE_TYPE_TX
+				       << I40E_QINT_TQCTL_NEXTQ_TYPE_SHIFT);
+
+				wr32(hw, I40E_QINT_TQCTL(nqp), val);
+			}
+
 			val = I40E_QINT_TQCTL_CAUSE_ENA_MASK |
 			      (I40E_TX_ITR << I40E_QINT_TQCTL_ITR_INDX_SHIFT)  |
 			      (vector      << I40E_QINT_TQCTL_MSIX_INDX_SHIFT) |
@@ -3320,6 +3366,7 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
 
 			wr32(hw, I40E_QINT_TQCTL(qp), val);
 			qp++;
+			qp_idx++;
 		}
 	}
 
@@ -3562,6 +3609,10 @@ static void i40e_vsi_disable_irq(struct i40e_vsi *vsi)
 	for (i = 0; i < vsi->num_queue_pairs; i++) {
 		wr32(hw, I40E_QINT_TQCTL(vsi->tx_rings[i]->reg_idx), 0);
 		wr32(hw, I40E_QINT_RQCTL(vsi->rx_rings[i]->reg_idx), 0);
+		if (i40e_enabled_xdp_vsi(vsi)) {
+			wr32(hw, I40E_QINT_TQCTL(vsi->xdp_rings[i]->reg_idx),
+			     0);
+		}
 	}
 
 	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
@@ -3871,6 +3922,24 @@ static void i40e_map_vector_to_qp(struct i40e_vsi *vsi, int v_idx, int qp_idx)
 }
 
 /**
+ * i40e_map_vector_to_xdp_ring - Assigns the XDP Tx queue to the vector
+ * @vsi: the VSI being configured
+ * @v_idx: vector index
+ * @xdp_idx: XDP Tx queue index
+ **/
+static void i40e_map_vector_to_xdp_ring(struct i40e_vsi *vsi, int v_idx,
+					int xdp_idx)
+{
+	struct i40e_q_vector *q_vector = vsi->q_vectors[v_idx];
+	struct i40e_ring *xdp_ring = vsi->xdp_rings[xdp_idx];
+
+	xdp_ring->q_vector = q_vector;
+	xdp_ring->next = q_vector->xdp.ring;
+	q_vector->xdp.ring = xdp_ring;
+	q_vector->xdp.count++;
+}
+
+/**
  * i40e_vsi_map_rings_to_vectors - Maps descriptor rings to vectors
  * @vsi: the VSI being configured
  *
@@ -3903,11 +3972,17 @@ static void i40e_vsi_map_rings_to_vectors(struct i40e_vsi *vsi)
 
 		q_vector->rx.count = 0;
 		q_vector->tx.count = 0;
+		q_vector->xdp.count = 0;
 		q_vector->rx.ring = NULL;
 		q_vector->tx.ring = NULL;
+		q_vector->xdp.ring = NULL;
 
 		while (num_ringpairs--) {
 			i40e_map_vector_to_qp(vsi, v_start, qp_idx);
+			if (i40e_enabled_xdp_vsi(vsi)) {
+				i40e_map_vector_to_xdp_ring(vsi, v_start,
+							    qp_idx);
+			}
 			qp_idx++;
 			qp_remaining--;
 		}
@@ -4001,56 +4076,82 @@ static int i40e_pf_txq_wait(struct i40e_pf *pf, int pf_q, bool enable)
 }
 
 /**
- * i40e_vsi_control_tx - Start or stop a VSI's rings
+ * i40e_vsi_control_txq - Start or stop a VSI's queue
  * @vsi: the VSI being configured
  * @enable: start or stop the rings
+ * @pf_q: the PF queue
  **/
-static int i40e_vsi_control_tx(struct i40e_vsi *vsi, bool enable)
+static int i40e_vsi_control_txq(struct i40e_vsi *vsi, bool enable, int pf_q)
 {
 	struct i40e_pf *pf = vsi->back;
 	struct i40e_hw *hw = &pf->hw;
-	int i, j, pf_q, ret = 0;
+	int j, ret = 0;
 	u32 tx_reg;
 
-	pf_q = vsi->base_queue;
-	for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
+	/* warn the TX unit of coming changes */
+	i40e_pre_tx_queue_cfg(&pf->hw, pf_q, enable);
+	if (!enable)
+		usleep_range(10, 20);
 
-		/* warn the TX unit of coming changes */
-		i40e_pre_tx_queue_cfg(&pf->hw, pf_q, enable);
-		if (!enable)
-			usleep_range(10, 20);
+	for (j = 0; j < 50; j++) {
+		tx_reg = rd32(hw, I40E_QTX_ENA(pf_q));
+		if (((tx_reg >> I40E_QTX_ENA_QENA_REQ_SHIFT) & 1) ==
+		    ((tx_reg >> I40E_QTX_ENA_QENA_STAT_SHIFT) & 1))
+			break;
+		usleep_range(1000, 2000);
+	}
+	/* Skip if the queue is already in the requested state */
+	if (enable == !!(tx_reg & I40E_QTX_ENA_QENA_STAT_MASK))
+		return 0;
 
-		for (j = 0; j < 50; j++) {
-			tx_reg = rd32(hw, I40E_QTX_ENA(pf_q));
-			if (((tx_reg >> I40E_QTX_ENA_QENA_REQ_SHIFT) & 1) ==
-			    ((tx_reg >> I40E_QTX_ENA_QENA_STAT_SHIFT) & 1))
-				break;
-			usleep_range(1000, 2000);
-		}
-		/* Skip if the queue is already in the requested state */
-		if (enable == !!(tx_reg & I40E_QTX_ENA_QENA_STAT_MASK))
-			continue;
+	/* turn on/off the queue */
+	if (enable) {
+		wr32(hw, I40E_QTX_HEAD(pf_q), 0);
+		tx_reg |= I40E_QTX_ENA_QENA_REQ_MASK;
+	} else {
+		tx_reg &= ~I40E_QTX_ENA_QENA_REQ_MASK;
+	}
 
-		/* turn on/off the queue */
-		if (enable) {
-			wr32(hw, I40E_QTX_HEAD(pf_q), 0);
-			tx_reg |= I40E_QTX_ENA_QENA_REQ_MASK;
-		} else {
-			tx_reg &= ~I40E_QTX_ENA_QENA_REQ_MASK;
-		}
+	wr32(hw, I40E_QTX_ENA(pf_q), tx_reg);
+	/* No waiting for the Tx queue to disable */
+	if (!enable && test_bit(__I40E_PORT_TX_SUSPENDED, &pf->state))
+		return 0;
 
-		wr32(hw, I40E_QTX_ENA(pf_q), tx_reg);
-		/* No waiting for the Tx queue to disable */
-		if (!enable && test_bit(__I40E_PORT_TX_SUSPENDED, &pf->state))
-			continue;
+	/* wait for the change to finish */
+	ret = i40e_pf_txq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+	return 0;
+}
 
-		/* wait for the change to finish */
-		ret = i40e_pf_txq_wait(pf, pf_q, enable);
-		if (ret) {
-			dev_info(&pf->pdev->dev,
-				 "VSI seid %d Tx ring %d %sable timeout\n",
-				 vsi->seid, pf_q, (enable ? "en" : "dis"));
+/**
+ * i40e_vsi_control_tx - Start or stop a VSI's rings
+ * @vsi: the VSI being configured
+ * @enable: start or stop the rings
+ **/
+static int i40e_vsi_control_tx(struct i40e_vsi *vsi, bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+	int i, pf_q, ret = 0;
+
+	pf_q = vsi->base_queue;
+	for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
+		ret = i40e_vsi_control_txq(vsi, enable, pf_q);
+		if (ret)
 			break;
+	}
+
+	if (!ret && i40e_enabled_xdp_vsi(vsi)) {
+		for (i = 0; i < vsi->num_queue_pairs; i++) {
+			pf_q = vsi->base_queue + vsi->xdp_rings[i]->queue_index;
+			ret = i40e_vsi_control_txq(vsi, enable, pf_q);
+			if (ret)
+				break;
 		}
 	}
 
@@ -4311,6 +4412,9 @@ static void i40e_free_q_vector(struct i40e_vsi *vsi, int v_idx)
 	i40e_for_each_ring(ring, q_vector->rx)
 		ring->q_vector = NULL;
 
+	i40e_for_each_ring(ring, q_vector->xdp)
+		ring->q_vector = NULL;
+
 	/* only VSI w/ an associated netdev is set up w/ NAPI */
 	if (vsi->netdev)
 		netif_napi_del(&q_vector->napi);
@@ -4534,6 +4638,21 @@ static int i40e_vsi_wait_queues_disabled(struct i40e_vsi *vsi)
 		}
 	}
 
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return 0;
+
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		pf_q = vsi->base_queue + vsi->xdp_rings[i]->queue_index;
+		/* Check and wait for the disable status of the queue */
+		ret = i40e_pf_txq_wait(pf, pf_q, false);
+		if (ret) {
+			dev_info(&pf->pdev->dev,
+				 "VSI seid %d XDP Tx ring %d disable timeout\n",
+				 vsi->seid, pf_q);
+			return ret;
+		}
+	}
+
 	return 0;
 }
 
@@ -5474,6 +5593,8 @@ void i40e_down(struct i40e_vsi *vsi)
 
 	for (i = 0; i < vsi->num_queue_pairs; i++) {
 		i40e_clean_tx_ring(vsi->tx_rings[i]);
+		if (i40e_enabled_xdp_vsi(vsi))
+			i40e_clean_tx_ring(vsi->xdp_rings[i]);
 		i40e_clean_rx_ring(vsi->rx_rings[i]);
 	}
 
@@ -7446,6 +7567,16 @@ static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi, bool alloc_qvectors)
 		return -ENOMEM;
 	vsi->rx_rings = &vsi->tx_rings[vsi->alloc_queue_pairs];
 
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		size = sizeof(struct i40e_ring *) *
+		       i40e_alloc_queue_pairs_xdp_vsi(vsi);
+		vsi->xdp_rings = kzalloc(size, GFP_KERNEL);
+		if (!vsi->xdp_rings) {
+			ret = -ENOMEM;
+			goto err_xdp_rings;
+		}
+	}
+
 	if (alloc_qvectors) {
 		/* allocate memory for q_vector pointers */
 		size = sizeof(struct i40e_q_vector *) * vsi->num_q_vectors;
@@ -7458,6 +7589,8 @@ static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi, bool alloc_qvectors)
 	return ret;
 
 err_vectors:
+	kfree(vsi->xdp_rings);
+err_xdp_rings:
 	kfree(vsi->tx_rings);
 	return ret;
 }
@@ -7564,6 +7697,8 @@ static void i40e_vsi_free_arrays(struct i40e_vsi *vsi, bool free_qvectors)
 	kfree(vsi->tx_rings);
 	vsi->tx_rings = NULL;
 	vsi->rx_rings = NULL;
+	kfree(vsi->xdp_rings);
+	vsi->xdp_rings = NULL;
 }
 
 /**
@@ -7649,6 +7784,13 @@ static void i40e_vsi_clear_rings(struct i40e_vsi *vsi)
 			vsi->rx_rings[i] = NULL;
 		}
 	}
+
+	if (vsi->xdp_rings && vsi->xdp_rings[0]) {
+		for (i = 0; i < vsi->alloc_queue_pairs; i++) {
+			kfree_rcu(vsi->xdp_rings[i], rcu);
+			vsi->xdp_rings[i] = NULL;
+		}
+	}
 }
 
 /**
@@ -7696,6 +7838,31 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		vsi->rx_rings[i] = rx_ring;
 	}
 
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return 0;
+
+	for (i = 0; i < vsi->alloc_queue_pairs; i++) {
+		tx_ring = kzalloc(sizeof(*tx_ring), GFP_KERNEL);
+		if (!tx_ring)
+			goto err_out;
+
+		tx_ring->queue_index = vsi->alloc_queue_pairs + i;
+		tx_ring->reg_idx = vsi->base_queue + vsi->alloc_queue_pairs + i;
+		tx_ring->ring_active = false;
+		tx_ring->vsi = vsi;
+		tx_ring->netdev = NULL;
+		tx_ring->dev = &pf->pdev->dev;
+		tx_ring->count = vsi->num_desc;
+		tx_ring->size = 0;
+		tx_ring->dcb_tc = 0;
+		if (vsi->back->flags & I40E_FLAG_WB_ON_ITR_CAPABLE)
+			tx_ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
+		tx_ring->tx_itr_setting = pf->tx_itr_default;
+		tx_ring->xdp_sibling = vsi->rx_rings[i];
+		vsi->xdp_rings[i] = tx_ring;
+		vsi->rx_rings[i]->xdp_sibling = tx_ring;
+	}
+
 	return 0;
 
 err_out:
@@ -9924,6 +10091,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 	struct i40e_pf *pf;
 	u8 enabled_tc;
 	int ret;
+	u16 alloc_queue_pairs;
 
 	if (!vsi)
 		return NULL;
@@ -9939,11 +10107,13 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 	if (ret)
 		goto err_vsi;
 
-	ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs, vsi->idx);
+	alloc_queue_pairs = vsi->alloc_queue_pairs +
+			    i40e_alloc_queue_pairs_xdp_vsi(vsi);
+	ret = i40e_get_lump(pf, pf->qp_pile, alloc_queue_pairs, vsi->idx);
 	if (ret < 0) {
 		dev_info(&pf->pdev->dev,
 			 "failed to get tracking for %d queues for VSI %d err %d\n",
-			 vsi->alloc_queue_pairs, vsi->seid, ret);
+			 alloc_queue_pairs, vsi->seid, ret);
 		goto err_vsi;
 	}
 	vsi->base_queue = ret;
@@ -10001,6 +10171,7 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
 	struct i40e_veb *veb = NULL;
 	int ret, i;
 	int v_idx;
+	u16 alloc_queue_pairs;
 
 	/* The requested uplink_seid must be either
 	 *     - the PF's port seid
@@ -10085,13 +10256,15 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
 		pf->lan_vsi = v_idx;
 	else if (type == I40E_VSI_SRIOV)
 		vsi->vf_id = param1;
+
+	alloc_queue_pairs = vsi->alloc_queue_pairs +
+			    i40e_alloc_queue_pairs_xdp_vsi(vsi);
 	/* assign it some queues */
-	ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs,
-				vsi->idx);
+	ret = i40e_get_lump(pf, pf->qp_pile, alloc_queue_pairs,	vsi->idx);
 	if (ret < 0) {
 		dev_info(&pf->pdev->dev,
 			 "failed to get tracking for %d queues for VSI %d err=%d\n",
-			 vsi->alloc_queue_pairs, vsi->seid, ret);
+			 alloc_queue_pairs, vsi->seid, ret);
 		goto err_vsi;
 	}
 	vsi->base_queue = ret;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d835a51dafa6..072338cda2f4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -520,6 +520,8 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
 	if (tx_buffer->skb) {
 		if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
 			kfree(tx_buffer->raw_buf);
+		else if (tx_buffer->tx_flags & I40E_TX_FLAGS_XDP)
+			put_page(tx_buffer->page);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 		if (dma_unmap_len(tx_buffer, len))
@@ -620,6 +622,15 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw)
 #define WB_STRIDE 4
 
 /**
+ * i40e_page_is_reserved - check if reuse is possible
+ * @page: page struct to check
+ */
+static inline bool i40e_page_is_reserved(struct page *page)
+{
+	return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
+}
+
+/**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
  * @tx_ring: Tx ring to clean
@@ -762,6 +773,98 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 	return !!budget;
 }
 
+static bool i40e_clean_xdp_irq(struct i40e_vsi *vsi,
+			       struct i40e_ring *tx_ring)
+{
+	u16 i = tx_ring->next_to_clean;
+	struct i40e_tx_buffer *tx_buf;
+	struct i40e_tx_desc *tx_head;
+	struct i40e_tx_desc *tx_desc;
+	unsigned int total_bytes = 0, total_packets = 0;
+	unsigned int budget = vsi->work_limit;
+
+	tx_buf = &tx_ring->tx_bi[i];
+	tx_desc = I40E_TX_DESC(tx_ring, i);
+	i -= tx_ring->count;
+
+	tx_head = I40E_TX_DESC(tx_ring, i40e_get_head(tx_ring));
+
+	do {
+		struct i40e_tx_desc *eop_desc = tx_buf->next_to_watch;
+
+		/* if next_to_watch is not set then there is no work pending */
+		if (!eop_desc)
+			break;
+
+		/* prevent any other reads prior to eop_desc */
+		read_barrier_depends();
+
+		/* we have caught up to head, no work left to do */
+		if (tx_head == tx_desc)
+			break;
+
+		/* clear next_to_watch to prevent false hangs */
+		tx_buf->next_to_watch = NULL;
+
+		/* update the statistics for this packet */
+		total_bytes += tx_buf->bytecount;
+		total_packets += tx_buf->gso_segs;
+
+		put_page(tx_buf->page);
+
+		/* unmap skb header data */
+		dma_unmap_single(tx_ring->dev,
+				 dma_unmap_addr(tx_buf, dma),
+				 dma_unmap_len(tx_buf, len),
+				 DMA_TO_DEVICE);
+
+		/* clear tx_buffer data */
+		tx_buf->skb = NULL;
+		dma_unmap_len_set(tx_buf, len, 0);
+
+		/* move us one more past the eop_desc for start of next pkt */
+		tx_buf++;
+		tx_desc++;
+		i++;
+		if (unlikely(!i)) {
+			i -= tx_ring->count;
+			tx_buf = tx_ring->tx_bi;
+			tx_desc = I40E_TX_DESC(tx_ring, 0);
+		}
+
+		prefetch(tx_desc);
+
+		/* update budget accounting */
+		budget--;
+	} while (likely(budget));
+
+	i += tx_ring->count;
+	tx_ring->next_to_clean = i;
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && (j > 0) &&
+		    !test_bit(__I40E_DOWN, &vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+
+	return !!budget;
+}
+
 /**
  * i40e_enable_wb_on_itr - Arm hardware to do a wb, interrupts are not enabled
  * @vsi: the VSI we care about
@@ -1496,35 +1599,40 @@ static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb)
 }
 
 /**
- * i40e_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
+ * i40e_can_reuse_rx_page - attempts to flip a page for reuse
+ * @rx_buffer: The buffer to alter
+ * @truesize: true size of the Rx buffer, used to flip/advance the page offset
  *
- * Synchronizes page for reuse by the adapter
- **/
-static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
-			       struct i40e_rx_buffer *old_buff)
+ * Returns true if the page was successfully flipped and can be
+ * reused.
+ */
+static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer *rx_buffer,
+				   unsigned int truesize)
 {
-	struct i40e_rx_buffer *new_buff;
-	u16 nta = rx_ring->next_to_alloc;
+	/* avoid re-using remote pages */
+	if (unlikely(i40e_page_is_reserved(rx_buffer->page)))
+		return false;
 
-	new_buff = &rx_ring->rx_bi[nta];
+#if (PAGE_SIZE < 8192)
+	/* if we are only owner of page we can reuse it */
+	if (unlikely(page_count(rx_buffer->page) != 1))
+		return false;
 
-	/* update, and store next to alloc */
-	nta++;
-	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+	/* flip page offset to other buffer */
+	rx_buffer->page_offset ^= truesize;
+#else
+	/* move offset up to the next cache line */
+	rx_buffer->page_offset += truesize;
 
-	/* transfer page from old buffer to new buffer */
-	*new_buff = *old_buff;
-}
+	if (rx_buffer->page_offset > (PAGE_SIZE - I40E_RXBUFFER_2048))
+		return false;
+#endif
 
-/**
- * i40e_page_is_reserved - check if reuse is possible
- * @page: page struct to check
- */
-static inline bool i40e_page_is_reserved(struct page *page)
-{
-	return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
+	/* Even if we own the page, we are not allowed to use atomic_set()
+	 * This would break get_page_unless_zero() users.
+	 */
+	get_page(rx_buffer->page);
+
+	return true;
 }
 
 /**
@@ -1555,7 +1663,6 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
 	unsigned int truesize = I40E_RXBUFFER_2048;
 #else
 	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
-	unsigned int last_offset = PAGE_SIZE - I40E_RXBUFFER_2048;
 #endif
 
 	/* will the data fit in the skb we allocated? if so, just
@@ -1578,34 +1685,107 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
 			rx_buffer->page_offset, size, truesize);
 
-	/* avoid re-using remote pages */
-	if (unlikely(i40e_page_is_reserved(page)))
-		return false;
+	return i40e_can_reuse_rx_page(rx_buffer, truesize);
+}
 
-#if (PAGE_SIZE < 8192)
-	/* if we are only owner of page we can reuse it */
-	if (unlikely(page_count(page) != 1))
+/**
+ * i40e_xdp_xmit_tail_bump - updates the tail and sets the RS bit
+ * @xdp_ring: XDP Tx ring
+ **/
+static
+void i40e_xdp_xmit_tail_bump(struct i40e_ring *xdp_ring)
+{
+	struct i40e_tx_desc *tx_desc;
+
+	/* Set RS and bump tail */
+	tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->curr_in_use);
+	tx_desc->cmd_type_offset_bsz |=
+		cpu_to_le64(I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
+	/* Force memory writes to complete before letting h/w know
+	 * there are new descriptors to fetch.  (Only applicable for
+	 * weak-ordered memory model archs, such as IA-64).
+	 */
+	wmb();
+	writel(xdp_ring->curr_in_use, xdp_ring->tail);
+
+	xdp_ring->xdp_needs_tail_bump = false;
+}
+
+/**
+ * i40e_xdp_xmit - transmit a frame on the XDP Tx queue
+ * @data: frame data, residing in @page
+ * @size: size of the frame
+ * @page: page containing the frame
+ * @xdp_ring: XDP Tx ring
+ *
+ * Returns true if the frame was successfully sent.
+ **/
+static bool i40e_xdp_xmit(void *data, size_t size, struct page *page,
+			  struct i40e_ring *xdp_ring)
+{
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	u16 i = xdp_ring->next_to_use;
+	dma_addr_t dma;
+
+	if (unlikely(I40E_DESC_UNUSED(xdp_ring) < 1)) {
+		if (xdp_ring->xdp_needs_tail_bump)
+			i40e_xdp_xmit_tail_bump(xdp_ring);
+		xdp_ring->tx_stats.tx_busy++;
 		return false;
+	}
 
-	/* flip page offset to other buffer */
-	rx_buffer->page_offset ^= truesize;
-#else
-	/* move offset up to the next cache line */
-	rx_buffer->page_offset += truesize;
+	tx_bi = &xdp_ring->tx_bi[i];
+	tx_bi->bytecount = size;
+	tx_bi->gso_segs = 1;
+	tx_bi->tx_flags = I40E_TX_FLAGS_XDP;
+	tx_bi->page = page;
 
-	if (rx_buffer->page_offset > last_offset)
+	dma = dma_map_single(xdp_ring->dev, data, size, DMA_TO_DEVICE);
+	if (dma_mapping_error(xdp_ring->dev, dma))
 		return false;
-#endif
 
-	/* Even if we own the page, we are not allowed to use atomic_set()
-	 * This would break get_page_unless_zero() users.
-	 */
-	get_page(rx_buffer->page);
+	/* record length, and DMA address */
+	dma_unmap_len_set(tx_bi, len, size);
+	dma_unmap_addr_set(tx_bi, dma, dma);
 
+	tx_desc = I40E_TX_DESC(xdp_ring, i);
+	tx_desc->buffer_addr = cpu_to_le64(dma);
+	tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+						  | I40E_TX_DESC_CMD_EOP,
+						  0, size, 0);
+	tx_bi->next_to_watch = tx_desc;
+	xdp_ring->curr_in_use = i++;
+	xdp_ring->next_to_use = (i < xdp_ring->count) ? i : 0;
+	xdp_ring->xdp_needs_tail_bump = true;
 	return true;
 }
 
 /**
+ * i40e_reuse_rx_page - page flip buffer and store it back on the ring
+ * @rx_ring: rx descriptor ring to store buffers on
+ * @old_buff: donor buffer to have page reused
+ *
+ * Synchronizes page for reuse by the adapter
+ **/
+static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
+			       struct i40e_rx_buffer *old_buff)
+{
+	struct i40e_rx_buffer *new_buff;
+	u16 nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	*new_buff = *old_buff;
+}
+
+/**
  * i40e_run_xdp - Runs an XDP program for an Rx ring
  * @rx_ring: Rx ring used for XDP
  * @rx_buffer: current Rx buffer
@@ -1623,8 +1803,14 @@ static bool i40e_run_xdp(struct i40e_ring *rx_ring,
 	u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
 	unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
 			    I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+#if (PAGE_SIZE < 8192)
+	unsigned int truesize = I40E_RXBUFFER_2048;
+#else
+	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
+#endif
 	struct xdp_buff xdp;
 	u32 xdp_action;
+	bool tx_ok;
 
 	WARN_ON(!i40e_test_staterr(rx_desc,
 				   BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
@@ -1636,21 +1822,34 @@ static bool i40e_run_xdp(struct i40e_ring *rx_ring,
 	switch (xdp_action) {
 	case XDP_PASS:
 		return false;
-	default:
-		bpf_warn_invalid_xdp_action(xdp_action);
-	case XDP_ABORTED:
 	case XDP_TX:
+		tx_ok = i40e_xdp_xmit(xdp.data, size, rx_buffer->page,
+				      rx_ring->xdp_sibling);
+		if (likely(tx_ok)) {
+			if (i40e_can_reuse_rx_page(rx_buffer, truesize)) {
+				i40e_reuse_rx_page(rx_ring, rx_buffer);
+				rx_ring->rx_stats.page_reuse_count++;
+			} else {
+				dma_unmap_page(rx_ring->dev, rx_buffer->dma,
+					       PAGE_SIZE, DMA_FROM_DEVICE);
+			}
+			break;
+		}
+	case XDP_ABORTED:
 	case XDP_DROP:
+do_drop:
 		if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
 			i40e_reuse_rx_page(rx_ring, rx_buffer);
 			rx_ring->rx_stats.page_reuse_count++;
 			break;
 		}
-
-		/* we are not reusing the buffer so unmap it */
 		dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
 			       DMA_FROM_DEVICE);
 		__free_pages(rx_buffer->page, 0);
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(xdp_action);
+		goto do_drop;
 	}
 
 	/* clear contents of buffer_info */
@@ -2065,6 +2264,15 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 		ring->arm_wb = false;
 	}
 
+	i40e_for_each_ring(ring, q_vector->xdp) {
+		if (!i40e_clean_xdp_irq(vsi, ring)) {
+			clean_complete = false;
+			continue;
+		}
+		arm_wb |= ring->arm_wb;
+		ring->arm_wb = false;
+	}
+
 	/* Handle case where we are called by netpoll with a budget of 0 */
 	if (budget <= 0)
 		goto tx_only;
@@ -2077,6 +2285,9 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	i40e_for_each_ring(ring, q_vector->rx) {
 		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
 
+		if (ring->xdp_sibling && ring->xdp_sibling->xdp_needs_tail_bump)
+			i40e_xdp_xmit_tail_bump(ring->xdp_sibling);
+
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
 		if (cleaned >= budget_per_ring)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 957d856a82c4..4d9459134e69 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -220,6 +220,7 @@ static inline unsigned int i40e_txd_use_count(unsigned int size)
 #define I40E_TX_FLAGS_TSYN		BIT(8)
 #define I40E_TX_FLAGS_FD_SB		BIT(9)
 #define I40E_TX_FLAGS_UDP_TUNNEL	BIT(10)
+#define I40E_TX_FLAGS_XDP		BIT(11)
 #define I40E_TX_FLAGS_VLAN_MASK		0xffff0000
 #define I40E_TX_FLAGS_VLAN_PRIO_MASK	0xe0000000
 #define I40E_TX_FLAGS_VLAN_PRIO_SHIFT	29
@@ -230,6 +231,7 @@ struct i40e_tx_buffer {
 	union {
 		struct sk_buff *skb;
 		void *raw_buf;
+		struct page *page;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -343,6 +345,9 @@ struct i40e_ring {
 	u16 next_to_alloc;
 
 	struct bpf_prog *xdp_prog;
+	struct i40e_ring *xdp_sibling;  /* rx to xdp, and xdp to rx */
+	bool xdp_needs_tail_bump;
+	u16 curr_in_use;
 } ____cacheline_internodealigned_in_smp;
 
 enum i40e_latency_range {
-- 
2.9.3



* [Intel-wired-lan] [PATCH v2 3/3] i40e: Don't reset/rebuild rings on XDP program swap
  2016-12-09  8:22 [Intel-wired-lan] [PATCH v2 0/3] i40e: Support for XDP Björn Töpel
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support " Björn Töpel
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 2/3] i40e: Add XDP_TX support Björn Töpel
@ 2016-12-09  8:22 ` Björn Töpel
  2 siblings, 0 replies; 13+ messages in thread
From: Björn Töpel @ 2016-12-09  8:22 UTC (permalink / raw)
  To: intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Previously, swapping from one XDP program to another triggered a
reset/rebuild of the rings. Now, the XDP program is simply swapped
without that requirement.
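
The swap itself follows the usual RCU publish/retire pattern,
roughly (simplified from the hunks below):

  old_prog = rtnl_dereference(ring->xdp_prog);
  rcu_assign_pointer(ring->xdp_prog, new_prog);
  if (old_prog)
  	bpf_prog_put(old_prog);

with the Rx hot path reading the program under rcu_read_lock().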

Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |  4 +--
 drivers/net/ethernet/intel/i40e/i40e_main.c | 41 ++++++++++++++++++-----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 25 +++++++++++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  2 +-
 4 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index adc1f3f32729..9bc2a8cf5c2e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -549,7 +549,7 @@ struct i40e_vsi {
 	 * regular rings, i.e. alloc_queue_pairs/num_queue_pairs
 	 */
 	struct i40e_ring **xdp_rings;
-	struct bpf_prog *xdp_prog;
+	bool xdp_enabled;
 
 	u32  active_filters;
 	u32  promisc_threshold;
@@ -920,6 +920,6 @@ void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
  **/
 static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
 {
-	return vsi->xdp_prog;
+	return vsi->xdp_enabled;
 }
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 37bb03914a64..b8d26737de44 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3116,15 +3116,6 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	if (i40e_enabled_xdp_vsi(vsi)) {
-		struct bpf_prog *prog;
-
-		prog = bpf_prog_add(vsi->xdp_prog, 1);
-		if (IS_ERR(prog))
-			return PTR_ERR(prog);
-		ring->xdp_prog = prog;
-	}
-
 	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
@@ -9428,7 +9419,9 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 {
 	struct i40e_pf *pf = vsi->back;
 	struct net_device *netdev = vsi->netdev;
-	int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
+	int i, frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
+	bool need_reset;
+	struct bpf_prog *old_prog;
 
 	if (prog && prog->xdp_adjust_head)
 		return -EOPNOTSUPP;
@@ -9442,13 +9435,29 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	if (!i40e_enabled_xdp_vsi(vsi) && !prog)
 		return 0;
 
-	i40e_prep_for_reset(pf);
+	if (prog) {
+		prog = bpf_prog_add(prog, vsi->num_queue_pairs - 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
 
-	if (vsi->xdp_prog)
-		bpf_prog_put(vsi->xdp_prog);
-	vsi->xdp_prog = prog;
+	/* When turning XDP on->off/off->on we reset and rebuild the rings. */
+	need_reset = (i40e_enabled_xdp_vsi(vsi) != !!prog);
 
-	i40e_reset_and_rebuild(pf, true);
+	if (need_reset)
+		i40e_prep_for_reset(pf);
+
+	vsi->xdp_enabled = !!prog;
+
+	if (need_reset)
+		i40e_reset_and_rebuild(pf, true);
+
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		old_prog = rtnl_dereference(vsi->rx_rings[i]->xdp_prog);
+		rcu_assign_pointer(vsi->rx_rings[i]->xdp_prog, prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+	}
 	return 0;
 }
 
@@ -11743,7 +11752,9 @@ static void i40e_remove(struct pci_dev *pdev)
 		pf->flags &= ~I40E_FLAG_SRIOV_ENABLED;
 	}
 
+	rtnl_lock();
 	i40e_fdir_teardown(pf);
+	rtnl_unlock();
 
 	/* If there is a switch structure or any orphans, remove them.
 	 * This will leave only the PF's VSI remaining.
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 072338cda2f4..4ee84e877073 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1112,6 +1112,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	struct device *dev = rx_ring->dev;
 	unsigned long bi_size;
 	u16 i;
+	struct bpf_prog *old_prog;
 
 	/* ring already cleared, nothing to do */
 	if (!rx_ring->rx_bi)
@@ -1145,10 +1146,10 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
-	if (rx_ring->xdp_prog) {
-		bpf_prog_put(rx_ring->xdp_prog);
-		rx_ring->xdp_prog = NULL;
-	}
+	old_prog = rtnl_dereference(rx_ring->xdp_prog);
+	RCU_INIT_POINTER(rx_ring->xdp_prog, NULL);
+	if (old_prog)
+		bpf_prog_put(old_prog);
 }
 
 /**
@@ -1879,6 +1880,7 @@ bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
 {
 	struct i40e_rx_buffer *rx_buffer;
 	struct page *page;
+	struct bpf_prog *xdp_prog;
 
 	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
 	page = rx_buffer->page;
@@ -1891,14 +1893,17 @@ bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
 				      I40E_RXBUFFER_2048,
 				      DMA_FROM_DEVICE);
 
-	if (rx_ring->xdp_prog) {
-		bool xdp_consumed;
-
-		xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
-					    rx_desc, rx_ring->xdp_prog);
-		if (xdp_consumed)
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rx_ring->xdp_prog);
+	if (xdp_prog) {
+		bool xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
+						 rx_desc, xdp_prog);
+		if (xdp_consumed) {
+			rcu_read_unlock();
 			return true;
+		}
 	}
+	rcu_read_unlock();
 
 	*skb = rx_buffer->skb;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 4d9459134e69..cfb2c1016242 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -344,7 +344,7 @@ struct i40e_ring {
 	struct rcu_head rcu;		/* to avoid race on free */
 	u16 next_to_alloc;
 
-	struct bpf_prog *xdp_prog;
+	struct bpf_prog __rcu *xdp_prog;
 	struct i40e_ring *xdp_sibling;  /* rx to xdp, and xdp to rx */
 	bool xdp_needs_tail_bump;
 	u16 curr_in_use;
-- 
2.9.3



* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support " Björn Töpel
@ 2016-12-09 11:56   ` kbuild test robot
  2016-12-09 12:35     ` Björn Töpel
  2016-12-09 17:15   ` Alexander Duyck
  1 sibling, 1 reply; 13+ messages in thread
From: kbuild test robot @ 2016-12-09 11:56 UTC (permalink / raw)
  To: intel-wired-lan

Hi Björn,

[auto build test ERROR on next-20161208]
[cannot apply to jkirsher-next-queue/dev-queue v4.9-rc8 v4.9-rc7 v4.9-rc6 v4.9-rc8]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/i40e-Initial-support-for-XDP/20161209-182749
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/intel/i40e/i40e_main.c: In function 'i40e_xdp_setup':
>> drivers/net/ethernet/intel/i40e/i40e_main.c:9218:18: error: 'struct bpf_prog' has no member named 'xdp_adjust_head'
     if (prog && prog->xdp_adjust_head)
                     ^~

vim +9218 drivers/net/ethernet/intel/i40e/i40e_main.c

  9212				  struct bpf_prog *prog)
  9213	{
  9214		struct i40e_pf *pf = vsi->back;
  9215		struct net_device *netdev = vsi->netdev;
  9216		int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
  9217	
> 9218		if (prog && prog->xdp_adjust_head)
  9219			return -EOPNOTSUPP;
  9220	
  9221		if (frame_size > I40E_RXBUFFER_2048)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.gz
Type: application/gzip
Size: 57890 bytes
Desc: not available
URL: <http://lists.osuosl.org/pipermail/intel-wired-lan/attachments/20161209/12f41a90/attachment-0001.bin>


* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09 11:56   ` kbuild test robot
@ 2016-12-09 12:35     ` Björn Töpel
  2016-12-09 16:41       ` Alexander Duyck
  0 siblings, 1 reply; 13+ messages in thread
From: Björn Töpel @ 2016-12-09 12:35 UTC (permalink / raw)
  To: intel-wired-lan

Jeff,

I rebased the series on net-next v4.9-rc8-1700-gf006b2c5dff3.

2016-12-09 12:56 GMT+01:00 kbuild test robot <lkp@intel.com>:
> Hi Björn,
>
> [auto build test ERROR on next-20161208]
> [cannot apply to jkirsher-next-queue/dev-queue v4.9-rc8 v4.9-rc7 v4.9-rc6 v4.9-rc8]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/i40e-Initial-support-for-XDP/20161209-182749
> config: i386-allmodconfig (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    drivers/net/ethernet/intel/i40e/i40e_main.c: In function 'i40e_xdp_setup':
>>> drivers/net/ethernet/intel/i40e/i40e_main.c:9218:18: error: 'struct bpf_prog' has no member named 'xdp_adjust_head'
>      if (prog && prog->xdp_adjust_head)
>                      ^~
>
> vim +9218 drivers/net/ethernet/intel/i40e/i40e_main.c
>
>   9212                            struct bpf_prog *prog)
>   9213  {
>   9214          struct i40e_pf *pf = vsi->back;
>   9215          struct net_device *netdev = vsi->netdev;
>   9216          int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>   9217
>> 9218          if (prog && prog->xdp_adjust_head)

xdp_adjust_head was introduced in:
  17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")

Can I tell the builder to use this tree? What's the typical path here?


Thanks,
Björn


* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09 12:35     ` Björn Töpel
@ 2016-12-09 16:41       ` Alexander Duyck
  2016-12-09 17:20         ` Björn Töpel
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Duyck @ 2016-12-09 16:41 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Dec 9, 2016 at 4:35 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> Jeff,
>
> I rebased the series on net-next v4.9-rc8-1700-gf006b2c5dff3.
>
> 2016-12-09 12:56 GMT+01:00 kbuild test robot <lkp@intel.com>:
>> Hi Björn,
>>
>> [auto build test ERROR on next-20161208]
>> [cannot apply to jkirsher-next-queue/dev-queue v4.9-rc8 v4.9-rc7 v4.9-rc6 v4.9-rc8]
>> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>>
>> url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/i40e-Initial-support-for-XDP/20161209-182749
>> config: i386-allmodconfig (attached as .config)
>> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
>> reproduce:
>>         # save the attached .config to linux build tree
>>         make ARCH=i386
>>
>> All errors (new ones prefixed by >>):
>>
>>    drivers/net/ethernet/intel/i40e/i40e_main.c: In function 'i40e_xdp_setup':
>>>> drivers/net/ethernet/intel/i40e/i40e_main.c:9218:18: error: 'struct bpf_prog' has no member named 'xdp_adjust_head'
>>      if (prog && prog->xdp_adjust_head)
>>                      ^~
>>
>> vim +9218 drivers/net/ethernet/intel/i40e/i40e_main.c
>>
>>   9212                            struct bpf_prog *prog)
>>   9213  {
>>   9214          struct i40e_pf *pf = vsi->back;
>>   9215          struct net_device *netdev = vsi->netdev;
>>   9216          int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>>   9217
>>> 9218          if (prog && prog->xdp_adjust_head)
>
> xdp_adjust_head was introduced in:
>   17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")
>
> Can I tell the builder to use this tree? What's the typical path here?

Jeff needs to rebase the next-queue off of that tree in order for it
to get the necessary bits included.

- Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support " Björn Töpel
  2016-12-09 11:56   ` kbuild test robot
@ 2016-12-09 17:15   ` Alexander Duyck
  2016-12-09 17:54     ` John Fastabend
  2016-12-12  8:08     ` Björn Töpel
  1 sibling, 2 replies; 13+ messages in thread
From: Alexander Duyck @ 2016-12-09 17:15 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Dec 9, 2016 at 12:22 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This commit adds basic XDP support for i40e derived NICs. All XDP
> actions will end up in XDP_DROP.
>
> Only the default/main VSI has support for enabling XDP.
>
> Acked-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e.h         |  13 +++
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  77 +++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 146 ++++++++++++++++++++-----
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +
>  5 files changed, 216 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index ba8d30984bee..05d805f439e6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -545,6 +545,8 @@ struct i40e_vsi {
>         struct i40e_ring **rx_rings;
>         struct i40e_ring **tx_rings;
>
> +       struct bpf_prog *xdp_prog;
> +
>         u32  active_filters;
>         u32  promisc_threshold;
>
> @@ -904,4 +906,15 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
> +
> +/**
> + * i40e_enabled_xdp_vsi - Check if VSI has XDP enabled
> + * @vsi: pointer to a vsi
> + *
> + * Returns true if the VSI has XDP enabled.
> + **/
> +static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
> +{
> +       return vsi->xdp_prog;
> +}

It might be better to have this return !!vsi->xdp_prog.  That way it
is explicitly casting the return value as a boolean value.
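
Something along these lines (untested):

	static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
	{
		return !!vsi->xdp_prog;
	}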

>  #endif /* _I40E_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> index cc1465aac2ef..831bbc208fc8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> @@ -1254,6 +1254,9 @@ static int i40e_set_ringparam(struct net_device *netdev,
>         if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>                 return -EINVAL;
>
> +       if (i40e_enabled_xdp_vsi(vsi))
> +               return -EINVAL;
> +
>         if (ring->tx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>             ring->tx_pending < I40E_MIN_NUM_DESCRIPTORS ||
>             ring->rx_pending > I40E_MAX_NUM_DESCRIPTORS ||
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index da4cbe32eb86..b247c3231170 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -24,6 +24,7 @@
>   *
>   ******************************************************************************/
>
> +#include <linux/bpf.h>
>  #include <linux/etherdevice.h>
>  #include <linux/of_net.h>
>  #include <linux/pci.h>
> @@ -2431,6 +2432,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>         struct i40e_netdev_priv *np = netdev_priv(netdev);
>         struct i40e_vsi *vsi = np->vsi;
>
> +       if (i40e_enabled_xdp_vsi(vsi)) {
> +               int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
> +
> +               if (max_frame > I40E_RXBUFFER_2048)
> +                       return -EINVAL;
> +       }
> +

I suppose this works for now, but we should probably be using a value
smaller than 2048.  Most likely we are going to need to restrict this
down to about 1.5K or something in that neighborhood.  I am already
looking at changes that clean most of this up to prep it for the DMA
and page count fixes.  To resolve the jumbo frames case we are
probably looking at using a 3K buffer size with an order 1 page on
x86.  That will give us enough room for adding headers and/or any
trailers needed to the frame.
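
For reference, with a standard 1500 byte MTU the math above works out
to 1500 + 14 (ETH_HLEN) + 4 (ETH_FCS_LEN) + 4 (VLAN_HLEN) = 1522
bytes, well under I40E_RXBUFFER_2048, so the current check only starts
rejecting MTUs above 2026.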

>         netdev_info(netdev, "changing MTU from %d to %d\n",
>                     netdev->mtu, new_mtu);
>         netdev->mtu = new_mtu;
> @@ -3085,6 +3093,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>         writel(0, ring->tail);
>
> +       if (i40e_enabled_xdp_vsi(vsi)) {
> +               struct bpf_prog *prog;
> +
> +               prog = bpf_prog_add(vsi->xdp_prog, 1);
> +               if (IS_ERR(prog))
> +                       return PTR_ERR(prog);
> +               ring->xdp_prog = prog;
> +       }
> +
>         i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>
>         return 0;
> @@ -9234,6 +9251,65 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>  }
>
> +/**
> + * i40e_xdp_setup - Add/remove an XDP program to a VSI
> + * @vsi: the VSI to add the program
> + * @prog: the XDP program
> + **/
> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
> +                         struct bpf_prog *prog)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct net_device *netdev = vsi->netdev;
> +       int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
> +
> +       if (prog && prog->xdp_adjust_head)
> +               return -EOPNOTSUPP;
> +
> +       if (frame_size > I40E_RXBUFFER_2048)
> +               return -EINVAL;
> +
> +       if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
> +               return -EINVAL;
> +

Why would MSI-X matter?  Is this because you are trying to reserve
extra Tx rings for the XDP program?

If so we might want to update i40e to make it more ixgbe like since it
can in theory support multiple queues with a single vector.  It might
be useful to add comments on why each of these criteria must be met
when you perform the checks.

> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
> +               return 0;
> +
> +       i40e_prep_for_reset(pf);
> +
> +       if (vsi->xdp_prog)
> +               bpf_prog_put(vsi->xdp_prog);
> +       vsi->xdp_prog = prog;
> +
> +       i40e_reset_and_rebuild(pf, true);
> +       return 0;
> +}
> +
> +/**
> + * i40e_xdp - NDO for enabled/query
> + * @dev: the netdev
> + * @xdp: XDP program
> + **/
> +static int i40e_xdp(struct net_device *dev,
> +                   struct netdev_xdp *xdp)
> +{
> +       struct i40e_netdev_priv *np = netdev_priv(dev);
> +       struct i40e_vsi *vsi = np->vsi;
> +
> +       if (vsi->type != I40E_VSI_MAIN)
> +               return -EINVAL;
> +
> +       switch (xdp->command) {
> +       case XDP_SETUP_PROG:
> +               return i40e_xdp_setup(vsi, xdp->prog);
> +       case XDP_QUERY_PROG:
> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
> +               return 0;
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
>  static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_open               = i40e_open,
>         .ndo_stop               = i40e_close,
> @@ -9270,6 +9346,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_features_check     = i40e_features_check,
>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
> +       .ndo_xdp                = i40e_xdp,
>  };
>
>  /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 352cf7cd2ef4..d835a51dafa6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -24,6 +24,7 @@
>   *
>   ******************************************************************************/
>
> +#include <linux/bpf.h>
>  #include <linux/prefetch.h>
>  #include <net/busy_poll.h>
>  #include "i40e.h"
> @@ -1040,6 +1041,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>         rx_ring->next_to_alloc = 0;
>         rx_ring->next_to_clean = 0;
>         rx_ring->next_to_use = 0;
> +
> +       if (rx_ring->xdp_prog) {
> +               bpf_prog_put(rx_ring->xdp_prog);
> +               rx_ring->xdp_prog = NULL;
> +       }
>  }
>
>  /**
> @@ -1600,30 +1606,104 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
>  }
>
>  /**
> + * i40e_run_xdp - Runs an XDP program for an Rx ring
> + * @rx_ring: Rx ring used for XDP
> + * @rx_buffer: current Rx buffer
> + * @rx_desc: current Rx descriptor
> + * @xdp_prog: the XDP program to run
> + *
> + * Returns true if the XDP program consumed the incoming frame. False
> + * means pass the frame to the good old stack.
> + **/
> +static bool i40e_run_xdp(struct i40e_ring *rx_ring,
> +                        struct i40e_rx_buffer *rx_buffer,
> +                        union i40e_rx_desc *rx_desc,
> +                        struct bpf_prog *xdp_prog)
> +{
> +       u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +       unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> +                           I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
> +       struct xdp_buff xdp;
> +       u32 xdp_action;
> +
> +       WARN_ON(!i40e_test_staterr(rx_desc,
> +                                  BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
> +

We can't just do a WARN_ON if we don't have the end of the frame.
Also it probably doesn't even matter unless we are looking at the
first frame.  I would argue that this might be better as a BUG_ON
since it should not happen.
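
That is, simply:

	BUG_ON(!i40e_test_staterr(rx_desc,
				  BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));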

> +       xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset;
> +       xdp.data_end = xdp.data + size;
> +       xdp_action = bpf_prog_run_xdp(xdp_prog, &xdp);
> +
> +       switch (xdp_action) {
> +       case XDP_PASS:
> +               return false;
> +       default:
> +               bpf_warn_invalid_xdp_action(xdp_action);
> +       case XDP_ABORTED:
> +       case XDP_TX:
> +       case XDP_DROP:
> +               if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
> +                       i40e_reuse_rx_page(rx_ring, rx_buffer);
> +                       rx_ring->rx_stats.page_reuse_count++;
> +                       break;
> +               }
> +
> +               /* we are not reusing the buffer so unmap it */
> +               dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
> +                              DMA_FROM_DEVICE);
> +               __free_pages(rx_buffer->page, 0);
> +       }
> +
> +       /* clear contents of buffer_info */
> +       rx_buffer->page = NULL;
> +       return true; /* Swallowed by XDP */
> +}
> +
> +/**
>   * i40e_fetch_rx_buffer - Allocate skb and populate it
>   * @rx_ring: rx descriptor ring to transact packets on
>   * @rx_desc: descriptor containing info written by hardware
> + * @skb: The allocated skb, if any
>   *
> - * This function allocates an skb on the fly, and populates it with the page
> - * data from the current receive descriptor, taking care to set up the skb
> - * correctly, as well as handling calling the page recycle function if
> - * necessary.
> + * Unless XDP is enabled, this function allocates an skb on the fly,
> + * and populates it with the page data from the current receive
> + * descriptor, taking care to set up the skb correctly, as well as
> + * handling calling the page recycle function if necessary.
> + *
> + * If the received frame was handled by XDP, true is
> + * returned. Otherwise, the skb is returned to the caller via the skb
> + * parameter.
>   */
>  static inline
> -struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
> -                                    union i40e_rx_desc *rx_desc)
> +bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
> +                         union i40e_rx_desc *rx_desc,
> +                         struct sk_buff **skb)
>  {
>         struct i40e_rx_buffer *rx_buffer;
> -       struct sk_buff *skb;
>         struct page *page;
>
>         rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
>         page = rx_buffer->page;
>         prefetchw(page);
>
> -       skb = rx_buffer->skb;

I'm not sure where this line came from.  It was dropped from the
driver in next-queue a while ago so I don't think your patch applies
currently.

> +       /* we are reusing so sync this buffer for CPU use */
> +       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                     rx_buffer->dma,
> +                                     rx_buffer->page_offset,
> +                                     I40E_RXBUFFER_2048,
> +                                     DMA_FROM_DEVICE);

I would prefer if moving this piece up was handled in a separate
patch.  It just makes things more readable that way and I have patches
I will be submitting that will pull this and the few lines ahead of it
out of the fetch_rx_buffer entirely.

> +
> +       if (rx_ring->xdp_prog) {
> +               bool xdp_consumed;
> +
> +               xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
> +                                           rx_desc, rx_ring->xdp_prog);
> +               if (xdp_consumed)
> +                       return true;
> +       }
>
> -       if (likely(!skb)) {
> +       *skb = rx_buffer->skb;
> +
> +       if (likely(!*skb)) {

Instead of making skb a double pointer we can just change a few things
to make this all simpler.  For example one thing we could look at
doing is taking an hlist_nulls type approach where we signal with a
value of 1 to indicate that XDP stole the buffer so there is no skb to
assign.  Then you just have to clean it up in i40e_is_non_eop by doing
a "&" and continue from there, or could add your own block after
i40e_is_non_eop.

For that matter we could probably consolidate a few things into
i40e_cleanup_headers and have it handled there.
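
Rough sketch of the sentinel idea (names made up, untested):

	/* low bit set: XDP consumed the buffer, there is no skb */
	#define I40E_XDP_CONSUMED ((struct sk_buff *)0x1UL)

	skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
	if ((unsigned long)skb & 0x1UL) {
		/* buffer already recycled/freed by XDP, just move on */
		cleaned_count++;
		continue;
	}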

>                 void *page_addr = page_address(page) + rx_buffer->page_offset;
>
>                 /* prefetch first cache line of first page */
> @@ -1633,32 +1713,25 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>  #endif
>
>                 /* allocate a skb to store the frags */
> -               skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
> -                                      I40E_RX_HDR_SIZE,
> -                                      GFP_ATOMIC | __GFP_NOWARN);
> -               if (unlikely(!skb)) {
> +               *skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
> +                                       I40E_RX_HDR_SIZE,
> +                                       GFP_ATOMIC | __GFP_NOWARN);
> +               if (unlikely(!*skb)) {
>                         rx_ring->rx_stats.alloc_buff_failed++;
> -                       return NULL;
> +                       return false;
>                 }
>
>                 /* we will be copying header into skb->data in
>                  * pskb_may_pull so it is in our interest to prefetch
>                  * it now to avoid a possible cache miss
>                  */
> -               prefetchw(skb->data);
> +               prefetchw((*skb)->data);
>         } else {
>                 rx_buffer->skb = NULL;
>         }
>
> -       /* we are reusing so sync this buffer for CPU use */
> -       dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                     rx_buffer->dma,
> -                                     rx_buffer->page_offset,
> -                                     I40E_RXBUFFER_2048,
> -                                     DMA_FROM_DEVICE);
> -
>         /* pull page into skb */
> -       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
> +       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, *skb)) {
>                 /* hand second half of page back to the ring */
>                 i40e_reuse_rx_page(rx_ring, rx_buffer);
>                 rx_ring->rx_stats.page_reuse_count++;
> @@ -1671,7 +1744,7 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>         /* clear contents of buffer_info */
>         rx_buffer->page = NULL;
>
> -       return skb;
> +       return false;
>  }
>
>  /**
> @@ -1716,6 +1789,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>  }
>
>  /**
> + * i40e_update_rx_next_to_clean - Bumps the next-to-clean for an Rx ring
> + * @rx_ring: Rx ring to bump
> + **/
> +static void i40e_update_rx_next_to_clean(struct i40e_ring *rx_ring)
> +{
> +       u32 ntc = rx_ring->next_to_clean + 1;
> +
> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
> +       rx_ring->next_to_clean = ntc;
> +
> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
> +}
> +
> +/**
>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>   * @rx_ring: rx descriptor ring to transact packets on
>   * @budget: Total limit on number of packets to process
> @@ -1739,6 +1826,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                 u16 vlan_tag;
>                 u8 rx_ptype;
>                 u64 qword;
> +               bool xdp_consumed;
>
>                 /* return some buffers to hardware, one at a time is too slow */
>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> @@ -1764,7 +1852,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>                  */
>                 dma_rmb();
>
> -               skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
> +               xdp_consumed = i40e_fetch_rx_buffer(rx_ring, rx_desc, &skb);
> +               if (xdp_consumed) {
> +                       cleaned_count++;
> +
> +                       i40e_update_rx_next_to_clean(rx_ring);
> +                       total_rx_packets++;
> +                       continue;
> +               }
> +

There are a few bits here that aren't quite handled correctly.  It
looks like we are getting the total_rx_packets, but you didn't update
total_rx_bytes.
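
Assuming qword still holds the descriptor status/len word at that
point, something like this in the XDP-consumed branch should do
(untested):

	unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
			    I40E_RXD_QW1_LENGTH_PBUF_SHIFT;

	total_rx_bytes += size;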

>                 if (!skb)
>                         break;
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index e065321ce8ed..957d856a82c4 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -341,6 +341,8 @@ struct i40e_ring {
>
>         struct rcu_head rcu;            /* to avoid race on free */
>         u16 next_to_alloc;
> +
> +       struct bpf_prog *xdp_prog;
>  } ____cacheline_internodealigned_in_smp;
>

You might be better off moving this further up in the structure.
Maybe into the slot after *tail.  Otherwise it is going to be in a
pretty noisy cache line, although it looks like nobody ever really
bothered to optimize the i40e_ring structure so that you had
read-mostly and write-mostly cache lines anyway, so maybe you don't
need to bother.
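
I.e. something like (exact slot up for debate):

	struct i40e_ring {
		...
		u8 __iomem *tail;

		struct bpf_prog *xdp_prog;	/* read-mostly in the hot path */
		...
	};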

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09 16:41       ` Alexander Duyck
@ 2016-12-09 17:20         ` Björn Töpel
  0 siblings, 0 replies; 13+ messages in thread
From: Björn Töpel @ 2016-12-09 17:20 UTC (permalink / raw)
  To: intel-wired-lan

2016-12-09 17:41 GMT+01:00 Alexander Duyck <alexander.duyck@gmail.com>:
[...]
>
> Jeff needs to rebase the next-queue off of that tree in order for it
> to get the necessary bits included.

Got it! Do you know if I can re-trigger the build after that, or do I
need to resubmit the patch set?

Björn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09 17:15   ` Alexander Duyck
@ 2016-12-09 17:54     ` John Fastabend
  2016-12-12  8:25       ` Björn Töpel
  2016-12-12  8:08     ` Björn Töpel
  1 sibling, 1 reply; 13+ messages in thread
From: John Fastabend @ 2016-12-09 17:54 UTC (permalink / raw)
  To: intel-wired-lan

On 16-12-09 09:15 AM, Alexander Duyck wrote:
> On Fri, Dec 9, 2016 at 12:22 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This commit adds basic XDP support for i40e derived NICs. All XDP
>> actions will end up in XDP_DROP.
>>
>> Only the default/main VSI has support for enabling XDP.
>>
>> Acked-by: John Fastabend <john.r.fastabend@intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e.h         |  13 +++
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  77 +++++++++++++
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 146 ++++++++++++++++++++-----
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +
>>  5 files changed, 216 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>> index ba8d30984bee..05d805f439e6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -545,6 +545,8 @@ struct i40e_vsi {
>>         struct i40e_ring **rx_rings;
>>         struct i40e_ring **tx_rings;
>>
>> +       struct bpf_prog *xdp_prog;
>> +

This is being protected by RCU, right? At least in patch 3 that seems
to be the case. We should add a __rcu annotation here,

  struct bpf_prog __rcu *xdp_prog

Then also run

  make C=2 ./drivers/net/ethernet/intel/i40e/

This should help catch any RCU errors. Oh, you also need the
PROVE_RCU config option set in .config.
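
With the __rcu annotation in place, the usage pattern would be
roughly (untested):

	/* control path (setup/teardown) */
	rcu_assign_pointer(vsi->xdp_prog, prog);

	/* datapath, inside the RCU read-side section */
	prog = rcu_dereference(vsi->xdp_prog);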

I wonder if patch 3 could be squashed with patch 1 here? Or do you
think this patch is easier to read without it?

>>         u32  active_filters;
>>         u32  promisc_threshold;
>>
>> @@ -904,4 +906,15 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
>> +
>> +/**
>> + * i40e_enabled_xdp_vsi - Check if VSI has XDP enabled
>> + * @vsi: pointer to a vsi
>> + *
>> + * Returns true if the VSI has XDP enabled.
>> + **/
>> +static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
>> +{
>> +       return vsi->xdp_prog;
>> +}
> 
> It might be better to have this return !!vsi->xdp_prog.  That way it
> is explicitly casting the return value as a boolean value.
> 
>>  #endif /* _I40E_H_ */
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> index cc1465aac2ef..831bbc208fc8 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> @@ -1254,6 +1254,9 @@ static int i40e_set_ringparam(struct net_device *netdev,
>>         if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>>                 return -EINVAL;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi))
>> +               return -EINVAL;
>> +
>>         if (ring->tx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>>             ring->tx_pending < I40E_MIN_NUM_DESCRIPTORS ||
>>             ring->rx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index da4cbe32eb86..b247c3231170 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -24,6 +24,7 @@
>>   *
>>   ******************************************************************************/
>>
>> +#include <linux/bpf.h>
>>  #include <linux/etherdevice.h>
>>  #include <linux/of_net.h>
>>  #include <linux/pci.h>
>> @@ -2431,6 +2432,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>>         struct i40e_netdev_priv *np = netdev_priv(netdev);
>>         struct i40e_vsi *vsi = np->vsi;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +               if (max_frame > I40E_RXBUFFER_2048)
>> +                       return -EINVAL;
>> +       }
>> +
> 
> I suppose this works for now, but we should probably be using a value
> smaller than 2048.  Most likely we are going to need to restrict this
> down to about 1.5K or something in that neighborhood.  I am already
> looking at changes that clean most of this up to prep it for the DMA
> and page count fixes.  To resolve the jumbo frames case we are
> probably looking at using a 3K buffer size with an order 1 page on
> x86.  That will give us enough room for adding headers and/or any
> trailers needed to the frame.
> 

Also looks like we are trying to standardize on 256B of headroom for
headers. So when we add the adjust_head support this will need another
256B addition here. And we will have to do an offset on the DMA region.

I thought Alex was maybe trying to clean some of this up so it would
be easier to do this. I vaguely remember something about SKB_PAD.
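
Roughly, once adjust_head support lands, the buffer setup would
become something like this (constant name made up, untested):

	#define I40E_XDP_HEADROOM 256

	xdp.data_hard_start = page_address(rx_buffer->page) +
			      rx_buffer->page_offset;
	xdp.data = xdp.data_hard_start + I40E_XDP_HEADROOM;
	xdp.data_end = xdp.data + size;

and the frame size checks would have to grow by the same 256 bytes.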

>>         netdev_info(netdev, "changing MTU from %d to %d\n",
>>                     netdev->mtu, new_mtu);
>>         netdev->mtu = new_mtu;
>> @@ -3085,6 +3093,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>>         writel(0, ring->tail);
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               struct bpf_prog *prog;
>> +
>> +               prog = bpf_prog_add(vsi->xdp_prog, 1);
>> +               if (IS_ERR(prog))
>> +                       return PTR_ERR(prog);
>> +               ring->xdp_prog = prog;
>> +       }
>> +
>>         i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>>
>>         return 0;
>> @@ -9234,6 +9251,65 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>>  }
>>
>> +/**
>> + * i40e_xdp_setup - Add/remove an XDP program to a VSI
>> + * @vsi: the VSI to add the program
>> + * @prog: the XDP program
>> + **/
>> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
>> +                         struct bpf_prog *prog)
>> +{
>> +       struct i40e_pf *pf = vsi->back;
>> +       struct net_device *netdev = vsi->netdev;
>> +       int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +       if (prog && prog->xdp_adjust_head)

I think we want a follow-on patch to enable adjust_head next. It would
be best to get this in the same kernel version but not necessarily in
the same series.

>> +               return -EOPNOTSUPP;
>> +
>> +       if (frame_size > I40E_RXBUFFER_2048)
>> +               return -EINVAL;
>> +
>> +       if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
>> +               return -EINVAL;
>> +
> 
> Why would MSI-X matter?  Is this because you are trying to reserve
> extra Tx rings for the XDP program?
> 
> If so we might want to update i40e to make it more ixgbe like since it
> can in theory support multiple queues with a single vector.  It might
> be useful to add comments on why each of these criteria must be met
> when you perform the checks.
> 
>> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
>> +               return 0;
>> +
>> +       i40e_prep_for_reset(pf);
>> +
>> +       if (vsi->xdp_prog)
>> +               bpf_prog_put(vsi->xdp_prog);
>> +       vsi->xdp_prog = prog;
>> +

So I worry a bit about correctness here. Patch 3 likely fixes this but
between patch 1 and 3 it seems like there could be some race here.

This makes me think patch 3 should be merged here.

>> +       i40e_reset_and_rebuild(pf, true);
>> +       return 0;
>> +}
>> +
>> +/**
>> + * i40e_xdp - NDO for enabled/query
>> + * @dev: the netdev
>> + * @xdp: XDP program
>> + **/
>> +static int i40e_xdp(struct net_device *dev,
>> +                   struct netdev_xdp *xdp)
>> +{
>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>> +       struct i40e_vsi *vsi = np->vsi;
>> +
>> +       if (vsi->type != I40E_VSI_MAIN)
>> +               return -EINVAL;
>> +
>> +       switch (xdp->command) {
>> +       case XDP_SETUP_PROG:
>> +               return i40e_xdp_setup(vsi, xdp->prog);
>> +       case XDP_QUERY_PROG:
>> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>> +               return 0;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +}
>> +
>>  static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_open               = i40e_open,
>>         .ndo_stop               = i40e_close,
>> @@ -9270,6 +9346,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_features_check     = i40e_features_check,
>>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
>> +       .ndo_xdp                = i40e_xdp,
>>  };
>>
>>  /**
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index 352cf7cd2ef4..d835a51dafa6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -24,6 +24,7 @@
>>   *
>>   ******************************************************************************/
>>
>> +#include <linux/bpf.h>
>>  #include <linux/prefetch.h>
>>  #include <net/busy_poll.h>
>>  #include "i40e.h"
>> @@ -1040,6 +1041,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>>         rx_ring->next_to_alloc = 0;
>>         rx_ring->next_to_clean = 0;
>>         rx_ring->next_to_use = 0;
>> +
>> +       if (rx_ring->xdp_prog) {
>> +               bpf_prog_put(rx_ring->xdp_prog);
>> +               rx_ring->xdp_prog = NULL;
>> +       }
>>  }
>>
>>  /**
>> @@ -1600,30 +1606,104 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
>>  }
>>
>>  /**
>> + * i40e_run_xdp - Runs an XDP program for an Rx ring
>> + * @rx_ring: Rx ring used for XDP
>> + * @rx_buffer: current Rx buffer
>> + * @rx_desc: current Rx descriptor
>> + * @xdp_prog: the XDP program to run
>> + *
>> + * Returns true if the XDP program consumed the incoming frame. False
>> + * means pass the frame to the good old stack.
>> + **/
>> +static bool i40e_run_xdp(struct i40e_ring *rx_ring,
>> +                        struct i40e_rx_buffer *rx_buffer,
>> +                        union i40e_rx_desc *rx_desc,
>> +                        struct bpf_prog *xdp_prog)
>> +{
>> +       u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
>> +       unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
>> +                           I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
>> +       struct xdp_buff xdp;
>> +       u32 xdp_action;
>> +
>> +       WARN_ON(!i40e_test_staterr(rx_desc,
>> +                                  BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
>> +
> 
> We can't just do a WARN_ON if we don't have the end of the frame.
> Also it probably doesn't even matter unless we are looking at the
> first frame.  I would argue that this might be better as a BUG_ON
> since it should not happen.
> 

Or if it's really just a catch-all for something that shouldn't
happen, go ahead and drop it.

>> +       xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset;
>> +       xdp.data_end = xdp.data + size;
>> +       xdp_action = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +
>> +       switch (xdp_action) {
>> +       case XDP_PASS:
>> +               return false;
>> +       default:
>> +               bpf_warn_invalid_xdp_action(xdp_action);
>> +       case XDP_ABORTED:
>> +       case XDP_TX:
>> +       case XDP_DROP:
>> +               if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
>> +                       i40e_reuse_rx_page(rx_ring, rx_buffer);
>> +                       rx_ring->rx_stats.page_reuse_count++;
>> +                       break;
>> +               }
>> +
>> +               /* we are not reusing the buffer so unmap it */
>> +               dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
>> +                              DMA_FROM_DEVICE);
>> +               __free_pages(rx_buffer->page, 0);
>> +       }
>> +
>> +       /* clear contents of buffer_info */
>> +       rx_buffer->page = NULL;
>> +       return true; /* Swallowed by XDP */
>> +}
>> +
>> +/**
>>   * i40e_fetch_rx_buffer - Allocate skb and populate it
>>   * @rx_ring: rx descriptor ring to transact packets on
>>   * @rx_desc: descriptor containing info written by hardware
>> + * @skb: The allocated skb, if any
>>   *
>> - * This function allocates an skb on the fly, and populates it with the page
>> - * data from the current receive descriptor, taking care to set up the skb
>> - * correctly, as well as handling calling the page recycle function if
>> - * necessary.
>> + * Unless XDP is enabled, this function allocates an skb on the fly,
>> + * and populates it with the page data from the current receive
>> + * descriptor, taking care to set up the skb correctly, as well as
>> + * handling calling the page recycle function if necessary.
>> + *
>> + * If the received frame was handled by XDP, true is
>> + * returned. Otherwise, the skb is returned to the caller via the skb
>> + * parameter.
>>   */
>>  static inline
>> -struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>> -                                    union i40e_rx_desc *rx_desc)
>> +bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>> +                         union i40e_rx_desc *rx_desc,
>> +                         struct sk_buff **skb)
>>  {
>>         struct i40e_rx_buffer *rx_buffer;
>> -       struct sk_buff *skb;
>>         struct page *page;
>>
>>         rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
>>         page = rx_buffer->page;
>>         prefetchw(page);
>>
>> -       skb = rx_buffer->skb;
> 
> I'm not sure where this line came from.  It was dropped from the
> driver in next-queue a while ago so I don't think your patch applies
> currently.
> 
>> +       /* we are reusing so sync this buffer for CPU use */
>> +       dma_sync_single_range_for_cpu(rx_ring->dev,
>> +                                     rx_buffer->dma,
>> +                                     rx_buffer->page_offset,
>> +                                     I40E_RXBUFFER_2048,
>> +                                     DMA_FROM_DEVICE);
> 
> I would prefer if moving this piece up was handled in a separate
> patch.  It just makes things more readable that way and I have patches
> I will be submitting that will pull this and the few lines ahead of it
> out of the fetch_rx_buffer entirely.
> 
>> +
>> +       if (rx_ring->xdp_prog) {
>> +               bool xdp_consumed;
>> +
>> +               xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
>> +                                           rx_desc, rx_ring->xdp_prog);

So here is the race I commented on above: if we NULL xdp_prog between
the 'if' check and the i40e_run_xdp() call.
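
Reading the pointer only once at least narrows that window (sketch,
untested):

	struct bpf_prog *xdp_prog = READ_ONCE(rx_ring->xdp_prog);

	if (xdp_prog && i40e_run_xdp(rx_ring, rx_buffer, rx_desc, xdp_prog))
		return true;

but the real fix is the proper RCU handling mentioned above.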


>> +               if (xdp_consumed)
>> +                       return true;
>> +       }
>>
>> -       if (likely(!skb)) {
>> +       *skb = rx_buffer->skb;
>> +
>> +       if (likely(!*skb)) {
> 
> Instead of making skb a double pointer we can just change a few things
> to make this all simpler.  For example one thing we could look at
> doing is taking an hlist_nulls type approach where we signal with a
> value of 1 to indicate that XDP stole the buffer so there is no skb to
> assign.  Then you just have to clean it up in i40e_is_non_eop by doing
> a "&" and continue from there, or could add your own block after
> i40e_is_non_eop.
> 
> For that matter we could probably consolidate a few things into
> i40e_cleanup_headers and have it handled there.
> 
>>                 void *page_addr = page_address(page) + rx_buffer->page_offset;
>>
>>                 /* prefetch first cache line of first page */
>> @@ -1633,32 +1713,25 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>  #endif
>>
>>                 /* allocate a skb to store the frags */
>> -               skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>> -                                      I40E_RX_HDR_SIZE,
>> -                                      GFP_ATOMIC | __GFP_NOWARN);
>> -               if (unlikely(!skb)) {
>> +               *skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>> +                                       I40E_RX_HDR_SIZE,
>> +                                       GFP_ATOMIC | __GFP_NOWARN);
>> +               if (unlikely(!*skb)) {
>>                         rx_ring->rx_stats.alloc_buff_failed++;
>> -                       return NULL;
>> +                       return false;
>>                 }
>>
>>                 /* we will be copying header into skb->data in
>>                  * pskb_may_pull so it is in our interest to prefetch
>>                  * it now to avoid a possible cache miss
>>                  */
>> -               prefetchw(skb->data);
>> +               prefetchw((*skb)->data);
>>         } else {
>>                 rx_buffer->skb = NULL;
>>         }
>>
>> -       /* we are reusing so sync this buffer for CPU use */
>> -       dma_sync_single_range_for_cpu(rx_ring->dev,
>> -                                     rx_buffer->dma,
>> -                                     rx_buffer->page_offset,
>> -                                     I40E_RXBUFFER_2048,
>> -                                     DMA_FROM_DEVICE);
>> -
>>         /* pull page into skb */
>> -       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
>> +       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, *skb)) {
>>                 /* hand second half of page back to the ring */
>>                 i40e_reuse_rx_page(rx_ring, rx_buffer);
>>                 rx_ring->rx_stats.page_reuse_count++;
>> @@ -1671,7 +1744,7 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>         /* clear contents of buffer_info */
>>         rx_buffer->page = NULL;
>>
>> -       return skb;
>> +       return false;
>>  }
>>
>>  /**
>> @@ -1716,6 +1789,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>>  }
>>
>>  /**
>> + * i40e_update_rx_next_to_clean - Bumps the next-to-clean for an Rx ring
>> + * @rx_ring: Rx ring to bump
>> + **/
>> +static void i40e_update_rx_next_to_clean(struct i40e_ring *rx_ring)
>> +{
>> +       u32 ntc = rx_ring->next_to_clean + 1;
>> +
>> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
>> +       rx_ring->next_to_clean = ntc;
>> +
>> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
>> +}
>> +
>> +/**
>>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @rx_ring: rx descriptor ring to transact packets on
>>   * @budget: Total limit on number of packets to process
>> @@ -1739,6 +1826,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                 u16 vlan_tag;
>>                 u8 rx_ptype;
>>                 u64 qword;
>> +               bool xdp_consumed;
>>
>>                 /* return some buffers to hardware, one at a time is too slow */
>>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
>> @@ -1764,7 +1852,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                  */
>>                 dma_rmb();
>>
>> -               skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
>> +               xdp_consumed = i40e_fetch_rx_buffer(rx_ring, rx_desc, &skb);
>> +               if (xdp_consumed) {
>> +                       cleaned_count++;
>> +
>> +                       i40e_update_rx_next_to_clean(rx_ring);
>> +                       total_rx_packets++;
>> +                       continue;
>> +               }
>> +
> 
> There are a few bits here that aren't quite handled correctly.  It
> looks like we are getting the total_rx_packets, but you didn't update
> total_rx_bytes.
> 
>>                 if (!skb)
>>                         break;
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index e065321ce8ed..957d856a82c4 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -341,6 +341,8 @@ struct i40e_ring {
>>
>>         struct rcu_head rcu;            /* to avoid race on free */
>>         u16 next_to_alloc;
>> +
>> +       struct bpf_prog *xdp_prog;
>>  } ____cacheline_internodealigned_in_smp;
>>
> 
> You might be better off moving this further up in the structure.
> Maybe into the slot after *tail.  Otherwise it is going to be in a
> pretty noisy cache line, although it looks like nobody ever really
> bothered to optimize the i40e_ring structure so that you had
> read-mostly and write-mostly cache lines anyway, so maybe you don't
> need to bother.

Maybe another patch to clean up the structures.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Intel-wired-lan] [PATCH v2 2/3] i40e: Add XDP_TX support
  2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 2/3] i40e: Add XDP_TX support Björn Töpel
@ 2016-12-09 18:06   ` John Fastabend
  0 siblings, 0 replies; 13+ messages in thread
From: John Fastabend @ 2016-12-09 18:06 UTC (permalink / raw)
  To: intel-wired-lan

On 16-12-09 12:22 AM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch adds proper XDP_TX support.
> 
> Acked-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e.h      |   5 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 273 ++++++++++++++++++++-----
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 303 +++++++++++++++++++++++-----
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |   5 +
>  4 files changed, 490 insertions(+), 96 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 05d805f439e6..adc1f3f32729 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -545,6 +545,10 @@ struct i40e_vsi {
>  	struct i40e_ring **rx_rings;
>  	struct i40e_ring **tx_rings;
>  
> +	/* The XDP rings are Tx only, and follow the count of the
> +	 * regular rings, i.e. alloc_queue_pairs/num_queue_pairs
> +	 */
> +	struct i40e_ring **xdp_rings;

Hmm, not really a complaint about your patch, but these _rings are
using something like RCU (kfree_rcu is being called), yet I don't see
any rcu_dereference_bh() usage or rcu_assign_pointer() :/


I know you just copied the existing pattern. Maybe an i40e dev can
comment.
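
For reference, the pattern I would have expected is roughly
(untested):

	/* control path: publish the ring */
	rcu_assign_pointer(vsi->xdp_rings[i], ring);

	/* datapath, under rcu_read_lock_bh() */
	xdp_ring = rcu_dereference_bh(vsi->xdp_rings[i]);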

Thanks,
John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09 17:15   ` Alexander Duyck
  2016-12-09 17:54     ` John Fastabend
@ 2016-12-12  8:08     ` Björn Töpel
  1 sibling, 0 replies; 13+ messages in thread
From: Björn Töpel @ 2016-12-12  8:08 UTC (permalink / raw)
  To: intel-wired-lan

Firstly, thanks for the detailed review! I'll only address Alex's
comments here. John's will be answered separately.

2016-12-09 18:15 GMT+01:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Fri, Dec 9, 2016 at 12:22 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This commit adds basic XDP support for i40e derived NICs. All XDP
>> actions will end up in XDP_DROP.
>>
>> Only the default/main VSI has support for enabling XDP.
>>
>> Acked-by: John Fastabend <john.r.fastabend@intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e.h         |  13 +++
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  77 +++++++++++++
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 146 ++++++++++++++++++++-----
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +
>>  5 files changed, 216 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>> index ba8d30984bee..05d805f439e6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -545,6 +545,8 @@ struct i40e_vsi {
>>         struct i40e_ring **rx_rings;
>>         struct i40e_ring **tx_rings;
>>
>> +       struct bpf_prog *xdp_prog;
>> +
>>         u32  active_filters;
>>         u32  promisc_threshold;
>>
>> @@ -904,4 +906,15 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
>> +
>> +/**
>> + * i40e_enabled_xdp_vsi - Check if VSI has XDP enabled
>> + * @vsi: pointer to a vsi
>> + *
>> + * Returns true if the VSI has XDP enabled.
>> + **/
>> +static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
>> +{
>> +       return vsi->xdp_prog;
>> +}
>
> It might be better to have this return !!vsi->xdp_prog.  That way it
> is explicitly casting the return value as a boolean value.

Yup, will fix in v3. It also makes the return statement more
readable.

>
>>  #endif /* _I40E_H_ */
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> index cc1465aac2ef..831bbc208fc8 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>> @@ -1254,6 +1254,9 @@ static int i40e_set_ringparam(struct net_device *netdev,
>>         if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>>                 return -EINVAL;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi))
>> +               return -EINVAL;
>> +
>>         if (ring->tx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>>             ring->tx_pending < I40E_MIN_NUM_DESCRIPTORS ||
>>             ring->rx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index da4cbe32eb86..b247c3231170 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -24,6 +24,7 @@
>>   *
>>   ******************************************************************************/
>>
>> +#include <linux/bpf.h>
>>  #include <linux/etherdevice.h>
>>  #include <linux/of_net.h>
>>  #include <linux/pci.h>
>> @@ -2431,6 +2432,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>>         struct i40e_netdev_priv *np = netdev_priv(netdev);
>>         struct i40e_vsi *vsi = np->vsi;
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +               if (max_frame > I40E_RXBUFFER_2048)
>> +                       return -EINVAL;
>> +       }
>> +
>
> I suppose this works for now, but we should probably be using a value
> smaller than 2048.  Most likely we are going to need to restrict this
> down to about 1.5K or something in that neighborhood.  I am already
> looking at changes that clean most of this up to prep it for the DMA
> and page count fixes.  To resolve the jumbo frames case we are
> probably looking at using a 3K buffer size with an order 1 page on
> x86.  That will give us enough room for adding headers and/or any
> trailers needed to the frame.

Ok! I'll keep this for now, and adapt it to your upcoming changes
later.

>>         netdev_info(netdev, "changing MTU from %d to %d\n",
>>                     netdev->mtu, new_mtu);
>>         netdev->mtu = new_mtu;
>> @@ -3085,6 +3093,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>>         writel(0, ring->tail);
>>
>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>> +               struct bpf_prog *prog;
>> +
>> +               prog = bpf_prog_add(vsi->xdp_prog, 1);
>> +               if (IS_ERR(prog))
>> +                       return PTR_ERR(prog);
>> +               ring->xdp_prog = prog;
>> +       }
>> +
>>         i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>>
>>         return 0;
>> @@ -9234,6 +9251,65 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>>  }
>>
>> +/**
>> + * i40e_xdp_setup - Add/remove an XDP program to a VSI
>> + * @vsi: the VSI to add the program
>> + * @prog: the XDP program
>> + **/
>> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
>> +                         struct bpf_prog *prog)
>> +{
>> +       struct i40e_pf *pf = vsi->back;
>> +       struct net_device *netdev = vsi->netdev;
>> +       int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> +
>> +       if (prog && prog->xdp_adjust_head)
>> +               return -EOPNOTSUPP;
>> +
>> +       if (frame_size > I40E_RXBUFFER_2048)
>> +               return -EINVAL;
>> +
>> +       if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
>> +               return -EINVAL;
>> +
>
> Why would MSI-X matter?  Is this because you are trying to reserve
> extra Tx rings for the XDP program?
>
> If so we might want to update i40e to make it more ixgbe like since it
> can in theory support multiple queues with a single vector.  It might
> be useful to add comments on why each of these criteria must be met
> when you perform the checks.

You're right, MSI-X *doesn't* matter! Sloppiness from my side. i40e
already supports multiple queues per vector, so that isn't an
issue. I'll remove this check, and add proper vector setup for the
non-MSI-X case.

I'll add comments, and maybe logging, as well.

>
>> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
>> +               return 0;
>> +
>> +       i40e_prep_for_reset(pf);
>> +
>> +       if (vsi->xdp_prog)
>> +               bpf_prog_put(vsi->xdp_prog);
>> +       vsi->xdp_prog = prog;
>> +
>> +       i40e_reset_and_rebuild(pf, true);
>> +       return 0;
>> +}
>> +
>> +/**
>> + * i40e_xdp - NDO for enabled/query
>> + * @dev: the netdev
>> + * @xdp: XDP program
>> + **/
>> +static int i40e_xdp(struct net_device *dev,
>> +                   struct netdev_xdp *xdp)
>> +{
>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>> +       struct i40e_vsi *vsi = np->vsi;
>> +
>> +       if (vsi->type != I40E_VSI_MAIN)
>> +               return -EINVAL;
>> +
>> +       switch (xdp->command) {
>> +       case XDP_SETUP_PROG:
>> +               return i40e_xdp_setup(vsi, xdp->prog);
>> +       case XDP_QUERY_PROG:
>> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>> +               return 0;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +}
>> +
>>  static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_open               = i40e_open,
>>         .ndo_stop               = i40e_close,
>> @@ -9270,6 +9346,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_features_check     = i40e_features_check,
>>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
>> +       .ndo_xdp                = i40e_xdp,
>>  };
>>
>>  /**
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index 352cf7cd2ef4..d835a51dafa6 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -24,6 +24,7 @@
>>   *
>>   ******************************************************************************/
>>
>> +#include <linux/bpf.h>
>>  #include <linux/prefetch.h>
>>  #include <net/busy_poll.h>
>>  #include "i40e.h"
>> @@ -1040,6 +1041,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>>         rx_ring->next_to_alloc = 0;
>>         rx_ring->next_to_clean = 0;
>>         rx_ring->next_to_use = 0;
>> +
>> +       if (rx_ring->xdp_prog) {
>> +               bpf_prog_put(rx_ring->xdp_prog);
>> +               rx_ring->xdp_prog = NULL;
>> +       }
>>  }
>>
>>  /**
>> @@ -1600,30 +1606,104 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
>>  }
>>
>>  /**
>> + * i40e_run_xdp - Runs an XDP program for an Rx ring
>> + * @rx_ring: Rx ring used for XDP
>> + * @rx_buffer: current Rx buffer
>> + * @rx_desc: current Rx descriptor
>> + * @xdp_prog: the XDP program to run
>> + *
>> + * Returns true if the XDP program consumed the incoming frame. False
>> + * means pass the frame to the good old stack.
>> + **/
>> +static bool i40e_run_xdp(struct i40e_ring *rx_ring,
>> +                        struct i40e_rx_buffer *rx_buffer,
>> +                        union i40e_rx_desc *rx_desc,
>> +                        struct bpf_prog *xdp_prog)
>> +{
>> +       u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
>> +       unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
>> +                           I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
>> +       struct xdp_buff xdp;
>> +       u32 xdp_action;
>> +
>> +       WARN_ON(!i40e_test_staterr(rx_desc,
>> +                                  BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
>> +
>
> We can't just do a WARN_ON if we don't have the end of the frame.
> Also it probably doesn't even matter unless we are looking at the
> first frame.  I would argue that this might be better as a BUG_ON
> since it should not happen.

Hmm, BUG_ON is a bit scary... Still, I agree with you. I'll change it
to BUG_ON in v3.

>
>> +       xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset;
>> +       xdp.data_end = xdp.data + size;
>> +       xdp_action = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +
>> +       switch (xdp_action) {
>> +       case XDP_PASS:
>> +               return false;
>> +       default:
>> +               bpf_warn_invalid_xdp_action(xdp_action);
>> +       case XDP_ABORTED:
>> +       case XDP_TX:
>> +       case XDP_DROP:
>> +               if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
>> +                       i40e_reuse_rx_page(rx_ring, rx_buffer);
>> +                       rx_ring->rx_stats.page_reuse_count++;
>> +                       break;
>> +               }
>> +
>> +               /* we are not reusing the buffer so unmap it */
>> +               dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
>> +                              DMA_FROM_DEVICE);
>> +               __free_pages(rx_buffer->page, 0);
>> +       }
>> +
>> +       /* clear contents of buffer_info */
>> +       rx_buffer->page = NULL;
>> +       return true; /* Swallowed by XDP */
>> +}
>> +
>> +/**
>>   * i40e_fetch_rx_buffer - Allocate skb and populate it
>>   * @rx_ring: rx descriptor ring to transact packets on
>>   * @rx_desc: descriptor containing info written by hardware
>> + * @skb: The allocated skb, if any
>>   *
>> - * This function allocates an skb on the fly, and populates it with the page
>> - * data from the current receive descriptor, taking care to set up the skb
>> - * correctly, as well as handling calling the page recycle function if
>> - * necessary.
>> + * Unless XDP is enabled, this function allocates an skb on the fly,
>> + * and populates it with the page data from the current receive
>> + * descriptor, taking care to set up the skb correctly, as well as
>> + * handling calling the page recycle function if necessary.
>> + *
>> + * If the received frame was handled by XDP, true is
>> + * returned. Otherwise, the skb is returned to the caller via the skb
>> + * parameter.
>>   */
>>  static inline
>> -struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>> -                                    union i40e_rx_desc *rx_desc)
>> +bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>> +                         union i40e_rx_desc *rx_desc,
>> +                         struct sk_buff **skb)
>>  {
>>         struct i40e_rx_buffer *rx_buffer;
>> -       struct sk_buff *skb;
>>         struct page *page;
>>
>>         rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
>>         page = rx_buffer->page;
>>         prefetchw(page);
>>
>> -       skb = rx_buffer->skb;
>
> I'm not sure where this line came from.  It was dropped from the
> driver in next-queue a while ago so I don't think your patch applies
> currently.

Weird. I'll look into that and make sure the next version applies cleanly!

>
>> +       /* we are reusing so sync this buffer for CPU use */
>> +       dma_sync_single_range_for_cpu(rx_ring->dev,
>> +                                     rx_buffer->dma,
>> +                                     rx_buffer->page_offset,
>> +                                     I40E_RXBUFFER_2048,
>> +                                     DMA_FROM_DEVICE);
>
> I would prefer if moving this piece up was handled in a separate
> patch.  It just makes things more readable that way and I have patches
> I will be submitting that will pull this and the few lines ahead of it
> out of the fetch_rx_buffer entirely.

I'll put this hunk in a separate patch.

>
>> +
>> +       if (rx_ring->xdp_prog) {
>> +               bool xdp_consumed;
>> +
>> +               xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
>> +                                           rx_desc, rx_ring->xdp_prog);
>> +               if (xdp_consumed)
>> +                       return true;
>> +       }
>>
>> -       if (likely(!skb)) {
>> +       *skb = rx_buffer->skb;
>> +
>> +       if (likely(!*skb)) {
>
> Instead of making skb a double pointer we can just change a few things
> to make this all simpler.  For example one thing we could look at
> doing is taking an hlist_nulls type approach where we signal with a
> value of 1 to indicate that XDP stole the buffer so there is no skb to
> assign.  Then you just have to clean it up in i40e_is_non_eop by doing
> a "&" and continue from there, or could add your own block after
> i40e_is_non_eop.
>
> For that matter we could probably consolidate a few things into
> i40e_cleanup_headers and have it handled there.

I'll clean this up!
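
Something along these lines, perhaps (completely untested, and the
sentinel is a placeholder of mine, not anything that exists today):

/* An odd pointer value can never be a real skb, so use bit 0 as a
 * "consumed by XDP" marker, hlist_nulls style.
 */
#define I40E_XDP_CONSUMED ((struct sk_buff *)1UL)

        skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
        if ((unsigned long)skb & 1UL) {
                /* XDP stole the buffer; no skb to hand to the stack */
                cleaned_count++;
                i40e_update_rx_next_to_clean(rx_ring);
                continue;
        }

i40e_fetch_rx_buffer() would then keep its single-pointer return and
simply return I40E_XDP_CONSUMED when XDP swallowed the frame.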

>
>>                 void *page_addr = page_address(page) + rx_buffer->page_offset;
>>
>>                 /* prefetch first cache line of first page */
>> @@ -1633,32 +1713,25 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>  #endif
>>
>>                 /* allocate a skb to store the frags */
>> -               skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>> -                                      I40E_RX_HDR_SIZE,
>> -                                      GFP_ATOMIC | __GFP_NOWARN);
>> -               if (unlikely(!skb)) {
>> +               *skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>> +                                       I40E_RX_HDR_SIZE,
>> +                                       GFP_ATOMIC | __GFP_NOWARN);
>> +               if (unlikely(!*skb)) {
>>                         rx_ring->rx_stats.alloc_buff_failed++;
>> -                       return NULL;
>> +                       return false;
>>                 }
>>
>>                 /* we will be copying header into skb->data in
>>                  * pskb_may_pull so it is in our interest to prefetch
>>                  * it now to avoid a possible cache miss
>>                  */
>> -               prefetchw(skb->data);
>> +               prefetchw((*skb)->data);
>>         } else {
>>                 rx_buffer->skb = NULL;
>>         }
>>
>> -       /* we are reusing so sync this buffer for CPU use */
>> -       dma_sync_single_range_for_cpu(rx_ring->dev,
>> -                                     rx_buffer->dma,
>> -                                     rx_buffer->page_offset,
>> -                                     I40E_RXBUFFER_2048,
>> -                                     DMA_FROM_DEVICE);
>> -
>>         /* pull page into skb */
>> -       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
>> +       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, *skb)) {
>>                 /* hand second half of page back to the ring */
>>                 i40e_reuse_rx_page(rx_ring, rx_buffer);
>>                 rx_ring->rx_stats.page_reuse_count++;
>> @@ -1671,7 +1744,7 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>         /* clear contents of buffer_info */
>>         rx_buffer->page = NULL;
>>
>> -       return skb;
>> +       return false;
>>  }
>>
>>  /**
>> @@ -1716,6 +1789,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>>  }
>>
>>  /**
>> + * i40e_update_rx_next_to_clean - Bumps the next-to-clean for an Rx ring
>> + * @rx_ring: Rx ring to bump
>> + **/
>> +static void i40e_update_rx_next_to_clean(struct i40e_ring *rx_ring)
>> +{
>> +       u32 ntc = rx_ring->next_to_clean + 1;
>> +
>> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
>> +       rx_ring->next_to_clean = ntc;
>> +
>> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
>> +}
>> +
>> +/**
>>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @rx_ring: rx descriptor ring to transact packets on
>>   * @budget: Total limit on number of packets to process
>> @@ -1739,6 +1826,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                 u16 vlan_tag;
>>                 u8 rx_ptype;
>>                 u64 qword;
>> +               bool xdp_consumed;
>>
>>                 /* return some buffers to hardware, one at a time is too slow */
>>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
>> @@ -1764,7 +1852,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>                  */
>>                 dma_rmb();
>>
>> -               skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
>> +               xdp_consumed = i40e_fetch_rx_buffer(rx_ring, rx_desc, &skb);
>> +               if (xdp_consumed) {
>> +                       cleaned_count++;
>> +
>> +                       i40e_update_rx_next_to_clean(rx_ring);
>> +                       total_rx_packets++;
>> +                       continue;
>> +               }
>> +
>
> There are a few bits here that aren't quite handled correctly.  It
> looks like we are getting the total_rx_packets, but you didn't update
> total_rx_bytes.

Ah, good catch! I'll make sure to bump the Rx bytes.
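
Roughly, in i40e_clean_rx_irq() (sketch; the length is pulled out of
the descriptor the same way i40e_run_xdp() already does it):

                if (xdp_consumed) {
                        u64 qw = le64_to_cpu(
                                rx_desc->wb.qword1.status_error_len);

                        total_rx_bytes +=
                                (qw & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
                                I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
                        total_rx_packets++;
                        cleaned_count++;
                        i40e_update_rx_next_to_clean(rx_ring);
                        continue;
                }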

>
>>                 if (!skb)
>>                         break;
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index e065321ce8ed..957d856a82c4 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -341,6 +341,8 @@ struct i40e_ring {
>>
>>         struct rcu_head rcu;            /* to avoid race on free */
>>         u16 next_to_alloc;
>> +
>> +       struct bpf_prog *xdp_prog;
>>  } ____cacheline_internodealigned_in_smp;
>>
>
> You might be better off moving this further up in the structure.
> Maybe into the slot after *tail.  Otherwise it is going to be in a
> pretty noisy cache line, although it looks like nobody ever really
> bothered to optimize the i40e_ring structure so that you had read
> mostly and write mostly cache lines anyway so maybe you don't need to
> bother.

I'll try moving it up, and see if I get more pps. If not, I'll leave
the reordering out for a future i40e_ring cleanup.
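
If it does help, the move itself is trivial (sketch of the i40e_ring
layout; the comment is mine):

        /* fields hot in the Rx fast path */
        u8 __iomem *tail;
        struct bpf_prog *xdp_prog;      /* read-mostly once attached */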

Finally, I'll make sure the next patch set is based on Jeff's dev-queue
branch.


Björn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support for XDP
  2016-12-09 17:54     ` John Fastabend
@ 2016-12-12  8:25       ` Björn Töpel
  0 siblings, 0 replies; 13+ messages in thread
From: Björn Töpel @ 2016-12-12  8:25 UTC (permalink / raw)
  To: intel-wired-lan

2016-12-09 18:54 GMT+01:00 John Fastabend <john.fastabend@gmail.com>:
> On 16-12-09 09:15 AM, Alexander Duyck wrote:
>> On Fri, Dec 9, 2016 at 12:22 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> This commit adds basic XDP support for i40e derived NICs. All XDP
>>> actions will end up in XDP_DROP.
>>>
>>> Only the default/main VSI has support for enabling XDP.
>>>
>>> Acked-by: John Fastabend <john.r.fastabend@intel.com>
>>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>>> ---
>>>  drivers/net/ethernet/intel/i40e/i40e.h         |  13 +++
>>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +
>>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  77 +++++++++++++
>>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 146 ++++++++++++++++++++-----
>>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +
>>>  5 files changed, 216 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>>> index ba8d30984bee..05d805f439e6 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>>> @@ -545,6 +545,8 @@ struct i40e_vsi {
>>>         struct i40e_ring **rx_rings;
>>>         struct i40e_ring **tx_rings;
>>>
>>> +       struct bpf_prog *xdp_prog;
>>> +
>
> This is being protected by rcu right at least in patch 3 this seems to
> be the case? We should add __rcu annotation here,
>
>   struct bpf_prog __rcu *xdp_prog
>
> Then also run
>
>   make C=2 ./drivers/net/ethernet/intel/i40e/
>
> This should help catch any rcu errors. Oh you also need the PROVE_RCU
> config option set in .config
>
> I wonder if patch 3 could be squashed with patch 1 here? Do you think
> that this patch is easier to read without it.

Good point. The non-RCU stuff in patch 1 doesn't really add anything
to the story, so I'll squash patch 3 into patch 1. And I'll run it
with lockdep and make sure it's sparse clean.
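
For v3 the annotated field and a reader would then look roughly like
this (sketch; the exact shape depends on how the squash turns out):

        /* i40e.h */
        struct bpf_prog __rcu *xdp_prog;

        /* Rx path: pick the program up once per frame. NAPI poll
         * already runs inside an RCU read-side critical section,
         * but being explicit keeps sparse and lockdep happy.
         */
        struct bpf_prog *prog;

        rcu_read_lock();
        prog = rcu_dereference(vsi->xdp_prog);
        if (prog)
                xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
                                            rx_desc, prog);
        rcu_read_unlock();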

>>>         u32  active_filters;
>>>         u32  promisc_threshold;
>>>
>>> @@ -904,4 +906,15 @@ i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
>>>  i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
>>>  i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
>>>  void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
>>> +
>>> +/**
>>> + * i40e_enabled_xdp_vsi - Check if VSI has XDP enabled
>>> + * @vsi: pointer to a vsi
>>> + *
>>> + * Returns true if the VSI has XDP enabled.
>>> + **/
>>> +static inline bool i40e_enabled_xdp_vsi(const struct i40e_vsi *vsi)
>>> +{
>>> +       return vsi->xdp_prog;
>>> +}
>>
>> It might be better to have this return !!vsi->xdp_prog.  That way it
>> is explicitly casting the return value as a boolean value.
>>
>>>  #endif /* _I40E_H_ */
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>>> index cc1465aac2ef..831bbc208fc8 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
>>> @@ -1254,6 +1254,9 @@ static int i40e_set_ringparam(struct net_device *netdev,
>>>         if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
>>>                 return -EINVAL;
>>>
>>> +       if (i40e_enabled_xdp_vsi(vsi))
>>> +               return -EINVAL;
>>> +
>>>         if (ring->tx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>>>             ring->tx_pending < I40E_MIN_NUM_DESCRIPTORS ||
>>>             ring->rx_pending > I40E_MAX_NUM_DESCRIPTORS ||
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> index da4cbe32eb86..b247c3231170 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> @@ -24,6 +24,7 @@
>>>   *
>>>   ******************************************************************************/
>>>
>>> +#include <linux/bpf.h>
>>>  #include <linux/etherdevice.h>
>>>  #include <linux/of_net.h>
>>>  #include <linux/pci.h>
>>> @@ -2431,6 +2432,13 @@ static int i40e_change_mtu(struct net_device *netdev, int new_mtu)
>>>         struct i40e_netdev_priv *np = netdev_priv(netdev);
>>>         struct i40e_vsi *vsi = np->vsi;
>>>
>>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>>> +               int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>>> +
>>> +               if (max_frame > I40E_RXBUFFER_2048)
>>> +                       return -EINVAL;
>>> +       }
>>> +
>>
>> I suppose this works for now, but we should probably be using a value
>> smaller than 2048.  Most likely we are going to need to restrict this
>> down to about 1.5K or something in that neighborhood.  I am already
>> looking at changes that clean most of this up to prep it for the DMA
>> and page count fixes.  To resolve the jumbo frames case we are
>> probably looking at using a 3K buffer size with an order 1 page on
>> x86.  That will give us enough room for adding headers and/or any
>> trailers needed to the frame.
>>
>
> Also looks like we are trying to standardize on 256B of headroom for
> headers. So when we add the adjust_head support this will need another
> 256B addition here. And we will have to do an offset on the DMA region.
>
> I thought Alex was maybe trying to clean some of this up so it would
> be easier to do this. I vaguely remember something about SKB_PAD.

Good points. As I wrote in my reply to Alex's comments, I'll leave
the current code as-is until Alex's upcoming changes land. Does that
sound right?
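
For reference, my reading of the sizing discussion (numbers from this
thread; none of this is in the tree yet):

/* Hypothetical XDP Rx buffer budget on x86:
 *
 *   256 bytes of headroom (for bpf_xdp_adjust_head())
 * + 1514 bytes of frame + 4 (VLAN) + 4 (FCS)
 * + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) of tailroom
 *
 * which no longer fits in half a 4K page, hence the suggested 3K
 * buffers carved out of order-1 pages.
 */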

>>>         netdev_info(netdev, "changing MTU from %d to %d\n",
>>>                     netdev->mtu, new_mtu);
>>>         netdev->mtu = new_mtu;
>>> @@ -3085,6 +3093,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>>>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>>>         writel(0, ring->tail);
>>>
>>> +       if (i40e_enabled_xdp_vsi(vsi)) {
>>> +               struct bpf_prog *prog;
>>> +
>>> +               prog = bpf_prog_add(vsi->xdp_prog, 1);
>>> +               if (IS_ERR(prog))
>>> +                       return PTR_ERR(prog);
>>> +               ring->xdp_prog = prog;
>>> +       }
>>> +
>>>         i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>>>
>>>         return 0;
>>> @@ -9234,6 +9251,65 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
>>>         return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>>>  }
>>>
>>> +/**
>>> + * i40e_xdp_setup - Add/remove an XDP program to a VSI
>>> + * @vsi: the VSI to add the program
>>> + * @prog: the XDP program
>>> + **/
>>> +static int i40e_xdp_setup(struct i40e_vsi *vsi,
>>> +                         struct bpf_prog *prog)
>>> +{
>>> +       struct i40e_pf *pf = vsi->back;
>>> +       struct net_device *netdev = vsi->netdev;
>>> +       int frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>>> +
>>> +       if (prog && prog->xdp_adjust_head)
>
> I think we want a follow on patch to enable adjust head next. It would
> be best to get this in the same kernel version but not necessarily in
> the same series.

Ok! I'll base v3 on Jeff's dev-queue branch. If dev-queue has pulled
in the xdp_adjust_head support, I'll add this check as a separate
patch in the series. If not, I'll leave it out and submit it as a
patch outside the series.

>>> +               return -EOPNOTSUPP;
>>> +
>>> +       if (frame_size > I40E_RXBUFFER_2048)
>>> +               return -EINVAL;
>>> +
>>> +       if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
>>> +               return -EINVAL;
>>> +
>>
>> Why would MSI-X matter?  Is this because you are trying to reserve
>> extra Tx rings for the XDP program?
>>
>> If so we might want to update i40e to make it more ixgbe like since it
>> can in theory support multiple queues with a single vector.  It might
>> be useful to add comments on why each of these criteria must be met
>> when you perform the checks.
>>
>>> +       if (!i40e_enabled_xdp_vsi(vsi) && !prog)
>>> +               return 0;
>>> +
>>> +       i40e_prep_for_reset(pf);
>>> +
>>> +       if (vsi->xdp_prog)
>>> +               bpf_prog_put(vsi->xdp_prog);
>>> +       vsi->xdp_prog = prog;
>>> +
>
> So I worry a bit about correctness here. Patch 3 likely fixes this but
> between patch 1 and 3 it seems like there could be some race here.
>
> This makes me think patch3 should be merged here.

Yeah, let's do that.
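
So the squashed program swap would be along these lines (sketch;
ndo_xdp is called with rtnl held, which is what makes the
rtnl_dereference() legal):

        struct bpf_prog *old_prog;

        old_prog = rtnl_dereference(vsi->xdp_prog);
        rcu_assign_pointer(vsi->xdp_prog, prog);
        synchronize_rcu();      /* no Rx path can still see old_prog */
        if (old_prog)
                bpf_prog_put(old_prog);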

>>> +       i40e_reset_and_rebuild(pf, true);
>>> +       return 0;
>>> +}
>>> +
>>> +/**
>>> + * i40e_xdp - NDO for enabling/querying an XDP program
>>> + * @dev: the netdev
>>> + * @xdp: XDP program
>>> + **/
>>> +static int i40e_xdp(struct net_device *dev,
>>> +                   struct netdev_xdp *xdp)
>>> +{
>>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>>> +       struct i40e_vsi *vsi = np->vsi;
>>> +
>>> +       if (vsi->type != I40E_VSI_MAIN)
>>> +               return -EINVAL;
>>> +
>>> +       switch (xdp->command) {
>>> +       case XDP_SETUP_PROG:
>>> +               return i40e_xdp_setup(vsi, xdp->prog);
>>> +       case XDP_QUERY_PROG:
>>> +               xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>>> +               return 0;
>>> +       default:
>>> +               return -EINVAL;
>>> +       }
>>> +}
>>> +
>>>  static const struct net_device_ops i40e_netdev_ops = {
>>>         .ndo_open               = i40e_open,
>>>         .ndo_stop               = i40e_close,
>>> @@ -9270,6 +9346,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>>>         .ndo_features_check     = i40e_features_check,
>>>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>>>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
>>> +       .ndo_xdp                = i40e_xdp,
>>>  };
>>>
>>>  /**
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>> index 352cf7cd2ef4..d835a51dafa6 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>>> @@ -24,6 +24,7 @@
>>>   *
>>>   ******************************************************************************/
>>>
>>> +#include <linux/bpf.h>
>>>  #include <linux/prefetch.h>
>>>  #include <net/busy_poll.h>
>>>  #include "i40e.h"
>>> @@ -1040,6 +1041,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>>>         rx_ring->next_to_alloc = 0;
>>>         rx_ring->next_to_clean = 0;
>>>         rx_ring->next_to_use = 0;
>>> +
>>> +       if (rx_ring->xdp_prog) {
>>> +               bpf_prog_put(rx_ring->xdp_prog);
>>> +               rx_ring->xdp_prog = NULL;
>>> +       }
>>>  }
>>>
>>>  /**
>>> @@ -1600,30 +1606,104 @@ static bool i40e_add_rx_frag(struct i40e_ring *rx_ring,
>>>  }
>>>
>>>  /**
>>> + * i40e_run_xdp - Runs an XDP program for an Rx ring
>>> + * @rx_ring: Rx ring used for XDP
>>> + * @rx_buffer: current Rx buffer
>>> + * @rx_desc: current Rx descriptor
>>> + * @xdp_prog: the XDP program to run
>>> + *
>>> + * Returns true if the XDP program consumed the incoming frame. False
>>> + * means pass the frame to the good old stack.
>>> + **/
>>> +static bool i40e_run_xdp(struct i40e_ring *rx_ring,
>>> +                        struct i40e_rx_buffer *rx_buffer,
>>> +                        union i40e_rx_desc *rx_desc,
>>> +                        struct bpf_prog *xdp_prog)
>>> +{
>>> +       u64 qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
>>> +       unsigned int size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
>>> +                           I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
>>> +       struct xdp_buff xdp;
>>> +       u32 xdp_action;
>>> +
>>> +       WARN_ON(!i40e_test_staterr(rx_desc,
>>> +                                  BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)));
>>> +
>>
>> We can't just do a WARN_ON if we don't have the end of the frame.
>> Also it probably doesn't even matter unless we are looking at the
>> first frame.  I would argue that this might be better as a BUG_ON
>> since it should not happen.
>>
>
> Or if it's really just a catch-all for something that shouldn't happen
> go ahead and drop it.

Hmm, drop the frame and WARN_ONCE, instead of BUG_ON?
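
I.e. something like (sketch; the label is made up and would jump to
the buffer recycle/free path at the bottom of i40e_run_xdp()):

        if (WARN_ON_ONCE(!i40e_test_staterr(rx_desc,
                                BIT(I40E_RX_DESC_STATUS_EOF_SHIFT))))
                goto xdp_drop;  /* XDP handles single-buffer frames only */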

>>> +       xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset;
>>> +       xdp.data_end = xdp.data + size;
>>> +       xdp_action = bpf_prog_run_xdp(xdp_prog, &xdp);
>>> +
>>> +       switch (xdp_action) {
>>> +       case XDP_PASS:
>>> +               return false;
>>> +       default:
>>> +               bpf_warn_invalid_xdp_action(xdp_action);
>>> +       case XDP_ABORTED:
>>> +       case XDP_TX:
>>> +       case XDP_DROP:
>>> +               if (likely(!i40e_page_is_reserved(rx_buffer->page))) {
>>> +                       i40e_reuse_rx_page(rx_ring, rx_buffer);
>>> +                       rx_ring->rx_stats.page_reuse_count++;
>>> +                       break;
>>> +               }
>>> +
>>> +               /* we are not reusing the buffer so unmap it */
>>> +               dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
>>> +                              DMA_FROM_DEVICE);
>>> +               __free_pages(rx_buffer->page, 0);
>>> +       }
>>> +
>>> +       /* clear contents of buffer_info */
>>> +       rx_buffer->page = NULL;
>>> +       return true; /* Swallowed by XDP */
>>> +}
>>> +
>>> +/**
>>>   * i40e_fetch_rx_buffer - Allocate skb and populate it
>>>   * @rx_ring: rx descriptor ring to transact packets on
>>>   * @rx_desc: descriptor containing info written by hardware
>>> + * @skb: The allocated skb, if any
>>>   *
>>> - * This function allocates an skb on the fly, and populates it with the page
>>> - * data from the current receive descriptor, taking care to set up the skb
>>> - * correctly, as well as handling calling the page recycle function if
>>> - * necessary.
>>> + * Unless XDP is enabled, this function allocates an skb on the fly,
>>> + * and populates it with the page data from the current receive
>>> + * descriptor, taking care to set up the skb correctly, as well as
>>> + * handling calling the page recycle function if necessary.
>>> + *
>>> + * If the received frame was handled by XDP, true is
>>> + * returned. Otherwise, the skb is returned to the caller via the skb
>>> + * parameter.
>>>   */
>>>  static inline
>>> -struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>> -                                    union i40e_rx_desc *rx_desc)
>>> +bool i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>> +                         union i40e_rx_desc *rx_desc,
>>> +                         struct sk_buff **skb)
>>>  {
>>>         struct i40e_rx_buffer *rx_buffer;
>>> -       struct sk_buff *skb;
>>>         struct page *page;
>>>
>>>         rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
>>>         page = rx_buffer->page;
>>>         prefetchw(page);
>>>
>>> -       skb = rx_buffer->skb;
>>
>> I'm not sure where this line came from.  It was dropped from the
>> driver in next-queue a while ago so I don't think your patch applies
>> currently.
>>
>>> +       /* we are reusing so sync this buffer for CPU use */
>>> +       dma_sync_single_range_for_cpu(rx_ring->dev,
>>> +                                     rx_buffer->dma,
>>> +                                     rx_buffer->page_offset,
>>> +                                     I40E_RXBUFFER_2048,
>>> +                                     DMA_FROM_DEVICE);
>>
>> I would prefer if moving this piece up was handled in a separate
>> patch.  It just makes things more readable that way and I have patches
>> I will be submitting that will pull this and the few lines ahead of it
>> out of the fetch_rx_buffer entirely.
>>
>>> +
>>> +       if (rx_ring->xdp_prog) {
>>> +               bool xdp_consumed;
>>> +
>>> +               xdp_consumed = i40e_run_xdp(rx_ring, rx_buffer,
>>> +                                           rx_desc, rx_ring->xdp_prog);
>
> So here is the race I commented on above if we NULL xdp_prog between
> 'if' block and i40e_run_xdp.

Ok! This will be addressed by squashing patch 3 into patch 1.
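
The Rx path will then read the program exactly once per frame and
pass that local copy down, roughly:

        struct bpf_prog *xdp_prog = READ_ONCE(rx_ring->xdp_prog);

        if (xdp_prog && i40e_run_xdp(rx_ring, rx_buffer, rx_desc,
                                     xdp_prog))
                return true;

(or rcu_dereference() instead of READ_ONCE() once the __rcu
annotation is in place).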

>>> +               if (xdp_consumed)
>>> +                       return true;
>>> +       }
>>>
>>> -       if (likely(!skb)) {
>>> +       *skb = rx_buffer->skb;
>>> +
>>> +       if (likely(!*skb)) {
>>
>> Instead of making skb a double pointer we can just change a few things
>> to make this all simpler.  For example one thing we could look at
>> doing is taking an hlist_nulls type approach where we signal with a
>> value of 1 to indicate that XDP stole the buffer so there is no skb to
>> assign.  Then you just have to clean it up in i40e_is_non_eop by doing
>> a "&" and continue from there, or could add your own block after
>> i40e_is_non_eop.
>>
>> For that matter we could probably consolidate a few things into
>> i40e_cleanup_headers and have it handled there.
>>
>>>                 void *page_addr = page_address(page) + rx_buffer->page_offset;
>>>
>>>                 /* prefetch first cache line of first page */
>>> @@ -1633,32 +1713,25 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>>  #endif
>>>
>>>                 /* allocate a skb to store the frags */
>>> -               skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>>> -                                      I40E_RX_HDR_SIZE,
>>> -                                      GFP_ATOMIC | __GFP_NOWARN);
>>> -               if (unlikely(!skb)) {
>>> +               *skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
>>> +                                       I40E_RX_HDR_SIZE,
>>> +                                       GFP_ATOMIC | __GFP_NOWARN);
>>> +               if (unlikely(!*skb)) {
>>>                         rx_ring->rx_stats.alloc_buff_failed++;
>>> -                       return NULL;
>>> +                       return false;
>>>                 }
>>>
>>>                 /* we will be copying header into skb->data in
>>>                  * pskb_may_pull so it is in our interest to prefetch
>>>                  * it now to avoid a possible cache miss
>>>                  */
>>> -               prefetchw(skb->data);
>>> +               prefetchw((*skb)->data);
>>>         } else {
>>>                 rx_buffer->skb = NULL;
>>>         }
>>>
>>> -       /* we are reusing so sync this buffer for CPU use */
>>> -       dma_sync_single_range_for_cpu(rx_ring->dev,
>>> -                                     rx_buffer->dma,
>>> -                                     rx_buffer->page_offset,
>>> -                                     I40E_RXBUFFER_2048,
>>> -                                     DMA_FROM_DEVICE);
>>> -
>>>         /* pull page into skb */
>>> -       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
>>> +       if (i40e_add_rx_frag(rx_ring, rx_buffer, rx_desc, *skb)) {
>>>                 /* hand second half of page back to the ring */
>>>                 i40e_reuse_rx_page(rx_ring, rx_buffer);
>>>                 rx_ring->rx_stats.page_reuse_count++;
>>> @@ -1671,7 +1744,7 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
>>>         /* clear contents of buffer_info */
>>>         rx_buffer->page = NULL;
>>>
>>> -       return skb;
>>> +       return false;
>>>  }
>>>
>>>  /**
>>> @@ -1716,6 +1789,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>>>  }
>>>
>>>  /**
>>> + * i40e_update_rx_next_to_clean - Bumps the next-to-clean for an Rx ring
>>> + * @rx_ring: Rx ring to bump
>>> + **/
>>> +static void i40e_update_rx_next_to_clean(struct i40e_ring *rx_ring)
>>> +{
>>> +       u32 ntc = rx_ring->next_to_clean + 1;
>>> +
>>> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
>>> +       rx_ring->next_to_clean = ntc;
>>> +
>>> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
>>> +}
>>> +
>>> +/**
>>>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>>   * @rx_ring: rx descriptor ring to transact packets on
>>>   * @budget: Total limit on number of packets to process
>>> @@ -1739,6 +1826,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>>                 u16 vlan_tag;
>>>                 u8 rx_ptype;
>>>                 u64 qword;
>>> +               bool xdp_consumed;
>>>
>>>                 /* return some buffers to hardware, one at a time is too slow */
>>>                 if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
>>> @@ -1764,7 +1852,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>>                  */
>>>                 dma_rmb();
>>>
>>> -               skb = i40e_fetch_rx_buffer(rx_ring, rx_desc);
>>> +               xdp_consumed = i40e_fetch_rx_buffer(rx_ring, rx_desc, &skb);
>>> +               if (xdp_consumed) {
>>> +                       cleaned_count++;
>>> +
>>> +                       i40e_update_rx_next_to_clean(rx_ring);
>>> +                       total_rx_packets++;
>>> +                       continue;
>>> +               }
>>> +
>>
>> There are a few bits here that aren't quite handled correctly.  It
>> looks like we are getting the total_rx_packets, but you didn't update
>> total_rx_bytes.
>>
>>>                 if (!skb)
>>>                         break;
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>> index e065321ce8ed..957d856a82c4 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>>> @@ -341,6 +341,8 @@ struct i40e_ring {
>>>
>>>         struct rcu_head rcu;            /* to avoid race on free */
>>>         u16 next_to_alloc;
>>> +
>>> +       struct bpf_prog *xdp_prog;
>>>  } ____cacheline_internodealigned_in_smp;
>>>
>>
>> You might be better off moving this further up in the structure.
>> Maybe into the slot after *tail.  Otherwise it is going to be in a
>> pretty noisy cache line, although it looks like nobody ever really
>> bothered to optimize the i40e_ring structure so that you had read
>> mostly and write mostly cache lines anyway so maybe you don't need to
>> bother.
>
> Maybe another patch to clean up the structures.

Please refer to my reply to Alex's comments.


Again, thanks for digging through the code.


Björn

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread

Thread overview: 13+ messages
2016-12-09  8:22 [Intel-wired-lan] [PATCH v2 0/3] i40e: Support for XDP Björn Töpel
2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 1/3] i40e: Initial support " Björn Töpel
2016-12-09 11:56   ` kbuild test robot
2016-12-09 12:35     ` Björn Töpel
2016-12-09 16:41       ` Alexander Duyck
2016-12-09 17:20         ` Björn Töpel
2016-12-09 17:15   ` Alexander Duyck
2016-12-09 17:54     ` John Fastabend
2016-12-12  8:25       ` Björn Töpel
2016-12-12  8:08     ` Björn Töpel
2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 2/3] i40e: Add XDP_TX support Björn Töpel
2016-12-09 18:06   ` John Fastabend
2016-12-09  8:22 ` [Intel-wired-lan] [PATCH v2 3/3] i40e: Don't reset/rebuild rings on XDP program swap Björn Töpel
