* [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support
@ 2018-06-04 12:05 Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 01/11] xsk: moved struct xdp_umem definition Björn Töpel
                   ` (12 more replies)
  0 siblings, 13 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This patch series introduces zero-copy (ZC) support for
AF_XDP. Programs using AF_XDP sockets can now receive RX packets and
transmit TX packets without any copies being made. No modifications
to the application are needed, but the NIC driver needs to be
modified to support ZC. If ZC is not supported by the driver, the
modes introduced in the AF_XDP patch set will be used. Using ZC in
our micro benchmarks results in significantly improved performance,
as can be seen in the performance section later in this cover letter.
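
For applications that want to control this explicitly, the series
also adds XDP_COPY and XDP_ZEROCOPY bind flags (patch 5). Below is a
minimal user space sketch of forcing zero-copy at bind time; it
assumes the updated if_xdp.h uapi header, omits UMEM registration,
ring setup and error handling, and only defines AF_XDP in case the
libc headers lack it:

#include <linux/if_xdp.h>
#include <net/if.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif

static int xsk_bind_zc(int fd, const char *ifname, __u32 queue_id)
{
        struct sockaddr_xdp sxdp = {};

        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = if_nametoindex(ifname);
        sxdp.sxdp_queue_id = queue_id;
        /* Fail with an error instead of falling back to copy mode. */
        sxdp.sxdp_flags = XDP_ZEROCOPY;

        return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}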

Note that for an untrusted application, HW packet steering to a
specific queue pair (the one associated with the application) is a
requirement when using ZC, as the application would otherwise be able
to see other user space processes' packets. If the HW cannot support
the required packet steering, you need to use the XDP_SKB mode or the
XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
patch set can be used to do load balancing in that case.
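
In that case, an XDP program along the lines of the sketch below
(similar in spirit to the program the xdpsock sample loads; map size
and section names are illustrative) spreads traffic over the AF_XDP
sockets stored in the map. Here the key is the Rx queue index, but
any software-computed key, such as a flow hash, would do for load
balancing:

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsks_map = {
        .type = BPF_MAP_TYPE_XSKMAP,
        .key_size = sizeof(int),
        .value_size = sizeof(int),
        .max_entries = 16,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
        /* Redirect to the socket bound to this Rx queue; packets on
         * queues without an attached socket are dropped.
         */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}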

For benchmarking, you can use the xdpsock application from the AF_XDP
patch set without any modifications. Say that you would like your UDP
traffic from port 4242 to end up in queue 16, the queue we will enable
AF_XDP on. Here, we use ethtool for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode with zero-copy can then be
done using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for Tx/Rx and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is 8192 MB, and with
8 of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The NIC is an
Intel I40E 40 Gbit/s adapter using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are
without retpoline so that we can compare against previous numbers.

AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
set are also reported for ease of reference. The numbers within
parentheses are from the RFC V1 ZC patch set.
Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
rxdrop       2.9*       9.6*       21.1(21.5)
txpush       2.6*       -          22.0(21.6)
l2fwd        1.9*       2.5*       15.3(15.0)

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
rxdrop       2.1*       3.3*       3.3(3.3)
l2fwd        1.4*       1.8*       3.1(3.1)

* From AF_XDP V3 patch set and cover letter.

So why do we not get higher values for RX, similar to the 34 Mpps we
had in AF_PACKET V4? We ran an experiment with the rxdrop benchmark
that used neither the xdp_do_redirect/flush infrastructure nor an XDP
program (all traffic on a queue goes to one socket). Instead, the
driver acts directly on the AF_XDP socket. With this we got 36.9 Mpps,
a significant improvement without any change to the uapi. So not
forcing users to have an XDP program if they do not need one might be
a good idea. This measurement is actually higher than what we got with
AF_PACKET V4.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3M       0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3M        0

The structure of the patch set is as follows:

Patches 1-3: Plumbing for AF_XDP ZC support
Patches 4-5: AF_XDP ZC for RX
Patches 6-7: AF_XDP ZC for TX
Patches 8-10: ZC support for i40e
Patch 11: Use the bind flags in the sample application to force the TX
          skb path when -S is provided on the command line

This patch set is based on the new uapi introduced in "AF_XDP: bug
fixes and descriptor changes". You need to apply that patch set
first, before applying this one.

We based this patch set on bpf-next commit bd3a08aaa9a3 ("bpf:
flowlabel in bpf_fib_lookup should be flowinfo").

Comments:

* Implementing dynamic creation and deletion of queues in the i40e
  driver would facilitate the coexistence of xdp_redirect and af_xdp.

Thanks: Björn and Magnus

Björn Töpel (8):
  xsk: moved struct xdp_umem definition
  xsk: introduce xdp_umem_page
  net: xdp: added bpf_netdev_command XDP_{QUERY,SETUP}_XSK_UMEM
  xdp: add MEM_TYPE_ZERO_COPY
  xsk: add zero-copy support for Rx
  i40e: added queue pair disable/enable functions
  i40e: implement AF_XDP zero-copy support for Rx
  samples/bpf: xdpsock: use skb Tx path for XDP_SKB

Magnus Karlsson (3):
  net: added netdevice operation for Tx
  xsk: wire upp Tx zero-copy functions
  i40e: implement AF_XDP zero-copy support for Tx

 drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h      |  23 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 287 +++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 256 ++++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 151 ++++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 677 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  19 +
 include/linux/netdevice.h                   |  10 +
 include/net/xdp.h                           |  10 +
 include/net/xdp_sock.h                      |  77 +++-
 include/uapi/linux/if_xdp.h                 |   4 +-
 net/core/xdp.c                              |  19 +-
 net/xdp/xdp_umem.c                          | 118 ++++-
 net/xdp/xdp_umem.h                          |  32 +-
 net/xdp/xsk.c                               | 166 +++++--
 net/xdp/xsk_queue.h                         |  35 +-
 samples/bpf/xdpsock_user.c                  |   5 +
 17 files changed, 1657 insertions(+), 235 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

-- 
2.14.1

* [PATCH bpf-next 01/11] xsk: moved struct xdp_umem definition
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 02/11] xsk: introduce xdp_umem_page Björn Töpel
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Moved struct xdp_umem to xdp_sock.h, in order to prepare for zero-copy
support.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h | 24 +++++++++++++++++++++++-
 net/xdp/xdp_umem.c     |  1 +
 net/xdp/xdp_umem.h     | 22 +---------------------
 net/xdp/xsk_queue.h    |  3 +--
 4 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 7a647c56ec15..3a6cd88f179d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -6,12 +6,34 @@
 #ifndef _LINUX_XDP_SOCK_H
 #define _LINUX_XDP_SOCK_H
 
+#include <linux/workqueue.h>
+#include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/mm.h>
 #include <net/sock.h>
 
 struct net_device;
 struct xsk_queue;
-struct xdp_umem;
+
+struct xdp_umem_props {
+	u64 chunk_mask;
+	u64 size;
+};
+
+struct xdp_umem {
+	struct xsk_queue *fq;
+	struct xsk_queue *cq;
+	struct page **pgs;
+	struct xdp_umem_props props;
+	u32 headroom;
+	u32 chunk_size_nohr;
+	struct user_struct *user;
+	struct pid *pid;
+	unsigned long address;
+	refcount_t users;
+	struct work_struct work;
+	u32 npgs;
+};
 
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9ad791ff4739..2793a503223e 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -13,6 +13,7 @@
 #include <linux/mm.h>
 
 #include "xdp_umem.h"
+#include "xsk_queue.h"
 
 #define XDP_UMEM_MIN_CHUNK_SIZE 2048
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index aeadd1bcb72d..9433e8af650a 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -6,27 +6,7 @@
 #ifndef XDP_UMEM_H_
 #define XDP_UMEM_H_
 
-#include <linux/mm.h>
-#include <linux/if_xdp.h>
-#include <linux/workqueue.h>
-
-#include "xsk_queue.h"
-#include "xdp_umem_props.h"
-
-struct xdp_umem {
-	struct xsk_queue *fq;
-	struct xsk_queue *cq;
-	struct page **pgs;
-	struct xdp_umem_props props;
-	u32 headroom;
-	u32 chunk_size_nohr;
-	struct user_struct *user;
-	struct pid *pid;
-	unsigned long address;
-	refcount_t users;
-	struct work_struct work;
-	u32 npgs;
-};
+#include <net/xdp_sock.h>
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 337e5ad3b10e..5246ed420a16 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -8,8 +8,7 @@
 
 #include <linux/types.h>
 #include <linux/if_xdp.h>
-
-#include "xdp_umem_props.h"
+#include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
 
-- 
2.14.1

* [PATCH bpf-next 02/11] xsk: introduce xdp_umem_page
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 01/11] xsk: moved struct xdp_umem definition Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2019-03-13  9:39   ` [bpf-next,02/11] " Jiri Slaby
  2018-06-04 12:05 ` [PATCH bpf-next 03/11] net: xdp: added bpf_netdev_command XDP_{QUERY,SETUP}_XSK_UMEM Björn Töpel
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

The xdp_umem_page structure holds the address for a page, trading
memory for faster lookup. Later, we'll add the DMA address here as
well.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  7 ++++++-
 net/xdp/xdp_umem.c     | 15 ++++++++++++++-
 net/xdp/xdp_umem.h     |  3 +--
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 3a6cd88f179d..caf343a7e224 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -20,10 +20,14 @@ struct xdp_umem_props {
 	u64 size;
 };
 
+struct xdp_umem_page {
+	void *addr;
+};
+
 struct xdp_umem {
 	struct xsk_queue *fq;
 	struct xsk_queue *cq;
-	struct page **pgs;
+	struct xdp_umem_page *pages;
 	struct xdp_umem_props props;
 	u32 headroom;
 	u32 chunk_size_nohr;
@@ -32,6 +36,7 @@ struct xdp_umem {
 	unsigned long address;
 	refcount_t users;
 	struct work_struct work;
+	struct page **pgs;
 	u32 npgs;
 };
 
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 2793a503223e..aca826011f6c 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -65,6 +65,9 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		goto out;
 
 	mmput(mm);
+	kfree(umem->pages);
+	umem->pages = NULL;
+
 	xdp_umem_unaccount_pages(umem);
 out:
 	kfree(umem);
@@ -155,7 +158,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
 	unsigned int chunks, chunks_per_page;
 	u64 addr = mr->addr, size = mr->len;
-	int size_chk, err;
+	int size_chk, err, i;
 
 	if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
 		/* Strictly speaking we could support this, if:
@@ -213,6 +216,16 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	err = xdp_umem_pin_pages(umem);
 	if (err)
 		goto out_account;
+
+	umem->pages = kcalloc(umem->npgs, sizeof(*umem->pages), GFP_KERNEL);
+	if (!umem->pages) {
+		err = -ENOMEM;
+		goto out_account;
+	}
+
+	for (i = 0; i < umem->npgs; i++)
+		umem->pages[i].addr = page_address(umem->pgs[i]);
+
 	return 0;
 
 out_account:
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 9433e8af650a..40e8fa4a92af 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -10,8 +10,7 @@
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
-	return page_address(umem->pgs[addr >> PAGE_SHIFT]) +
-		(addr & (PAGE_SIZE - 1));
+	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
 }
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
-- 
2.14.1

* [PATCH bpf-next 03/11] net: xdp: added bpf_netdev_command XDP_{QUERY,SETUP}_XSK_UMEM
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 01/11] xsk: moved struct xdp_umem definition Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 02/11] xsk: introduce xdp_umem_page Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 04/11] xdp: add MEM_TYPE_ZERO_COPY Björn Töpel
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Extend ndo_bpf with two new commands, used for querying zero-copy
support and for registering a UMEM to a queue_id of a netdev.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/netdevice.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f17785a59d7..85d91cc41cdf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -817,10 +817,13 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_DESTROY,
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
+	XDP_QUERY_XSK_UMEM,
+	XDP_SETUP_XSK_UMEM,
 };
 
 struct bpf_prog_offload_ops;
 struct netlink_ext_ack;
+struct xdp_umem;
 
 struct netdev_bpf {
 	enum bpf_netdev_command command;
@@ -851,6 +854,11 @@ struct netdev_bpf {
 		struct {
 			struct bpf_offloaded_map *offmap;
 		};
+		/* XDP_SETUP_XSK_UMEM */
+		struct {
+			struct xdp_umem *umem;
+			u16 queue_id;
+		} xsk;
 	};
 };
 
-- 
2.14.1

* [PATCH bpf-next 04/11] xdp: add MEM_TYPE_ZERO_COPY
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (2 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 03/11] net: xdp: added bpf_netdev_command XDP_{QUERY,SETUP}_XSK_UMEM Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 05/11] xsk: add zero-copy support for Rx Björn Töpel
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Here, a new type of allocator support is added to the XDP return
API. A zero-copy allocated xdp_buff cannot be converted to an
xdp_frame. Instead, the buff has to be copied. This is not supported
at all in this commit.

Also, an opaque "handle" is added to xdp_buff. This can be used as a
context for the zero-copy allocator implementation.
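
For reference, a driver would typically embed the zero_copy_allocator
in its Rx ring and recycle the chunk identified by the handle in the
free callback. A rough sketch under that assumption (my_ring,
my_recycle and my_ring_setup_zc are hypothetical names, not part of
this patch):

#include <linux/kernel.h>
#include <net/xdp.h>

struct my_ring {
        struct zero_copy_allocator zca;
        struct xdp_rxq_info xdp_rxq;
        /* ... driver ring state ... */
};

static void my_recycle(struct my_ring *ring, unsigned long handle)
{
        /* Driver-specific: return the umem chunk to the fill pool. */
}

static void my_zca_free(struct zero_copy_allocator *zca, unsigned long handle)
{
        struct my_ring *ring = container_of(zca, struct my_ring, zca);

        my_recycle(ring, handle);
}

static int my_ring_setup_zc(struct my_ring *ring)
{
        ring->zca.free = my_zca_free;
        return xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
                                          MEM_TYPE_ZERO_COPY, &ring->zca);
}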

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h | 10 ++++++++++
 net/core/xdp.c    | 19 ++++++++++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index a3b71a4dd71d..2deea7166a34 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -37,6 +37,7 @@ enum xdp_mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
 	MEM_TYPE_PAGE_POOL,
+	MEM_TYPE_ZERO_COPY,
 	MEM_TYPE_MAX,
 };
 
@@ -51,6 +52,10 @@ struct xdp_mem_info {
 
 struct page_pool;
 
+struct zero_copy_allocator {
+	void (*free)(struct zero_copy_allocator *zca, unsigned long handle);
+};
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
@@ -63,6 +68,7 @@ struct xdp_buff {
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
+	unsigned long handle;
 	struct xdp_rxq_info *rxq;
 };
 
@@ -86,6 +92,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	int metasize;
 	int headroom;
 
+	/* TODO: implement clone, copy, use "native" MEM_TYPE */
+	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
+		return NULL;
+
 	/* Assure headroom is available for storing info */
 	headroom = xdp->data - xdp->data_hard_start;
 	metasize = xdp->data - xdp->data_meta;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index cb8c4e061a5a..9d1f22072d5d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -31,6 +31,7 @@ struct xdp_mem_allocator {
 	union {
 		void *allocator;
 		struct page_pool *page_pool;
+		struct zero_copy_allocator *zc_alloc;
 	};
 	struct rhash_head node;
 	struct rcu_head rcu;
@@ -261,7 +262,7 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 	xdp_rxq->mem.type = type;
 
 	if (!allocator) {
-		if (type == MEM_TYPE_PAGE_POOL)
+		if (type == MEM_TYPE_PAGE_POOL || type == MEM_TYPE_ZERO_COPY)
 			return -EINVAL; /* Setup time check page_pool req */
 		return 0;
 	}
@@ -314,7 +315,8 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
  * is used for those calls sites.  Thus, allowing for faster recycling
  * of xdp_frames/pages in those cases.
  */
-static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct)
+static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
+			 unsigned long handle)
 {
 	struct xdp_mem_allocator *xa;
 	struct page *page;
@@ -338,6 +340,13 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct)
 		page = virt_to_page(data); /* Assumes order0 page*/
 		put_page(page);
 		break;
+	case MEM_TYPE_ZERO_COPY:
+		/* NB! Only valid from an xdp_buff! */
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		xa->zc_alloc->free(xa->zc_alloc, handle);
+		rcu_read_unlock();
 	default:
 		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
 		break;
@@ -346,18 +355,18 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct)
 
 void xdp_return_frame(struct xdp_frame *xdpf)
 {
-	__xdp_return(xdpf->data, &xdpf->mem, false);
+	__xdp_return(xdpf->data, &xdpf->mem, false, 0);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
 {
-	__xdp_return(xdpf->data, &xdpf->mem, true);
+	__xdp_return(xdpf->data, &xdpf->mem, true, 0);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
 
 void xdp_return_buff(struct xdp_buff *xdp)
 {
-	__xdp_return(xdp->data, &xdp->rxq->mem, true);
+	__xdp_return(xdp->data, &xdp->rxq->mem, true, xdp->handle);
 }
 EXPORT_SYMBOL_GPL(xdp_return_buff);
-- 
2.14.1

* [PATCH bpf-next 05/11] xsk: add zero-copy support for Rx
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (3 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 04/11] xdp: add MEM_TYPE_ZERO_COPY Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 06/11] net: added netdevice operation for Tx Björn Töpel
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Extend xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and wire
up the ndo_bpf call in bind.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h      |  6 +++
 include/uapi/linux/if_xdp.h |  4 +-
 net/xdp/xdp_umem.c          | 77 ++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h          |  3 ++
 net/xdp/xsk.c               | 96 +++++++++++++++++++++++++++++++++++----------
 5 files changed, 165 insertions(+), 21 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index caf343a7e224..d93d3aac3fc9 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -22,6 +22,7 @@ struct xdp_umem_props {
 
 struct xdp_umem_page {
 	void *addr;
+	dma_addr_t dma;
 };
 
 struct xdp_umem {
@@ -38,6 +39,9 @@ struct xdp_umem {
 	struct work_struct work;
 	struct page **pgs;
 	u32 npgs;
+	struct net_device *dev;
+	u16 queue_id;
+	bool zc;
 };
 
 struct xdp_sock {
@@ -60,6 +64,8 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
+void xsk_umem_discard_addr(struct xdp_umem *umem);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e411d6f9ac65..1fa0e977ea8d 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -13,7 +13,9 @@
 #include <linux/types.h>
 
 /* Options for the sxdp_flags field */
-#define XDP_SHARED_UMEM 1
+#define XDP_SHARED_UMEM	(1 << 0)
+#define XDP_COPY	(1 << 1) /* Force copy-mode */
+#define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
 
 struct sockaddr_xdp {
 	__u16 sxdp_family;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index aca826011f6c..f729d79b8d91 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -17,6 +17,81 @@
 
 #define XDP_UMEM_MIN_CHUNK_SIZE 2048
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u32 queue_id, u16 flags)
+{
+	bool force_zc, force_copy;
+	struct netdev_bpf bpf;
+	int err;
+
+	force_zc = flags & XDP_ZEROCOPY;
+	force_copy = flags & XDP_COPY;
+
+	if (force_zc && force_copy)
+		return -EINVAL;
+
+	if (force_copy)
+		return 0;
+
+	dev_hold(dev);
+
+	if (dev->netdev_ops->ndo_bpf) {
+		bpf.command = XDP_QUERY_XSK_UMEM;
+
+		rtnl_lock();
+		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+		rtnl_unlock();
+
+		if (err) {
+			dev_put(dev);
+			return force_zc ? -ENOTSUPP : 0;
+		}
+
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = umem;
+		bpf.xsk.queue_id = queue_id;
+
+		rtnl_lock();
+		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+		rtnl_unlock();
+
+		if (err) {
+			dev_put(dev);
+			return force_zc ? err : 0; /* fail or fallback */
+		}
+
+		umem->dev = dev;
+		umem->queue_id = queue_id;
+		umem->zc = true;
+		return 0;
+	}
+
+	dev_put(dev);
+	return force_zc ? -ENOTSUPP : 0; /* fail or fallback */
+}
+
+void xdp_umem_clear_dev(struct xdp_umem *umem)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = NULL;
+		bpf.xsk.queue_id = umem->queue_id;
+
+		rtnl_lock();
+		err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
+		rtnl_unlock();
+
+		if (err)
+			WARN(1, "failed to disable umem!\n");
+
+		dev_put(umem->dev);
+		umem->dev = NULL;
+	}
+}
+
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
 	unsigned int i;
@@ -43,6 +118,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct task_struct *task;
 	struct mm_struct *mm;
 
+	xdp_umem_clear_dev(umem);
+
 	if (umem->fq) {
 		xskq_destroy(umem->fq);
 		umem->fq = NULL;
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 40e8fa4a92af..674508a32a4d 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -13,6 +13,9 @@ static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
 }
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u32 queue_id, u16 flags);
+void xdp_umem_clear_dev(struct xdp_umem *umem);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4688c750df1d..ab64bd8260ea 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -36,19 +36,28 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
-	return !!xs->rx;
+	return READ_ONCE(xs->rx) &&  READ_ONCE(xs->umem) &&
+		READ_ONCE(xs->umem->fq);
 }
 
-static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
+{
+	return xskq_peek_addr(umem->fq, addr);
+}
+EXPORT_SYMBOL(xsk_umem_peek_addr);
+
+void xsk_umem_discard_addr(struct xdp_umem *umem)
+{
+	xskq_discard_addr(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_discard_addr);
+
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-	u32 len = xdp->data_end - xdp->data;
 	void *buffer;
 	u64 addr;
 	int err;
 
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-		return -EINVAL;
-
 	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
 	    len > xs->umem->chunk_size_nohr) {
 		xs->rx_dropped++;
@@ -60,25 +69,41 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	buffer = xdp_umem_get_data(xs->umem, addr);
 	memcpy(buffer, xdp->data, len);
 	err = xskq_produce_batch_desc(xs->rx, addr, len);
-	if (!err)
+	if (!err) {
 		xskq_discard_addr(xs->umem->fq);
-	else
-		xs->rx_dropped++;
+		xdp_return_buff(xdp);
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
-int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-	int err;
+	int err = xskq_produce_batch_desc(xs->rx, (u64)xdp->handle, len);
 
-	err = __xsk_rcv(xs, xdp);
-	if (likely(!err))
+	if (err) {
 		xdp_return_buff(xdp);
+		xs->rx_dropped++;
+	}
 
 	return err;
 }
 
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	u32 len;
+
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	len = xdp->data_end - xdp->data;
+
+	return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
+		__xsk_rcv_zc(xs, xdp, len) : __xsk_rcv(xs, xdp, len);
+}
+
 void xsk_flush(struct xdp_sock *xs)
 {
 	xskq_produce_flush_desc(xs->rx);
@@ -87,12 +112,29 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	u32 len = xdp->data_end - xdp->data;
+	void *buffer;
+	u64 addr;
 	int err;
 
-	err = __xsk_rcv(xs, xdp);
-	if (!err)
+	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
+	    len > xs->umem->chunk_size_nohr) {
+		xs->rx_dropped++;
+		return -ENOSPC;
+	}
+
+	addr += xs->umem->headroom;
+
+	buffer = xdp_umem_get_data(xs->umem, addr);
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, addr, len);
+	if (!err) {
+		xskq_discard_addr(xs->umem->fq);
 		xsk_flush(xs);
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
@@ -291,6 +333,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 	struct net_device *dev;
+	u32 flags, qid;
 	int err = 0;
 
 	if (addr_len < sizeof(struct sockaddr_xdp))
@@ -315,16 +358,26 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_unlock;
 	}
 
-	if ((xs->rx && sxdp->sxdp_queue_id >= dev->real_num_rx_queues) ||
-	    (xs->tx && sxdp->sxdp_queue_id >= dev->real_num_tx_queues)) {
+	qid = sxdp->sxdp_queue_id;
+
+	if ((xs->rx && qid >= dev->real_num_rx_queues) ||
+	    (xs->tx && qid >= dev->real_num_tx_queues)) {
 		err = -EINVAL;
 		goto out_unlock;
 	}
 
-	if (sxdp->sxdp_flags & XDP_SHARED_UMEM) {
+	flags = sxdp->sxdp_flags;
+
+	if (flags & XDP_SHARED_UMEM) {
 		struct xdp_sock *umem_xs;
 		struct socket *sock;
 
+		if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY)) {
+			/* Cannot specify flags for shared sockets. */
+			err = -EINVAL;
+			goto out_unlock;
+		}
+
 		if (xs->umem) {
 			/* We have already our own. */
 			err = -EINVAL;
@@ -343,8 +396,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 			err = -EBADF;
 			sockfd_put(sock);
 			goto out_unlock;
-		} else if (umem_xs->dev != dev ||
-			   umem_xs->queue_id != sxdp->sxdp_queue_id) {
+		} else if (umem_xs->dev != dev || umem_xs->queue_id != qid) {
 			err = -EINVAL;
 			sockfd_put(sock);
 			goto out_unlock;
@@ -360,6 +412,10 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		/* This xsk has its own umem. */
 		xskq_set_umem(xs->umem->fq, &xs->umem->props);
 		xskq_set_umem(xs->umem->cq, &xs->umem->props);
+
+		err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
+		if (err)
+			goto out_unlock;
 	}
 
 	xs->dev = dev;
-- 
2.14.1

* [PATCH bpf-next 06/11] net: added netdevice operation for Tx
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (4 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 05/11] xsk: add zero-copy support for Rx Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 07/11] xsk: wire upp Tx zero-copy functions Björn Töpel
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start pulling
userland AF_XDP Tx frames from a NAPI context.
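
A rough, hypothetical driver-side sketch of what could back this ndo
(my_priv and friends are illustrative names, not from this patch; a
real driver may prefer to trigger the queue's interrupt instead of
calling napi_schedule() directly, so that IRQ affinity is honored):

#include <linux/netdevice.h>

struct my_q_vector {
        struct napi_struct napi;
};

struct my_priv {
        struct my_q_vector *q_vectors;
        u32 num_queues;
};

static int my_xsk_async_xmit(struct net_device *dev, u32 queue_id)
{
        struct my_priv *priv = netdev_priv(dev);

        if (queue_id >= priv->num_queues)
                return -ENXIO;

        /* Wake the queue's NAPI context; its poll routine will then
         * consume the Tx descriptors that user space has queued.
         */
        napi_schedule(&priv->q_vectors[queue_id].napi);

        return 0;
}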

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 85d91cc41cdf..7ddf9c7ad6d7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1391,6 +1391,8 @@ struct net_device_ops {
 						struct xdp_frame **xdp,
 						u32 flags);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
+	int			(*ndo_xsk_async_xmit)(struct net_device *dev,
+						      u32 queue_id);
 };
 
 /**
-- 
2.14.1

* [PATCH bpf-next 07/11] xsk: wire upp Tx zero-copy functions
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (5 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 06/11] net: added netdevice operation for Tx Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 08/11] i40e: added queue pair disable/enable functions Björn Töpel
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here we add the functionality required to support zero-copy Tx, and
also expose various zero-copy related functions to the netdevs.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h |  9 +++++++
 net/xdp/xdp_umem.c     | 29 +++++++++++++++++++--
 net/xdp/xdp_umem.h     |  8 +++++-
 net/xdp/xsk.c          | 70 +++++++++++++++++++++++++++++++++++++++++++++-----
 net/xdp/xsk_queue.h    | 32 ++++++++++++++++++++++-
 5 files changed, 137 insertions(+), 11 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d93d3aac3fc9..9fe472f2ac95 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -9,6 +9,7 @@
 #include <linux/workqueue.h>
 #include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <net/sock.h>
 
@@ -42,6 +43,8 @@ struct xdp_umem {
 	struct net_device *dev;
 	u16 queue_id;
 	bool zc;
+	spinlock_t xsk_list_lock;
+	struct list_head xsk_list;
 };
 
 struct xdp_sock {
@@ -53,6 +56,8 @@ struct xdp_sock {
 	struct list_head flush_node;
 	u16 queue_id;
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
+	struct list_head list;
+	bool zc;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	u64 rx_dropped;
@@ -64,8 +69,12 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+/* Used from netdev driver */
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
+void xsk_umem_consume_tx_done(struct xdp_umem *umem);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index f729d79b8d91..7eb4948a38d2 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -17,6 +17,29 @@
 
 #define XDP_UMEM_MIN_CHUNK_SIZE 2048
 
+void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&umem->xsk_list_lock, flags);
+	list_add_rcu(&xs->list, &umem->xsk_list);
+	spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+}
+
+void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
+{
+	unsigned long flags;
+
+	if (xs->dev) {
+		spin_lock_irqsave(&umem->xsk_list_lock, flags);
+		list_del_rcu(&xs->list);
+		spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
+		if (umem->zc)
+			synchronize_net();
+	}
+}
+
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags)
 {
@@ -35,7 +58,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 
 	dev_hold(dev);
 
-	if (dev->netdev_ops->ndo_bpf) {
+	if (dev->netdev_ops->ndo_bpf && dev->netdev_ops->ndo_xsk_async_xmit) {
 		bpf.command = XDP_QUERY_XSK_UMEM;
 
 		rtnl_lock();
@@ -70,7 +93,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 	return force_zc ? -ENOTSUPP : 0; /* fail or fallback */
 }
 
-void xdp_umem_clear_dev(struct xdp_umem *umem)
+static void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
 	struct netdev_bpf bpf;
 	int err;
@@ -283,6 +306,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
+	INIT_LIST_HEAD(&umem->xsk_list);
+	spin_lock_init(&umem->xsk_list_lock);
 
 	refcount_set(&umem->users, 1);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 674508a32a4d..f11560334f88 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -13,12 +13,18 @@ static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
 }
 
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+}
+
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags);
-void xdp_umem_clear_dev(struct xdp_umem *umem);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
+void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs);
+void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs);
 struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr);
 
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ab64bd8260ea..ddca4bf1cfc8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -21,6 +21,7 @@
 #include <linux/uaccess.h>
 #include <linux/net.h>
 #include <linux/netdevice.h>
+#include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -138,6 +139,59 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+	xskq_produce_flush_addr_n(umem->cq, nb_entries);
+}
+EXPORT_SYMBOL(xsk_umem_complete_tx);
+
+void xsk_umem_consume_tx_done(struct xdp_umem *umem)
+{
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		xs->sk.sk_write_space(&xs->sk);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx_done);
+
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len)
+{
+	struct xdp_desc desc;
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		if (!xskq_peek_desc(xs->tx, &desc))
+			continue;
+
+		if (xskq_produce_addr_lazy(umem->cq, desc.addr))
+			goto out;
+
+		*dma = xdp_umem_get_dma(umem, desc.addr);
+		*len = desc.len;
+
+		xskq_discard_desc(xs->tx);
+		rcu_read_unlock();
+		return true;
+	}
+
+out:
+	rcu_read_unlock();
+	return false;
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx);
+
+static int xsk_zc_xmit(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev = xs->dev;
+
+	return dev->netdev_ops->ndo_xsk_async_xmit(dev, xs->queue_id);
+}
+
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg;
@@ -151,7 +205,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			    size_t total_len)
 {
-	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	u32 max_batch = TX_BATCH_SIZE;
 	struct xdp_sock *xs = xdp_sk(sk);
 	bool sent_frame = false;
@@ -161,8 +214,6 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 	if (unlikely(!xs->tx))
 		return -ENOBUFS;
-	if (need_wait)
-		return -EOPNOTSUPP;
 
 	mutex_lock(&xs->mutex);
 
@@ -192,7 +243,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			goto out;
 		}
 
-		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		skb = sock_alloc_send_skb(sk, len, 1, &err);
 		if (unlikely(!skb)) {
 			err = -EAGAIN;
 			goto out;
@@ -235,6 +286,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
@@ -242,8 +294,10 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
+	if (need_wait)
+		return -EOPNOTSUPP;
 
-	return xsk_generic_xmit(sk, m, total_len);
+	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
 }
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
@@ -419,10 +473,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	}
 
 	xs->dev = dev;
-	xs->queue_id = sxdp->sxdp_queue_id;
-
+	xs->zc = xs->umem->zc;
+	xs->queue_id = qid;
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
+	xdp_add_sk_umem(xs->umem, xs);
 
 out_unlock:
 	if (err)
@@ -660,6 +715,7 @@ static void xsk_destruct(struct sock *sk)
 
 	xskq_destroy(xs->rx);
 	xskq_destroy(xs->tx);
+	xdp_del_sk_umem(xs->umem, xs);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 5246ed420a16..ef6a6f0ec949 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -11,6 +11,7 @@
 #include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
+#define LAZY_UPDATE_THRESHOLD 128
 
 struct xdp_ring {
 	u32 producer ____cacheline_aligned_in_smp;
@@ -61,9 +62,14 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 	return (entries > dcnt) ? dcnt : entries;
 }
 
+static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
+{
+	return q->nentries - (producer - q->cons_tail);
+}
+
 static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 {
-	u32 free_entries = q->nentries - (producer - q->cons_tail);
+	u32 free_entries = xskq_nb_free_lazy(q, producer);
 
 	if (free_entries >= dcnt)
 		return free_entries;
@@ -123,6 +129,9 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
+	if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
 	ring->desc[q->prod_tail++ & q->ring_mask] = addr;
 
 	/* Order producer and data */
@@ -132,6 +141,27 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
 	return 0;
 }
 
+static inline int xskq_produce_addr_lazy(struct xsk_queue *q, u64 addr)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
+	ring->desc[q->prod_head++ & q->ring_mask] = addr;
+	return 0;
+}
+
+static inline void xskq_produce_flush_addr_n(struct xsk_queue *q,
+					     u32 nb_entries)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail += nb_entries;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
 static inline int xskq_reserve_addr(struct xsk_queue *q)
 {
 	if (xskq_nb_free(q, q->prod_head, 1) == 0)
-- 
2.14.1

* [PATCH bpf-next 08/11] i40e: added queue pair disable/enable functions
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (6 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 07/11] xsk: wire upp Tx zero-copy functions Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 12:05 ` [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx Björn Töpel
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Queue pair enable/disable plumbing.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 251 ++++++++++++++++++++++++++++
 1 file changed, 251 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b5daa5c9c7de..369a116edaa1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11823,6 +11823,257 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_control_napi(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_control_rings(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_control_rings(vsi, queue_pair,
+					    false /* disable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
-- 
2.14.1

* [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (7 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 08/11] i40e: added queue pair disable/enable functions Björn Töpel
@ 2018-06-04 12:05 ` Björn Töpel
  2018-06-04 20:35   ` Alexander Duyck
  2018-06-04 12:06 ` [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx Björn Töpel
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:05 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This commit adds initial AF_XDP zero-copy support for i40e-based
NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
to pass a UMEM to the driver. The driver will then DMA map all the
frames in the UMEM. Next, the Rx code will allocate frames from the
UMEM fill queue, instead of the regular page allocator.
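
The idea behind the DMA mapping is to map each umem page once at
setup time and stash the mapping in the xdp_umem_page array added
earlier in the series, so the hot path only has to do a lookup. A
hypothetical sketch of such a helper (not the actual i40e code):

#include <linux/dma-mapping.h>
#include <net/xdp_sock.h>

static int my_xsk_umem_dma_map(struct device *dev, struct xdp_umem *umem)
{
        unsigned int i, j;
        dma_addr_t dma;

        for (i = 0; i < umem->npgs; i++) {
                dma = dma_map_page(dev, umem->pgs[i], 0, PAGE_SIZE,
                                   DMA_BIDIRECTIONAL);
                if (dma_mapping_error(dev, dma))
                        goto out_unmap;

                umem->pages[i].dma = dma;
        }

        return 0;

out_unmap:
        for (j = 0; j < i; j++)
                dma_unmap_page(dev, umem->pages[j].dma, PAGE_SIZE,
                               DMA_BIDIRECTIONAL);
        return -ENOMEM;
}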

Externally, for the rest of the XDP code, the driver internal UMEM
allocator will appear as a MEM_TYPE_ZERO_COPY.

The commit also introduces completely new clean_rx_irq/allocator
functions for zero-copy, and a means (function pointers) to set the
allocator and clean_rx functions.

This first version does not support:
* passing frames to the stack via XDP_PASS (clone/copy to skb).
* doing XDP redirect to other than AF_XDP sockets
  (convert_to_xdp_frame does not clone the frame yet).

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h      |  23 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
 include/net/xdp_sock.h                      |  19 +
 net/xdp/xdp_umem.h                          |  10 -
 9 files changed, 789 insertions(+), 146 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 14397e7e9925..50590e8d1fd1 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
 	i40e_txrx.o	\
 	i40e_ptp.o	\
 	i40e_client.o   \
-	i40e_virtchnl_pf.o
+	i40e_virtchnl_pf.o \
+	i40e_xsk.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..20955e5dce02 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,12 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	/* AF_XDP zero-copy */
+	struct xdp_umem **xsk_umems;
+	u16 num_xsk_umems_used;
+	u16 num_xsk_umems;
+
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 	return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+	int qid = ring->queue_index;
+
+	if (ring_is_xdp(ring))
+		qid -= ring->vsi->alloc_queue_pairs;
+
+	if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+		return NULL;
+
+	return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
@@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
 int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
 				      struct i40e_cloud_filter *filter,
 				      bool add);
+int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
+int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
+
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 369a116edaa1..8c602424d339 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <net/xdp_sock.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -16,6 +17,7 @@
  */
 #define CREATE_TRACE_POINTS
 #include "i40e_trace.h"
+#include "i40e_xsk.h"
 
 const char i40e_driver_name[] = "i40e";
 static const char i40e_driver_string[] =
@@ -3071,6 +3073,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
+	if (ring_is_xdp(ring))
+		ring->xsk_umem = i40e_xsk_umem(ring);
+
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
 		ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -3180,13 +3185,30 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	i40e_status err = 0;
+	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	ring->xsk_umem = i40e_xsk_umem(ring);
+	if (ring->xsk_umem) {
+		ring->clean_rx_irq = i40e_clean_rx_irq_zc;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
+		ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
+				   XDP_PACKET_HEADROOM;
+		ring->zca.free = i40e_zca_free;
+		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_ZERO_COPY,
+						 &ring->zca);
+		if (ret)
+			return ret;
+	} else {
+		ring->clean_rx_irq = i40e_clean_rx_irq;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
+		ring->rx_buf_len = vsi->rx_buf_len;
+	}
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3242,7 +3264,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
 }
@@ -12022,7 +12044,7 @@ static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
  *
  * Returns 0 on success, <0 on failure.
  **/
-static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
 {
 	int err;
 
@@ -12047,7 +12069,7 @@ static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
  *
  * Returns 0 on success, <0 on failure.
  **/
-static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
 {
 	int err;
 
@@ -12095,6 +12117,11 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_QUERY_XSK_UMEM:
+		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5f01e4ce9c92..6b1142fbc697 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -5,6 +5,7 @@
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp.h>
+#include <net/xdp_sock.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
  * This is used to verify if the FD programming or invalidation
  * requested by SW to the HW is successful or not and take actions accordingly.
  **/
-static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
-				  union i40e_rx_desc *rx_desc, u8 prog_id)
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+			   union i40e_rx_desc *rx_desc, u8 prog_id)
 {
 	struct i40e_pf *pf = rx_ring->vsi->back;
 	struct pci_dev *pdev = pf->pdev;
@@ -1246,25 +1247,6 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
 	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
 }
 
-/**
- * i40e_rx_is_programming_status - check for programming status descriptor
- * @qw: qword representing status_error_len in CPU ordering
- *
- * The value of in the descriptor length field indicate if this
- * is a programming status descriptor for flow director or FCoE
- * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
- * it is a packet descriptor.
- **/
-static inline bool i40e_rx_is_programming_status(u64 qw)
-{
-	/* The Rx filter programming status and SPH bit occupy the same
-	 * spot in the descriptor. Since we don't support packet split we
-	 * can just reuse the bit as an indication that this is a
-	 * programming status descriptor.
-	 */
-	return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
-}
-
 /**
  * i40e_clean_programming_status - clean the programming status descriptor
  * @rx_ring: the rx ring that has this descriptor
@@ -1373,31 +1355,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	}
 
 	/* Free all the Rx ring sk_buffs */
-	for (i = 0; i < rx_ring->count; i++) {
-		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
-
-		if (!rx_bi->page)
-			continue;
+	if (!rx_ring->xsk_umem) {
+		for (i = 0; i < rx_ring->count; i++) {
+			struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		/* Invalidate cache lines that may have been written to by
-		 * device so that we avoid corrupting memory.
-		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     I40E_RX_DMA_ATTR);
-
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+			if (!rx_bi->page)
+				continue;
 
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+			/* Invalidate cache lines that may have been
+			 * written to by device so that we avoid
+			 * corrupting memory.
+			 */
+			dma_sync_single_range_for_cpu(rx_ring->dev,
+						      rx_bi->dma,
+						      rx_bi->page_offset,
+						      rx_ring->rx_buf_len,
+						      DMA_FROM_DEVICE);
+
+			/* free resources associated with mapping */
+			dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
+					     i40e_rx_pg_size(rx_ring),
+					     DMA_FROM_DEVICE,
+					     I40E_RX_DMA_ATTR);
+
+			__page_frag_cache_drain(rx_bi->page,
+						rx_bi->pagecnt_bias);
+
+			rx_bi->page = NULL;
+			rx_bi->page_offset = 0;
+		}
 	}
 
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
@@ -1487,27 +1473,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 	return err;
 }
 
-/**
- * i40e_release_rx_desc - Store the new tail and head values
- * @rx_ring: ring to bump
- * @val: new head index
- **/
-static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
-{
-	rx_ring->next_to_use = val;
-
-	/* update next to alloc since we have filled the ring */
-	rx_ring->next_to_alloc = val;
-
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.  (Only
-	 * applicable for weak-ordered memory model archs,
-	 * such as IA-64).
-	 */
-	wmb();
-	writel(val, rx_ring->tail);
-}
-
 /**
  * i40e_rx_offset - Return expected offset into page to access data
  * @rx_ring: Ring we are requesting offset of
@@ -1576,8 +1541,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
  * @skb: packet to send up
  * @vlan_tag: vlan tag for packet
  **/
-static void i40e_receive_skb(struct i40e_ring *rx_ring,
-			     struct sk_buff *skb, u16 vlan_tag)
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+		      struct sk_buff *skb, u16 vlan_tag)
 {
 	struct i40e_q_vector *q_vector = rx_ring->q_vector;
 
@@ -1804,7 +1769,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
  * order to populate the hash, checksum, VLAN, protocol, and
  * other fields within the skb.
  **/
-static inline
 void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 			     union i40e_rx_desc *rx_desc, struct sk_buff *skb,
 			     u8 rx_ptype)
@@ -1829,46 +1793,6 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 	skb->protocol = eth_type_trans(skb, rx_ring->netdev);
 }
 
-/**
- * i40e_cleanup_headers - Correct empty headers
- * @rx_ring: rx descriptor ring packet is being transacted on
- * @skb: pointer to current skb being fixed
- * @rx_desc: pointer to the EOP Rx descriptor
- *
- * Also address the case where we are pulling data in on pages only
- * and as such no data is present in the skb header.
- *
- * In addition if skb is not at least 60 bytes we need to pad it so that
- * it is large enough to qualify as a valid Ethernet frame.
- *
- * Returns true if an error was encountered and skb was freed.
- **/
-static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
-				 union i40e_rx_desc *rx_desc)
-
-{
-	/* XDP packets use error pointer so abort at this point */
-	if (IS_ERR(skb))
-		return true;
-
-	/* ERR_MASK will only have valid bits if EOP set, and
-	 * what we are doing here is actually checking
-	 * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
-	 * the error field
-	 */
-	if (unlikely(i40e_test_staterr(rx_desc,
-				       BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
-		dev_kfree_skb_any(skb);
-		return true;
-	}
-
-	/* if eth_skb_pad returns an error the skb was freed */
-	if (eth_skb_pad(skb))
-		return true;
-
-	return false;
-}
-
 /**
  * i40e_page_is_reusable - check if any reuse is possible
  * @page: page struct to check
@@ -2177,15 +2101,11 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 	return true;
 }
 
-#define I40E_XDP_PASS 0
-#define I40E_XDP_CONSUMED 1
-#define I40E_XDP_TX 2
-
 static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring);
 
-static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
-				 struct i40e_ring *xdp_ring)
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+			  struct i40e_ring *xdp_ring)
 {
 	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
 
@@ -2214,8 +2134,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	prefetchw(xdp->data_hard_start); /* xdp_frame write */
-
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2263,15 +2181,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
-{
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.
-	 */
-	wmb();
-	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
-}
-
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2284,7 +2193,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * Returns amount of work completed
  **/
-static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct sk_buff *skb = rx_ring->skb;
@@ -2576,7 +2485,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 820f76db251b..cddb185cd2f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -296,13 +296,22 @@ struct i40e_tx_buffer {
 
 struct i40e_rx_buffer {
 	dma_addr_t dma;
-	struct page *page;
+	union {
+		struct {
+			struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
+			__u32 page_offset;
 #else
-	__u16 page_offset;
+			__u16 page_offset;
 #endif
-	__u16 pagecnt_bias;
+			__u16 pagecnt_bias;
+		};
+		struct {
+			/* for umem */
+			void *addr;
+			u64 handle;
+		};
+	};
 };
 
 struct i40e_queue_stats {
@@ -414,6 +423,12 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+
+	int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
+	bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
+	struct xdp_umem *xsk_umem;
+
+	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -490,6 +505,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		  u32 flags);
 void i40e_xdp_flush(struct net_device *dev);
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -576,4 +592,108 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
 {
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
+
+#define I40E_XDP_PASS 0
+#define I40E_XDP_CONSUMED 1
+#define I40E_XDP_TX 2
+
+/**
+ * i40e_release_rx_desc - Store the new tail and head values
+ * @rx_ring: ring to bump
+ * @val: new head index
+ **/
+static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
+{
+	rx_ring->next_to_use = val;
+
+	/* update next to alloc since we have filled the ring */
+	rx_ring->next_to_alloc = val;
+
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.  (Only
+	 * applicable for weak-ordered memory model archs,
+	 * such as IA-64).
+	 */
+	wmb();
+	writel(val, rx_ring->tail);
+}
+
+/**
+ * i40e_rx_is_programming_status - check for programming status descriptor
+ * @qw: qword representing status_error_len in CPU ordering
+ *
+ * The value of in the descriptor length field indicate if this
+ * is a programming status descriptor for flow director or FCoE
+ * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
+ * it is a packet descriptor.
+ **/
+static inline bool i40e_rx_is_programming_status(u64 qw)
+{
+	/* The Rx filter programming status and SPH bit occupy the same
+	 * spot in the descriptor. Since we don't support packet split we
+	 * can just reuse the bit as an indication that this is a
+	 * programming status descriptor.
+	 */
+	return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
+}
+
+/**
+ * i40e_cleanup_headers - Correct empty headers
+ * @rx_ring: rx descriptor ring packet is being transacted on
+ * @skb: pointer to current skb being fixed
+ * @rx_desc: pointer to the EOP Rx descriptor
+ *
+ * Also address the case where we are pulling data in on pages only
+ * and as such no data is present in the skb header.
+ *
+ * In addition if skb is not at least 60 bytes we need to pad it so that
+ * it is large enough to qualify as a valid Ethernet frame.
+ *
+ * Returns true if an error was encountered and skb was freed.
+ **/
+static inline bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
+					struct sk_buff *skb,
+					union i40e_rx_desc *rx_desc)
+
+{
+	/* XDP packets use error pointer so abort at this point */
+	if (IS_ERR(skb))
+		return true;
+
+	/* ERR_MASK will only have valid bits if EOP set, and
+	 * what we are doing here is actually checking
+	 * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
+	 * the error field
+	 */
+	if (unlikely(i40e_test_staterr(rx_desc,
+				       BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
+		dev_kfree_skb_any(skb);
+		return true;
+	}
+
+	/* if eth_skb_pad returns an error the skb was freed */
+	if (eth_skb_pad(skb))
+		return true;
+
+	return false;
+}
+
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+			   union i40e_rx_desc *rx_desc, u8 prog_id);
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+			  struct i40e_ring *xdp_ring);
+void i40e_process_skb_fields(struct i40e_ring *rx_ring,
+			     union i40e_rx_desc *rx_desc, struct sk_buff *skb,
+			     u8 rx_ptype);
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+		      struct sk_buff *skb, u16 vlan_tag);
 #endif /* _I40E_TXRX_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
new file mode 100644
index 000000000000..9d16924415b9
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Intel Corporation. */
+
+#include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
+#include <net/xdp.h>
+
+#include "i40e.h"
+#include "i40e_txrx.h"
+
+static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_umems)
+		return 0;
+
+	vsi->num_xsk_umems_used = 0;
+	vsi->num_xsk_umems = vsi->alloc_queue_pairs;
+	vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
+				 GFP_KERNEL);
+	if (!vsi->xsk_umems) {
+		vsi->num_xsk_umems = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			     u16 qid)
+{
+	int err;
+
+	err = i40e_alloc_xsk_umems(vsi);
+	if (err)
+		return err;
+
+	vsi->xsk_umems[qid] = umem;
+	vsi->num_xsk_umems_used++;
+
+	return 0;
+}
+
+static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
+{
+	vsi->xsk_umems[qid] = NULL;
+	vsi->num_xsk_umems_used--;
+
+	if (vsi->num_xsk_umems == 0) {
+		kfree(vsi->xsk_umems);
+		vsi->xsk_umems = NULL;
+		vsi->num_xsk_umems = 0;
+	}
+}
+
+static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	dev = &pf->pdev->dev;
+	for (i = 0; i < umem->npgs; i++) {
+		dma = dma_map_page_attrs(dev, umem->pgs[i], 0, PAGE_SIZE,
+					 DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		umem->pages[i].dma = dma;
+	}
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		umem->pages[i].dma = 0;
+	}
+
+	return -1;
+}
+
+static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+
+		umem->pages[i].dma = 0;
+	}
+}
+
+static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
+				u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems && vsi->xsk_umems[qid])
+		return -EBUSY;
+
+	err = i40e_xsk_umem_dma_map(vsi, umem);
+	if (err)
+		return err;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_umem(vsi, umem, qid);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
+	    !vsi->xsk_umems[qid])
+		return -EINVAL;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
+	i40e_remove_xsk_umem(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			u16 qid)
+{
+	if (umem)
+		return i40e_xsk_umem_enable(vsi, umem, qid);
+
+	return i40e_xsk_umem_disable(vsi, qid);
+}
+
+static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
+				       struct xdp_buff *xdp)
+{
+	int err, result = I40E_XDP_PASS;
+	struct i40e_ring *xdp_ring;
+	struct bpf_prog *xdp_prog;
+	u32 act;
+	u16 off;
+
+	rcu_read_lock();
+	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	off = xdp->data - xdp->data_hard_start;
+	xdp->handle += off;
+	switch (act) {
+	case XDP_PASS:
+		break;
+	case XDP_TX:
+		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		result = I40E_XDP_CONSUMED;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ERR_PTR(-result);
+}
+
+static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
+				struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	void *addr = bi->addr;
+	u64 handle;
+
+	if (addr) {
+		rx_ring->rx_stats.page_reuse_count++;
+		return true;
+	}
+
+	if (!xsk_umem_peek_addr(umem, &handle)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	bi->dma = xdp_umem_get_dma(umem, handle);
+	bi->addr = xdp_umem_get_data(umem, handle);
+
+	bi->dma += umem->headroom + XDP_PACKET_HEADROOM;
+	bi->addr += umem->headroom + XDP_PACKET_HEADROOM;
+	bi->handle = handle + umem->headroom;
+
+	xsk_umem_discard_addr(umem);
+	return true;
+}
+
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
+{
+	u16 ntu = rx_ring->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct i40e_rx_buffer *bi;
+
+	rx_desc = I40E_RX_DESC(rx_ring, ntu);
+	bi = &rx_ring->rx_bi[ntu];
+
+	do {
+		if (!i40e_alloc_frame_zc(rx_ring, bi))
+			goto no_buffers;
+
+		/* sync the buffer for use by the device */
+		dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
+						 rx_ring->rx_buf_len,
+						 DMA_BIDIRECTIONAL);
+
+		/* Refresh the desc even if buffer_addrs didn't change
+		 * because each write-back erases this info.
+		 */
+		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
+
+		rx_desc++;
+		bi++;
+		ntu++;
+		if (unlikely(ntu == rx_ring->count)) {
+			rx_desc = I40E_RX_DESC(rx_ring, 0);
+			bi = rx_ring->rx_bi;
+			ntu = 0;
+		}
+
+		/* clear the status bits for the next_to_use descriptor */
+		rx_desc->wb.qword1.status_error_len = 0;
+
+		cleaned_count--;
+	} while (cleaned_count);
+
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	return false;
+
+no_buffers:
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	/* make sure to come back via polling to try again after
+	 * allocation failure
+	 */
+	return true;
+}
+
+static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
+						    const unsigned int size)
+{
+	struct i40e_rx_buffer *rx_buffer;
+
+	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma, 0,
+				      size,
+				      DMA_BIDIRECTIONAL);
+
+	return rx_buffer;
+}
+
+static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
+				    struct i40e_rx_buffer *old_buff)
+{
+	u64 mask = rx_ring->xsk_umem->props.chunk_mask;
+	u64 hr = rx_ring->xsk_umem->headroom;
+	u16 nta = rx_ring->next_to_alloc;
+	struct i40e_rx_buffer *new_buff;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma		= old_buff->dma & mask;
+	new_buff->addr		= (void *)((u64)old_buff->addr & mask);
+	new_buff->handle	= old_buff->handle & mask;
+
+	new_buff->dma += hr + XDP_PACKET_HEADROOM;
+	new_buff->addr += hr + XDP_PACKET_HEADROOM;
+	new_buff->handle += hr;
+}
+
+/* Called from the XDP return API in NAPI context. */
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
+{
+	struct i40e_rx_buffer *new_buff;
+	struct i40e_ring *rx_ring;
+	u64 mask;
+	u16 nta;
+
+	rx_ring = container_of(alloc, struct i40e_ring, zca);
+	mask = rx_ring->xsk_umem->props.chunk_mask;
+
+	nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	handle &= mask;
+
+	new_buff->dma		= xdp_umem_get_dma(rx_ring->xsk_umem, handle);
+	new_buff->addr		= xdp_umem_get_data(rx_ring->xsk_umem, handle);
+	new_buff->handle	= (u64)handle;
+
+	new_buff->dma += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+	new_buff->addr += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+	new_buff->handle += rx_ring->xsk_umem->headroom;
+}
+
+static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
+					    struct i40e_rx_buffer *rx_buffer,
+					    struct xdp_buff *xdp)
+{
+	/* XXX implement alloc skb and copy */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	return NULL;
+}
+
+static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
+					     union i40e_rx_desc *rx_desc,
+					     u64 qw)
+{
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
+	u8 id;
+
+	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+	/* place unused page back on the ring */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->addr = NULL;
+
+	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
+		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
+
+	if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
+		i40e_fd_handle_status(rx_ring, rx_desc, id);
+}
+
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
+{
+	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	bool failure = false, xdp_xmit = false;
+	struct sk_buff *skb;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	while (likely(total_rx_packets < (unsigned int)budget)) {
+		struct i40e_rx_buffer *rx_buffer;
+		union i40e_rx_desc *rx_desc;
+		unsigned int size;
+		u16 vlan_tag;
+		u8 rx_ptype;
+		u64 qword;
+		u32 ntc;
+
+		/* return some buffers to hardware, one at a time is too slow */
+		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
+			failure = failure ||
+				  i40e_alloc_rx_buffers_zc(rx_ring,
+							   cleaned_count);
+			cleaned_count = 0;
+		}
+
+		rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+
+		/* status_error_len will always be zero for unused descriptors
+		 * because it's cleared in cleanup, and overlaps with hdr_addr
+		 * which is always zero because packet split isn't used, if the
+		 * hardware wrote DD then the length will be non-zero
+		 */
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+
+		/* This memory barrier is needed to keep us from reading
+		 * any other fields out of the rx_desc until we have
+		 * verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		if (unlikely(i40e_rx_is_programming_status(qword))) {
+			i40e_clean_programming_status_zc(rx_ring, rx_desc,
+							 qword);
+			cleaned_count++;
+			continue;
+		}
+		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+		if (!size)
+			break;
+
+		rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
+
+		/* retrieve a buffer from the ring */
+		xdp.data = rx_buffer->addr;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+		xdp.data_end = xdp.data + size;
+		xdp.handle = rx_buffer->handle;
+
+		skb = i40e_run_xdp_zc(rx_ring, &xdp);
+
+		if (IS_ERR(skb)) {
+			if (PTR_ERR(skb) == -I40E_XDP_TX)
+				xdp_xmit = true;
+			else
+				i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+			total_rx_bytes += size;
+			total_rx_packets++;
+		} else {
+			skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
+			if (!skb) {
+				rx_ring->rx_stats.alloc_buff_failed++;
+				break;
+			}
+		}
+
+		rx_buffer->addr = NULL;
+		cleaned_count++;
+
+		/* don't care about non-EOP frames in XDP mode */
+		ntc = rx_ring->next_to_clean + 1;
+		ntc = (ntc < rx_ring->count) ? ntc : 0;
+		rx_ring->next_to_clean = ntc;
+		prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+		if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
+			skb = NULL;
+			continue;
+		}
+
+		/* probably a little skewed due to removing CRC */
+		total_rx_bytes += skb->len;
+
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+		rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+			   I40E_RXD_QW1_PTYPE_SHIFT;
+
+		/* populate checksum, VLAN, and protocol */
+		i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+		vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+			   le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+
+		i40e_receive_skb(rx_ring, skb, vlan_tag);
+		skb = NULL;
+
+		/* update budget accounting */
+		total_rx_packets++;
+	}
+
+	if (xdp_xmit) {
+		struct i40e_ring *xdp_ring =
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+
+		i40e_xdp_ring_update_tail(xdp_ring);
+		xdp_do_flush_map();
+	}
+
+	u64_stats_update_begin(&rx_ring->syncp);
+	rx_ring->stats.packets += total_rx_packets;
+	rx_ring->stats.bytes += total_rx_bytes;
+	u64_stats_update_end(&rx_ring->syncp);
+	rx_ring->q_vector->rx.total_packets += total_rx_packets;
+	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+
+	/* guarantee a trip back through this routine if there was a failure */
+	return failure ? budget : (int)total_rx_packets;
+}
+
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
new file mode 100644
index 000000000000..757ac5ca8511
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2018 Intel Corporation. */
+
+#ifndef _I40E_XSK_H_
+#define _I40E_XSK_H_
+
+struct i40e_vsi;
+struct xdp_umem;
+struct zero_copy_allocator;
+
+int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			u16 qid);
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+
+#endif /* _I40E_XSK_H_ */
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 9fe472f2ac95..ec8fd3314097 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -94,6 +94,25 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
 	return false;
 }
+
+static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
+{
+	return NULL;
+}
+
+static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
+{
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
+}
+
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+}
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index f11560334f88..c8be1ad3eb88 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -8,16 +8,6 @@
 
 #include <net/xdp_sock.h>
 
-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
-{
-	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
-}
-
-static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
-{
-	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
-}
-
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (8 preceding siblings ...)
  2018-06-04 12:05 ` [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx Björn Töpel
@ 2018-06-04 12:06 ` Björn Töpel
  2018-06-04 20:53   ` Alexander Duyck
  2018-06-05 12:43   ` Jesper Dangaard Brouer
  2018-06-04 12:06 ` [PATCH bpf-next 11/11] samples/bpf: xdpsock: use skb Tx path for XDP_SKB Björn Töpel
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
XDP Tx rings are used for zero-copy. As a consequence, other devices
doing XDP_REDIRECT to an AF_XDP enabled queue will have their packets
dropped.
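
The core of the zero-copy Tx path looks roughly like the condensed
sketch below (see i40e_xmit_zc() in the patch for the full version;
the ring-full check, ring wrap, stats and the RS-bit handling on the
last descriptor are omitted):

	/* Sketch: send frames handed over by user space on the XDP Tx ring */
	while (budget-- && xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len)) {
		dma_sync_single_for_device(xdp_ring->dev, dma, len,
					   DMA_BIDIRECTIONAL);
		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
		tx_desc->buffer_addr = cpu_to_le64(dma);
		tx_desc->cmd_type_offset_bsz =
			build_ctob(I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_EOP,
				   0, len, 0);
	}
	i40e_xdp_ring_update_tail(xdp_ring);          /* bump HW tail pointer */
	xsk_umem_consume_tx_done(xdp_ring->xsk_umem); /* signal Tx ring consumption */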

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  93 +++++++++++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  23 +++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 140 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   2 +
 include/net/xdp_sock.h                      |  14 +++
 6 files changed, 242 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 8c602424d339..98c18c41809d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3073,8 +3073,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
-	if (ring_is_xdp(ring))
+	ring->clean_tx_irq = i40e_clean_tx_irq;
+	if (ring_is_xdp(ring)) {
 		ring->xsk_umem = i40e_xsk_umem(ring);
+		if (ring->xsk_umem)
+			ring->clean_tx_irq = i40e_clean_tx_irq_zc;
+	}
 
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
@@ -12162,6 +12166,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
 	.ndo_xdp_flush		= i40e_xdp_flush,
+	.ndo_xsk_async_xmit	= i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6b1142fbc697..923bb84a93ab 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -10,16 +10,6 @@
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
 
-static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
-				u32 td_tag)
-{
-	return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
-			   ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
-			   ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
-			   ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
-			   ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
-}
-
 #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
 /**
  * i40e_fdir - Generate a Flow Director descriptor based on fdata
@@ -649,9 +639,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		return;
 
-	/* Free all the Tx ring sk_buffs */
-	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+	/* Cleanup only needed for non XSK TX ZC rings */
+	if (!tx_ring->xsk_umem) {
+		/* Free all the Tx ring sk_buffs */
+		for (i = 0; i < tx_ring->count; i++)
+			i40e_unmap_and_free_tx_resource(tx_ring,
+							&tx_ring->tx_bi[i]);
+	}
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
@@ -768,8 +762,40 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
+void i40e_update_tx_stats(struct i40e_ring *tx_ring,
+			  unsigned int total_packets,
+			  unsigned int total_bytes)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+}
+
 #define WB_STRIDE 4
 
+void i40e_arm_wb(struct i40e_ring *tx_ring,
+		 struct i40e_vsi *vsi,
+		 int budget)
+{
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && (j > 0) &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -778,8 +804,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget)
 {
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
@@ -874,27 +900,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
-
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring, false);
 
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
+	i40e_arm_wb(tx_ring, vsi, budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
@@ -2467,10 +2475,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_tx_irq(vsi, ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
+
 		arm_wb |= ring->arm_wb;
 		ring->arm_wb = false;
 	}
@@ -3595,6 +3604,12 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
+	/* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
+	 * will drop incoming packets redirected by other devices!
+	 */
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return -ENXIO;
+
 	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
 		return -EINVAL;
 
@@ -3633,5 +3648,11 @@ void i40e_xdp_flush(struct net_device *dev)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return;
 
+	/* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
+	 * will drop incoming packets redirected by other devices!
+	 */
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return;
+
 	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index cddb185cd2f8..b9c42c352a8d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -426,6 +426,8 @@ struct i40e_ring {
 
 	int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
 	bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
+	bool (*clean_tx_irq)(struct i40e_vsi *vsi, struct i40e_ring *ring,
+			     int budget);
 	struct xdp_umem *xsk_umem;
 
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
@@ -506,6 +508,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		  u32 flags);
 void i40e_xdp_flush(struct net_device *dev);
 int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget);
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -687,6 +692,16 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
 }
 
+static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
+				u32 td_tag)
+{
+	return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
+			   ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
+			   ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
+			   ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
+			   ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
+}
+
 void i40e_fd_handle_status(struct i40e_ring *rx_ring,
 			   union i40e_rx_desc *rx_desc, u8 prog_id);
 int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
@@ -696,4 +711,12 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 			     u8 rx_ptype);
 void i40e_receive_skb(struct i40e_ring *rx_ring,
 		      struct sk_buff *skb, u16 vlan_tag);
+
+void i40e_update_tx_stats(struct i40e_ring *tx_ring,
+			  unsigned int total_packets,
+			  unsigned int total_bytes);
+void i40e_arm_wb(struct i40e_ring *tx_ring,
+		 struct i40e_vsi *vsi,
+		 int budget);
+
 #endif /* _I40E_TXRX_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 9d16924415b9..021fec5b5799 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -535,3 +535,143 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+/* Returns true if the work is finished */
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+	unsigned int total_packets = 0, total_bytes = 0;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	bool work_done = true;
+	dma_addr_t dma;
+	u32 len;
+
+	while (budget-- > 0) {
+		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			work_done = false;
+			break;
+		}
+
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+			break;
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_BIDIRECTIONAL);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+		tx_bi->gso_segs = 1;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+							| I40E_TX_DESC_CMD_EOP,
+							  0, len, 0);
+
+		total_packets++;
+		total_bytes += len;
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+	}
+
+	if (total_packets > 0) {
+		/* Request an interrupt for the last frame and bump tail ptr. */
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+		i40e_xdp_ring_update_tail(xdp_ring);
+
+		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
+		i40e_update_tx_stats(xdp_ring, total_packets, total_bytes);
+	}
+
+	return !!budget && work_done;
+}
+
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget)
+{
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 head_idx = i40e_get_head(tx_ring);
+	unsigned int budget = vsi->work_limit;
+	bool work_done = true, xmit_done;
+	u32 completed_frames;
+	u32 frames_ready;
+
+	if (head_idx < tx_ring->next_to_clean)
+		head_idx += tx_ring->count;
+	frames_ready = head_idx - tx_ring->next_to_clean;
+
+	if (frames_ready == 0) {
+		goto out_xmit;
+	} else if (frames_ready > budget) {
+		completed_frames = budget;
+		work_done = false;
+	} else {
+		completed_frames = frames_ready;
+	}
+
+	tx_ring->next_to_clean += completed_frames;
+	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
+		tx_ring->next_to_clean -= tx_ring->count;
+
+	xsk_umem_complete_tx(umem, completed_frames);
+
+	i40e_arm_wb(tx_ring, vsi, budget);
+
+out_xmit:
+	xmit_done = i40e_xmit_zc(tx_ring, budget);
+
+	return work_done && xmit_done;
+}
+
+/**
+ * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -ENETDOWN;
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return -ENXIO;
+
+	if (queue_id >= vsi->num_queue_pairs)
+		return -ENXIO;
+
+	if (!vsi->xdp_rings[queue_id]->xsk_umem)
+		return -ENXIO;
+
+	ring = vsi->xdp_rings[queue_id];
+
+	if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
+		i40e_force_wb(vsi, ring->q_vector);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
index 757ac5ca8511..bd006f1a4397 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -13,5 +13,7 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
 void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
 int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget);
 
 #endif /* _I40E_XSK_H_ */
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ec8fd3314097..63aa05abf11d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -103,6 +103,20 @@ static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
 {
 }
+
+static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+}
+
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+				       u32 *len)
+{
+	return false;
+}
+
+static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
+{
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next 11/11] samples/bpf: xdpsock: use skb Tx path for XDP_SKB
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (9 preceding siblings ...)
  2018-06-04 12:06 ` [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx Björn Töpel
@ 2018-06-04 12:06 ` Björn Töpel
  2018-06-04 16:38 ` [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Alexei Starovoitov
  2018-11-14  8:10 ` af_xdp zero copy ideas Michael S. Tsirkin
  12 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-04 12:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Make sure that XDP_SKB mode also uses the skb Tx path, by having the
-S option set the XDP_COPY bind flag.
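
With this in place, the rxdrop benchmark can be run in copy mode the
same way as the zero-copy example earlier in the series, only with -S
instead of -N:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -S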

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 samples/bpf/xdpsock_user.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 7494f60fbff8..d69c8d78d3fd 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -75,6 +75,7 @@ static int opt_queue;
 static int opt_poll;
 static int opt_shared_packet_buffer;
 static int opt_interval = 1;
+static u32 opt_xdp_bind_flags;
 
 struct xdp_umem_uqueue {
 	u32 cached_prod;
@@ -541,9 +542,12 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = opt_ifindex;
 	sxdp.sxdp_queue_id = opt_queue;
+
 	if (shared) {
 		sxdp.sxdp_flags = XDP_SHARED_UMEM;
 		sxdp.sxdp_shared_umem_fd = umem->fd;
+	} else {
+		sxdp.sxdp_flags = opt_xdp_bind_flags;
 	}
 
 	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
@@ -699,6 +703,7 @@ static void parse_command_line(int argc, char **argv)
 			break;
 		case 'S':
 			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			opt_xdp_bind_flags |= XDP_COPY;
 			break;
 		case 'N':
 			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (10 preceding siblings ...)
  2018-06-04 12:06 ` [PATCH bpf-next 11/11] samples/bpf: xdpsock: use skb Tx path for XDP_SKB Björn Töpel
@ 2018-06-04 16:38 ` Alexei Starovoitov
  2018-06-04 20:29   ` [Intel-wired-lan] " Jeff Kirsher
  2018-11-14  8:10 ` af_xdp zero copy ideas Michael S. Tsirkin
  12 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2018-06-04 16:38 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch serie introduces zerocopy (ZC) support for
> AF_XDP. Programs using AF_XDP sockets will now receive RX packets
> without any copies and can also transmit packets without incurring any
> copies. No modifications to the application are needed, but the NIC
> driver needs to be modified to support ZC. If ZC is not supported by
> the driver, the modes introduced in the AF_XDP patch will be
> used. Using ZC in our micro benchmarks results in significantly
> improved performance as can be seen in the performance section later
> in this cover letter.
> 
> Note that for an untrusted application, HW packet steering to a
> specific queue pair (the one associated with the application) is a
> requirement when using ZC, as the application would otherwise be able
> to see other user space processes' packets. If the HW cannot support
> the required packet steering you need to use the XDP_SKB mode or the
> XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
> patch set can be used to do load balancing in that case.
> 
> For benchmarking, you can use the xdpsock application from the AF_XDP
> patch set without any modifications. Say that you would like your UDP
> traffic from port 4242 to end up in queue 16, that we will enable
> AF_XDP on. Here, we use ethtool for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then be
> done using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are without
> retpoline so that we can compare against previous numbers. 
> 
> AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
> set are also reported for ease of reference. The numbers within
> parantheses are from the RFC V1 ZC patch set.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.9*       9.6*       21.1(21.5)
> txpush       2.6*       -          22.0(21.6)
> l2fwd        1.9*       2.5*       15.3(15.0)
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.1*       3.3*       3.3(3.3)
> l2fwd        1.4*       1.8*       3.1(3.1)
> 
> * From AF_XDP V3 patch set and cover letter.
> 
> So why do we not get higher values for RX similar to the 34 Mpps we
> had in AF_PACKET V4? We made an experiment running the rxdrop
> benchmark without using the xdp_do_redirect/flush infrastructure nor
> using an XDP program (all traffic on a queue goes to one
> socket). Instead the driver acts directly on the AF_XDP socket. With
> this we got 36.9 Mpps, a significant improvement without any change to
> the uapi. So not forcing users to have an XDP program if they do not
> need it, might be a good idea. This measurement is actually higher
> than what we got with AF_PACKET V4.
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32.3M  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
> 
> The structure of the patch set is as follows:
> 
> Patches 1-3: Plumbing for AF_XDP ZC support
> Patches 4-5: AF_XDP ZC for RX
> Patches 6-7: AF_XDP ZC for TX

Acked-by: Alexei Starovoitov <ast@kernel.org>
for above patches

> Patch 8-10: ZC support for i40e.

these also look good to me.
would be great if i40e experts take a look at them asap.

If there are no major objections we'd like to merge all of it
for this merge window.

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Intel-wired-lan] [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support
  2018-06-04 16:38 ` [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Alexei Starovoitov
@ 2018-06-04 20:29   ` Jeff Kirsher
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Kirsher @ 2018-06-04 20:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Björn Töpel
  Cc: mykyta.iziumtsev, mst, brian.brooks, magnus.karlsson, andy,
	francois.ozog, willemdebruijn.kernel, daniel, ast,
	intel-wired-lan, brouer, Björn Töpel,
	michael.lundkvist, qi.z.zhang, michael.chan, magnus.karlsson,
	netdev, ilias.apalodimas


On Mon, 2018-06-04 at 09:38 -0700, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> > 
> > This patch series introduces zerocopy (ZC) support for
> > AF_XDP. Programs using AF_XDP sockets will now receive RX packets
> > without any copies and can also transmit packets without incurring
> > any
> > copies. No modifications to the application are needed, but the NIC
> > driver needs to be modified to support ZC. If ZC is not supported
> > by
> > the driver, the modes introduced in the AF_XDP patch will be
> > used. Using ZC in our micro benchmarks results in significantly
> > improved performance as can be seen in the performance section
> > later
> > in this cover letter.
> > 
> > Note that for an untrusted application, HW packet steering to a
> > specific queue pair (the one associated with the application) is a
> > requirement when using ZC, as the application would otherwise be
> > able
> > to see other user space processes' packets. If the HW cannot
> > support
> > the required packet steering you need to use the XDP_SKB mode or
> > the
> > XDP_DRV mode without ZC turned on. The XSKMAP introduced in the
> > AF_XDP
> > patch set can be used to do load balancing in that case.
> > 
> > For benchmarking, you can use the xdpsock application from the
> > AF_XDP
> > patch set without any modifications. Say that you would like your
> > UDP
> > traffic from port 4242 to end up in queue 16, that we will enable
> > AF_XDP on. Here, we use ethtool for this:
> > 
> >       ethtool -N p3p2 rx-flow-hash udp4 fn
> >       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
> >           action 16
> > 
> > Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then
> > be
> > done using:
> > 
> >       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> > 
> > We have run some benchmarks on a dual socket system with two
> > Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has
> > 14
> > cores which gives a total of 28, but only two cores are used in
> > these
> > experiments. One for TX/RX and one for the user space application.
> > The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of
> > total
> > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0.
> > The
> > NIC is Intel I40E 40Gbit/s using the i40e driver.
> > 
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > and 1500 byte packets, generated by a commercial packet generator
> > HW
> > outputting packets at full 40 Gbit/s line rate. The results are
> > without
> > retpoline so that we can compare against previous numbers. 
> > 
> > AF_XDP performance 64 byte packets. Results from the AF_XDP V3
> > patch
> > set are also reported for ease of reference. The numbers within
> > parentheses are from the RFC V1 ZC patch set.
> > Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> > rxdrop       2.9*       9.6*       21.1(21.5)
> > txpush       2.6*       -          22.0(21.6)
> > l2fwd        1.9*       2.5*       15.3(15.0)
> > 
> > AF_XDP performance 1500 byte packets:
> > Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> > rxdrop       2.1*       3.3*       3.3(3.3)
> > l2fwd        1.4*       1.8*       3.1(3.1)
> > 
> > * From AF_XDP V3 patch set and cover letter.
> > 
> > So why do we not get higher values for RX similar to the 34 Mpps we
> > had in AF_PACKET V4? We made an experiment running the rxdrop
> > benchmark without using the xdp_do_redirect/flush infrastructure
> > nor
> > using an XDP program (all traffic on a queue goes to one
> > socket). Instead the driver acts directly on the AF_XDP socket.
> > With
> > this we got 36.9 Mpps, a significant improvement without any change
> > to
> > the uapi. So not forcing users to have an XDP program if they do
> > not
> > need it might be a good idea. This measurement is actually higher
> > than what we got with AF_PACKET V4.
> > 
> > XDP performance on our system as a base line:
> > 
> > 64 byte packets:
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      16      32.3M  0
> > 
> > 1500 byte packets:
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      16      3.3M    0
> > 
> > The structure of the patch set is as follows:
> > 
> > Patches 1-3: Plumbing for AF_XDP ZC support
> > Patches 4-5: AF_XDP ZC for RX
> > Patches 6-7: AF_XDP ZC for TX
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> for above patches
> 
> > Patches 8-10: ZC support for i40e.
> 
> these also look good to me.
> would be great if i40e experts take a look at them asap.
> 
> If there are no major objections we'd like to merge all of it
> for this merge window.

We would like a bit more time to review and test the changes. I
understand your eagerness to get this into 4.18, but this change is
large enough that a 24-48 hour review time is not prudent, IMHO.

Alex has also requested more time so that he can review the changes as
well.  I will go ahead and put the entire series in my tree so that
our validation team can start to "kick the tires".



* Re: [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx
  2018-06-04 12:05 ` [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx Björn Töpel
@ 2018-06-04 20:35   ` Alexander Duyck
  2018-06-07  7:40     ` Björn Töpel
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Duyck @ 2018-06-04 20:35 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, mykyta.iziumtsev, Björn Töpel, John Fastabend,
	Willem de Bruijn, Michael S. Tsirkin, michael.lundkvist,
	Brandeburg, Jesse, Anjali Singhai Jain, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks

On Mon, Jun 4, 2018 at 5:05 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This commit adds initial AF_XDP zero-copy support for i40e-based
> NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
> XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
> to pass a UMEM to the driver. The driver will then DMA map all the
> frames in the UMEM for the driver. Next, the Rx code will allocate
> frames from the UMEM fill queue, instead of the regular page
> allocator.
>
> Externally, for the rest of the XDP code, the driver internal UMEM
> allocator will appear as a MEM_TYPE_ZERO_COPY.
>
> The commit also introduces completely new clean_rx_irq/allocator
> functions for zero-copy, and means (function pointers) to set
> allocators and clean_rx functions.
>
> This first version does not support:
> * passing frames to the stack via XDP_PASS (clone/copy to skb).
> * doing XDP redirect to other than AF_XDP sockets
>   (convert_to_xdp_frame does not clone the frame yet).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h      |  23 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++-------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++++++-
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
>  include/net/xdp_sock.h                      |  19 +
>  net/xdp/xdp_umem.h                          |  10 -
>  9 files changed, 789 insertions(+), 146 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
>
> diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
> index 14397e7e9925..50590e8d1fd1 100644
> --- a/drivers/net/ethernet/intel/i40e/Makefile
> +++ b/drivers/net/ethernet/intel/i40e/Makefile
> @@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
>         i40e_txrx.o     \
>         i40e_ptp.o      \
>         i40e_client.o   \
> -       i40e_virtchnl_pf.o
> +       i40e_virtchnl_pf.o \
> +       i40e_xsk.o
>
>  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 7a80652e2500..20955e5dce02 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -786,6 +786,12 @@ struct i40e_vsi {
>
>         /* VSI specific handlers */
>         irqreturn_t (*irq_handler)(int irq, void *data);
> +
> +       /* AF_XDP zero-copy */
> +       struct xdp_umem **xsk_umems;
> +       u16 num_xsk_umems_used;
> +       u16 num_xsk_umems;
> +
>  } ____cacheline_internodealigned_in_smp;
>
>  struct i40e_netdev_priv {
> @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
>         return !!vsi->xdp_prog;
>  }
>
> +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> +{
> +       bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> +       int qid = ring->queue_index;
> +
> +       if (ring_is_xdp(ring))
> +               qid -= ring->vsi->alloc_queue_pairs;
> +
> +       if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> +               return NULL;
> +
> +       return ring->vsi->xsk_umems[qid];
> +}
> +
>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
>  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> @@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>  int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>                                       struct i40e_cloud_filter *filter,
>                                       bool add);
> +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
> +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
> +
>  #endif /* _I40E_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 369a116edaa1..8c602424d339 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include <linux/of_net.h>
>  #include <linux/pci.h>
>  #include <linux/bpf.h>
> +#include <net/xdp_sock.h>
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -16,6 +17,7 @@
>   */
>  #define CREATE_TRACE_POINTS
>  #include "i40e_trace.h"
> +#include "i40e_xsk.h"
>
>  const char i40e_driver_name[] = "i40e";
>  static const char i40e_driver_string[] =
> @@ -3071,6 +3073,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> +       if (ring_is_xdp(ring))
> +               ring->xsk_umem = i40e_xsk_umem(ring);
> +
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
>                 ring->atr_sample_rate = vsi->back->atr_sample_rate;
> @@ -3180,13 +3185,30 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         struct i40e_hw *hw = &vsi->back->hw;
>         struct i40e_hmc_obj_rxq rx_ctx;
>         i40e_status err = 0;
> +       int ret;
>
>         bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
>
>         /* clear the context structure first */
>         memset(&rx_ctx, 0, sizeof(rx_ctx));
>
> -       ring->rx_buf_len = vsi->rx_buf_len;
> +       ring->xsk_umem = i40e_xsk_umem(ring);
> +       if (ring->xsk_umem) {
> +               ring->clean_rx_irq = i40e_clean_rx_irq_zc;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
> +               ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
> +                                  XDP_PACKET_HEADROOM;
> +               ring->zca.free = i40e_zca_free;
> +               ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> +                                                MEM_TYPE_ZERO_COPY,
> +                                                &ring->zca);
> +               if (ret)
> +                       return ret;
> +       } else {
> +               ring->clean_rx_irq = i40e_clean_rx_irq;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
> +               ring->rx_buf_len = vsi->rx_buf_len;
> +       }

With everything that is going on with retpoline overhead I am really
wary of this. We may want to look at another way to do this, such as a
flag, so that we can avoid the extra function pointer overhead.
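
For illustration, a minimal sketch of the flag-based alternative
suggested above (not part of the posted series; ring_is_zc(),
set_ring_zc() and the flag bit are hypothetical names):

      /* Setup time, e.g. in i40e_configure_rx_ring(): record the ZC state
       * in a ring flag instead of installing per-ring function pointers.
       */
      if (ring->xsk_umem)
              set_ring_zc(ring);

      /* Hot path, e.g. in i40e_napi_poll(): a predictable direct branch
       * instead of a retpolined indirect call through ring->clean_rx_irq.
       */
      if (ring_is_zc(ring))
              cleaned = i40e_clean_rx_irq_zc(ring, budget_per_ring);
      else
              cleaned = i40e_clean_rx_irq(ring, budget_per_ring);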

>         rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
>                                     BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> @@ -3242,7 +3264,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>         writel(0, ring->tail);
>
> -       i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
> +       ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>
>         return 0;
>  }
> @@ -12022,7 +12044,7 @@ static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
>   *
>   * Returns 0 on success, <0 on failure.
>   **/
> -static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
> +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
>  {
>         int err;
>
> @@ -12047,7 +12069,7 @@ static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
>   *
>   * Returns 0 on success, <0 on failure.
>   **/
> -static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
> +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
>  {
>         int err;
>
> @@ -12095,6 +12117,11 @@ static int i40e_xdp(struct net_device *dev,
>                 xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>                 xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
>                 return 0;
> +       case XDP_QUERY_XSK_UMEM:
> +               return 0;
> +       case XDP_SETUP_XSK_UMEM:
> +               return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
> +                                          xdp->xsk.queue_id);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 5f01e4ce9c92..6b1142fbc697 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -5,6 +5,7 @@
>  #include <net/busy_poll.h>
>  #include <linux/bpf_trace.h>
>  #include <net/xdp.h>
> +#include <net/xdp_sock.h>
>  #include "i40e.h"
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
> @@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
>   * This is used to verify if the FD programming or invalidation
>   * requested by SW to the HW is successful or not and take actions accordingly.
>   **/
> -static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> -                                 union i40e_rx_desc *rx_desc, u8 prog_id)
> +void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> +                          union i40e_rx_desc *rx_desc, u8 prog_id)
>  {
>         struct i40e_pf *pf = rx_ring->vsi->back;
>         struct pci_dev *pdev = pf->pdev;
> @@ -1246,25 +1247,6 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
>         new_buff->pagecnt_bias  = old_buff->pagecnt_bias;
>  }
>
> -/**
> - * i40e_rx_is_programming_status - check for programming status descriptor
> - * @qw: qword representing status_error_len in CPU ordering
> - *
> - * The value of in the descriptor length field indicate if this
> - * is a programming status descriptor for flow director or FCoE
> - * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
> - * it is a packet descriptor.
> - **/
> -static inline bool i40e_rx_is_programming_status(u64 qw)
> -{
> -       /* The Rx filter programming status and SPH bit occupy the same
> -        * spot in the descriptor. Since we don't support packet split we
> -        * can just reuse the bit as an indication that this is a
> -        * programming status descriptor.
> -        */
> -       return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
> -}
> -
>  /**
>   * i40e_clean_programming_status - clean the programming status descriptor
>   * @rx_ring: the rx ring that has this descriptor
> @@ -1373,31 +1355,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>         }
>
>         /* Free all the Rx ring sk_buffs */
> -       for (i = 0; i < rx_ring->count; i++) {
> -               struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> -
> -               if (!rx_bi->page)
> -                       continue;
> +       if (!rx_ring->xsk_umem) {

Instead of changing the indent on all this code it would probably be
easier to just add a goto and a label to skip all this.
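
A rough sketch of that structure, so the existing free loop keeps its
current indentation (the label name is made up for illustration):

      if (rx_ring->xsk_umem)
              goto skip_free;

      /* Free all the Rx ring sk_buffs */
      for (i = 0; i < rx_ring->count; i++) {
              struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];

              /* existing sync/unmap/free body, unchanged */
      }

skip_free:
      bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;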

> +               for (i = 0; i < rx_ring->count; i++) {
> +                       struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
>
> -               /* Invalidate cache lines that may have been written to by
> -                * device so that we avoid corrupting memory.
> -                */
> -               dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                             rx_bi->dma,
> -                                             rx_bi->page_offset,
> -                                             rx_ring->rx_buf_len,
> -                                             DMA_FROM_DEVICE);
> -
> -               /* free resources associated with mapping */
> -               dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> -                                    i40e_rx_pg_size(rx_ring),
> -                                    DMA_FROM_DEVICE,
> -                                    I40E_RX_DMA_ATTR);
> -
> -               __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
> +                       if (!rx_bi->page)
> +                               continue;
>
> -               rx_bi->page = NULL;
> -               rx_bi->page_offset = 0;
> +                       /* Invalidate cache lines that may have been
> +                        * written to by device so that we avoid
> +                        * corrupting memory.
> +                        */
> +                       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                                     rx_bi->dma,
> +                                                     rx_bi->page_offset,
> +                                                     rx_ring->rx_buf_len,
> +                                                     DMA_FROM_DEVICE);
> +
> +                       /* free resources associated with mapping */
> +                       dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> +                                            i40e_rx_pg_size(rx_ring),
> +                                            DMA_FROM_DEVICE,
> +                                            I40E_RX_DMA_ATTR);
> +
> +                       __page_frag_cache_drain(rx_bi->page,
> +                                               rx_bi->pagecnt_bias);
> +
> +                       rx_bi->page = NULL;
> +                       rx_bi->page_offset = 0;
> +               }
>         }
>
>         bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
> @@ -1487,27 +1473,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
>         return err;
>  }
>
> -/**
> - * i40e_release_rx_desc - Store the new tail and head values
> - * @rx_ring: ring to bump
> - * @val: new head index
> - **/
> -static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
> -{
> -       rx_ring->next_to_use = val;
> -
> -       /* update next to alloc since we have filled the ring */
> -       rx_ring->next_to_alloc = val;
> -
> -       /* Force memory writes to complete before letting h/w
> -        * know there are new descriptors to fetch.  (Only
> -        * applicable for weak-ordered memory model archs,
> -        * such as IA-64).
> -        */
> -       wmb();
> -       writel(val, rx_ring->tail);
> -}
> -
>  /**
>   * i40e_rx_offset - Return expected offset into page to access data
>   * @rx_ring: Ring we are requesting offset of
> @@ -1576,8 +1541,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
>   * @skb: packet to send up
>   * @vlan_tag: vlan tag for packet
>   **/
> -static void i40e_receive_skb(struct i40e_ring *rx_ring,
> -                            struct sk_buff *skb, u16 vlan_tag)
> +void i40e_receive_skb(struct i40e_ring *rx_ring,
> +                     struct sk_buff *skb, u16 vlan_tag)
>  {
>         struct i40e_q_vector *q_vector = rx_ring->q_vector;
>
> @@ -1804,7 +1769,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
>   * order to populate the hash, checksum, VLAN, protocol, and
>   * other fields within the skb.
>   **/
> -static inline
>  void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>                              union i40e_rx_desc *rx_desc, struct sk_buff *skb,
>                              u8 rx_ptype)
> @@ -1829,46 +1793,6 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>         skb->protocol = eth_type_trans(skb, rx_ring->netdev);
>  }
>
> -/**
> - * i40e_cleanup_headers - Correct empty headers
> - * @rx_ring: rx descriptor ring packet is being transacted on
> - * @skb: pointer to current skb being fixed
> - * @rx_desc: pointer to the EOP Rx descriptor
> - *
> - * Also address the case where we are pulling data in on pages only
> - * and as such no data is present in the skb header.
> - *
> - * In addition if skb is not at least 60 bytes we need to pad it so that
> - * it is large enough to qualify as a valid Ethernet frame.
> - *
> - * Returns true if an error was encountered and skb was freed.
> - **/
> -static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
> -                                union i40e_rx_desc *rx_desc)
> -
> -{
> -       /* XDP packets use error pointer so abort at this point */
> -       if (IS_ERR(skb))
> -               return true;
> -
> -       /* ERR_MASK will only have valid bits if EOP set, and
> -        * what we are doing here is actually checking
> -        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> -        * the error field
> -        */
> -       if (unlikely(i40e_test_staterr(rx_desc,
> -                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> -               dev_kfree_skb_any(skb);
> -               return true;
> -       }
> -
> -       /* if eth_skb_pad returns an error the skb was freed */
> -       if (eth_skb_pad(skb))
> -               return true;
> -
> -       return false;
> -}
> -
>  /**
>   * i40e_page_is_reusable - check if any reuse is possible
>   * @page: page struct to check
> @@ -2177,15 +2101,11 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
>         return true;
>  }
>
> -#define I40E_XDP_PASS 0
> -#define I40E_XDP_CONSUMED 1
> -#define I40E_XDP_TX 2
> -
>  static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
>                               struct i40e_ring *xdp_ring);
>
> -static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> -                                struct i40e_ring *xdp_ring)
> +int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> +                         struct i40e_ring *xdp_ring)
>  {
>         struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
>
> @@ -2214,8 +2134,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>         if (!xdp_prog)
>                 goto xdp_out;
>
> -       prefetchw(xdp->data_hard_start); /* xdp_frame write */
> -
>         act = bpf_prog_run_xdp(xdp_prog, xdp);
>         switch (act) {
>         case XDP_PASS:
> @@ -2263,15 +2181,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
>  #endif
>  }
>
> -static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> -{
> -       /* Force memory writes to complete before letting h/w
> -        * know there are new descriptors to fetch.
> -        */
> -       wmb();
> -       writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
> -}
> -
>  /**
>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>   * @rx_ring: rx descriptor ring to transact packets on
> @@ -2284,7 +2193,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>   *
>   * Returns amount of work completed
>   **/
> -static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>  {
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>         struct sk_buff *skb = rx_ring->skb;
> @@ -2576,7 +2485,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>         budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
>
>         i40e_for_each_ring(ring, q_vector->rx) {
> -               int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
> +               int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
>
>                 work_done += cleaned;
>                 /* if we clean as many as budgeted, we must not be done */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 820f76db251b..cddb185cd2f8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -296,13 +296,22 @@ struct i40e_tx_buffer {
>
>  struct i40e_rx_buffer {
>         dma_addr_t dma;
> -       struct page *page;
> +       union {
> +               struct {
> +                       struct page *page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> -       __u32 page_offset;
> +                       __u32 page_offset;
>  #else
> -       __u16 page_offset;
> +                       __u16 page_offset;
>  #endif
> -       __u16 pagecnt_bias;
> +                       __u16 pagecnt_bias;
> +               };
> +               struct {
> +                       /* for umem */
> +                       void *addr;
> +                       u64 handle;
> +               };

It might work better to just do this as a pair of unions. One for
page/addr and another for handle, page_offset, and pagecnt_bias.
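
A sketch of that layout, untested and only for illustration: one union
for the buffer pointer and one for the offset/handle fields.

      struct i40e_rx_buffer {
              dma_addr_t dma;
              union {
                      struct page *page;
                      void *addr;                     /* umem frame address */
              };
              union {
                      struct {
      #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
                              __u32 page_offset;
      #else
                              __u16 page_offset;
      #endif
                              __u16 pagecnt_bias;
                      };
                      u64 handle;                     /* umem frame handle */
              };
      };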

> +       };
>  };
>
>  struct i40e_queue_stats {
> @@ -414,6 +423,12 @@ struct i40e_ring {
>
>         struct i40e_channel *ch;
>         struct xdp_rxq_info xdp_rxq;
> +
> +       int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
> +       bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
> +       struct xdp_umem *xsk_umem;
> +
> +       struct zero_copy_allocator zca; /* ZC allocator anchor */
>  } ____cacheline_internodealigned_in_smp;
>
>  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> @@ -490,6 +505,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb);
>  int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>                   u32 flags);
>  void i40e_xdp_flush(struct net_device *dev);
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -576,4 +592,108 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
>  {
>         return netdev_get_tx_queue(ring->netdev, ring->queue_index);
>  }
> +
> +#define I40E_XDP_PASS 0
> +#define I40E_XDP_CONSUMED 1
> +#define I40E_XDP_TX 2
> +
> +/**
> + * i40e_release_rx_desc - Store the new tail and head values
> + * @rx_ring: ring to bump
> + * @val: new head index
> + **/
> +static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
> +{
> +       rx_ring->next_to_use = val;
> +
> +       /* update next to alloc since we have filled the ring */
> +       rx_ring->next_to_alloc = val;
> +
> +       /* Force memory writes to complete before letting h/w
> +        * know there are new descriptors to fetch.  (Only
> +        * applicable for weak-ordered memory model archs,
> +        * such as IA-64).
> +        */
> +       wmb();
> +       writel(val, rx_ring->tail);
> +}
> +
> +/**
> + * i40e_rx_is_programming_status - check for programming status descriptor
> + * @qw: qword representing status_error_len in CPU ordering
> + *
> + * The value of in the descriptor length field indicate if this
> + * is a programming status descriptor for flow director or FCoE
> + * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
> + * it is a packet descriptor.
> + **/
> +static inline bool i40e_rx_is_programming_status(u64 qw)
> +{
> +       /* The Rx filter programming status and SPH bit occupy the same
> +        * spot in the descriptor. Since we don't support packet split we
> +        * can just reuse the bit as an indication that this is a
> +        * programming status descriptor.
> +        */
> +       return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
> +}
> +
> +/**
> + * i40e_cleanup_headers - Correct empty headers
> + * @rx_ring: rx descriptor ring packet is being transacted on
> + * @skb: pointer to current skb being fixed
> + * @rx_desc: pointer to the EOP Rx descriptor
> + *
> + * Also address the case where we are pulling data in on pages only
> + * and as such no data is present in the skb header.
> + *
> + * In addition if skb is not at least 60 bytes we need to pad it so that
> + * it is large enough to qualify as a valid Ethernet frame.
> + *
> + * Returns true if an error was encountered and skb was freed.
> + **/
> +static inline bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
> +                                       struct sk_buff *skb,
> +                                       union i40e_rx_desc *rx_desc)
> +
> +{
> +       /* XDP packets use error pointer so abort at this point */
> +       if (IS_ERR(skb))
> +               return true;
> +
> +       /* ERR_MASK will only have valid bits if EOP set, and
> +        * what we are doing here is actually checking
> +        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> +        * the error field
> +        */
> +       if (unlikely(i40e_test_staterr(rx_desc,
> +                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> +               dev_kfree_skb_any(skb);
> +               return true;
> +       }
> +
> +       /* if eth_skb_pad returns an error the skb was freed */
> +       if (eth_skb_pad(skb))
> +               return true;
> +
> +       return false;
> +}
> +
> +static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> +{
> +       /* Force memory writes to complete before letting h/w
> +        * know there are new descriptors to fetch.
> +        */
> +       wmb();
> +       writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
> +}
> +
> +void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> +                          union i40e_rx_desc *rx_desc, u8 prog_id);
> +int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> +                         struct i40e_ring *xdp_ring);
> +void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> +                            union i40e_rx_desc *rx_desc, struct sk_buff *skb,
> +                            u8 rx_ptype);
> +void i40e_receive_skb(struct i40e_ring *rx_ring,
> +                     struct sk_buff *skb, u16 vlan_tag);
>  #endif /* _I40E_TXRX_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> new file mode 100644
> index 000000000000..9d16924415b9
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -0,0 +1,537 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2018 Intel Corporation. */
> +
> +#include <linux/bpf_trace.h>
> +#include <net/xdp_sock.h>
> +#include <net/xdp.h>
> +
> +#include "i40e.h"
> +#include "i40e_txrx.h"
> +
> +static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
> +{
> +       if (vsi->xsk_umems)
> +               return 0;
> +
> +       vsi->num_xsk_umems_used = 0;
> +       vsi->num_xsk_umems = vsi->alloc_queue_pairs;
> +       vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
> +                                GFP_KERNEL);
> +       if (!vsi->xsk_umems) {
> +               vsi->num_xsk_umems = 0;
> +               return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                            u16 qid)
> +{
> +       int err;
> +
> +       err = i40e_alloc_xsk_umems(vsi);
> +       if (err)
> +               return err;
> +
> +       vsi->xsk_umems[qid] = umem;
> +       vsi->num_xsk_umems_used++;
> +
> +       return 0;
> +}
> +
> +static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
> +{
> +       vsi->xsk_umems[qid] = NULL;
> +       vsi->num_xsk_umems_used--;
> +
> +       if (vsi->num_xsk_umems == 0) {
> +               kfree(vsi->xsk_umems);
> +               vsi->xsk_umems = NULL;
> +               vsi->num_xsk_umems = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i, j;
> +       dma_addr_t dma;
> +
> +       dev = &pf->pdev->dev;
> +       for (i = 0; i < umem->npgs; i++) {
> +               dma = dma_map_page_attrs(dev, umem->pgs[i], 0, PAGE_SIZE,
> +                                        DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +               if (dma_mapping_error(dev, dma))
> +                       goto out_unmap;
> +
> +               umem->pages[i].dma = dma;
> +       }
> +
> +       return 0;
> +
> +out_unmap:
> +       for (j = 0; j < i; j++) {
> +               dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
> +                                    DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +               umem->pages[i].dma = 0;
> +       }
> +
> +       return -1;
> +}
> +
> +static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i;
> +
> +       dev = &pf->pdev->dev;
> +
> +       for (i = 0; i < umem->npgs; i++) {
> +               dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
> +                                    DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +
> +               umem->pages[i].dma = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                               u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (vsi->type != I40E_VSI_MAIN)
> +               return -EINVAL;
> +
> +       if (qid >= vsi->num_queue_pairs)
> +               return -EINVAL;
> +
> +       if (vsi->xsk_umems && vsi->xsk_umems[qid])
> +               return -EBUSY;
> +
> +       err = i40e_xsk_umem_dma_map(vsi, umem);
> +       if (err)
> +               return err;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       err = i40e_add_xsk_umem(vsi, umem, qid);
> +       if (err)
> +               return err;
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
> +           !vsi->xsk_umems[qid])
> +               return -EINVAL;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
> +       i40e_remove_xsk_umem(vsi, qid);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                       u16 qid)
> +{
> +       if (umem)
> +               return i40e_xsk_umem_enable(vsi, umem, qid);
> +
> +       return i40e_xsk_umem_disable(vsi, qid);
> +}
> +
> +static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
> +                                      struct xdp_buff *xdp)
> +{
> +       int err, result = I40E_XDP_PASS;
> +       struct i40e_ring *xdp_ring;
> +       struct bpf_prog *xdp_prog;
> +       u32 act;
> +       u16 off;
> +
> +       rcu_read_lock();
> +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> +       act = bpf_prog_run_xdp(xdp_prog, xdp);
> +       off = xdp->data - xdp->data_hard_start;
> +       xdp->handle += off;
> +       switch (act) {
> +       case XDP_PASS:
> +               break;
> +       case XDP_TX:
> +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +               result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
> +               break;
> +       case XDP_REDIRECT:
> +               err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> +               result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> +               break;
> +       default:
> +               bpf_warn_invalid_xdp_action(act);
> +       case XDP_ABORTED:
> +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
> +               /* fallthrough -- handle aborts by dropping packet */
> +       case XDP_DROP:
> +               result = I40E_XDP_CONSUMED;
> +               break;
> +       }
> +
> +       rcu_read_unlock();
> +       return ERR_PTR(-result);
> +}
> +
> +static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
> +                               struct i40e_rx_buffer *bi)
> +{
> +       struct xdp_umem *umem = rx_ring->xsk_umem;
> +       void *addr = bi->addr;
> +       u64 handle;
> +
> +       if (addr) {
> +               rx_ring->rx_stats.page_reuse_count++;
> +               return true;
> +       }
> +
> +       if (!xsk_umem_peek_addr(umem, &handle)) {
> +               rx_ring->rx_stats.alloc_page_failed++;
> +               return false;
> +       }
> +
> +       bi->dma = xdp_umem_get_dma(umem, handle);
> +       bi->addr = xdp_umem_get_data(umem, handle);
> +
> +       bi->dma += umem->headroom + XDP_PACKET_HEADROOM;
> +       bi->addr += umem->headroom + XDP_PACKET_HEADROOM;
> +       bi->handle = handle + umem->headroom;
> +
> +       xsk_umem_discard_addr(umem);
> +       return true;
> +}
> +
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
> +{
> +       u16 ntu = rx_ring->next_to_use;
> +       union i40e_rx_desc *rx_desc;
> +       struct i40e_rx_buffer *bi;
> +
> +       rx_desc = I40E_RX_DESC(rx_ring, ntu);
> +       bi = &rx_ring->rx_bi[ntu];
> +
> +       do {
> +               if (!i40e_alloc_frame_zc(rx_ring, bi))
> +                       goto no_buffers;
> +
> +               /* sync the buffer for use by the device */
> +               dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
> +                                                rx_ring->rx_buf_len,
> +                                                DMA_BIDIRECTIONAL);
> +
> +               /* Refresh the desc even if buffer_addrs didn't change
> +                * because each write-back erases this info.
> +                */
> +               rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
> +
> +               rx_desc++;
> +               bi++;
> +               ntu++;
> +               if (unlikely(ntu == rx_ring->count)) {
> +                       rx_desc = I40E_RX_DESC(rx_ring, 0);
> +                       bi = rx_ring->rx_bi;
> +                       ntu = 0;
> +               }
> +
> +               /* clear the status bits for the next_to_use descriptor */
> +               rx_desc->wb.qword1.status_error_len = 0;
> +
> +               cleaned_count--;
> +       } while (cleaned_count);
> +
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       return false;
> +
> +no_buffers:
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       /* make sure to come back via polling to try again after
> +        * allocation failure
> +        */
> +       return true;
> +}
> +
> +static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                                   const unsigned int size)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +
> +       rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> +
> +       /* we are reusing so sync this buffer for CPU use */
> +       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                     rx_buffer->dma, 0,
> +                                     size,
> +                                     DMA_BIDIRECTIONAL);
> +
> +       return rx_buffer;
> +}
> +
> +static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                   struct i40e_rx_buffer *old_buff)
> +{
> +       u64 mask = rx_ring->xsk_umem->props.chunk_mask;
> +       u64 hr = rx_ring->xsk_umem->headroom;
> +       u16 nta = rx_ring->next_to_alloc;
> +       struct i40e_rx_buffer *new_buff;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       /* transfer page from old buffer to new buffer */
> +       new_buff->dma           = old_buff->dma & mask;
> +       new_buff->addr          = (void *)((u64)old_buff->addr & mask);
> +       new_buff->handle        = old_buff->handle & mask;
> +
> +       new_buff->dma += hr + XDP_PACKET_HEADROOM;
> +       new_buff->addr += hr + XDP_PACKET_HEADROOM;
> +       new_buff->handle += hr;
> +}
> +
> +/* Called from the XDP return API in NAPI context. */
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
> +{
> +       struct i40e_rx_buffer *new_buff;
> +       struct i40e_ring *rx_ring;
> +       u64 mask;
> +       u16 nta;
> +
> +       rx_ring = container_of(alloc, struct i40e_ring, zca);
> +       mask = rx_ring->xsk_umem->props.chunk_mask;
> +
> +       nta = rx_ring->next_to_alloc;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       handle &= mask;
> +
> +       new_buff->dma           = xdp_umem_get_dma(rx_ring->xsk_umem, handle);
> +       new_buff->addr          = xdp_umem_get_data(rx_ring->xsk_umem, handle);
> +       new_buff->handle        = (u64)handle;
> +
> +       new_buff->dma += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> +       new_buff->addr += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> +       new_buff->handle += rx_ring->xsk_umem->headroom;
> +}
> +
> +static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
> +                                           struct i40e_rx_buffer *rx_buffer,
> +                                           struct xdp_buff *xdp)
> +{
> +       /* XXX implement alloc skb and copy */
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       return NULL;
> +}
> +
> +static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
> +                                            union i40e_rx_desc *rx_desc,
> +                                            u64 qw)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +       u32 ntc = rx_ring->next_to_clean;
> +       u8 id;
> +
> +       /* fetch, update, and store next to clean */
> +       rx_buffer = &rx_ring->rx_bi[ntc++];
> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
> +       rx_ring->next_to_clean = ntc;
> +
> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +       /* place unused page back on the ring */
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       rx_ring->rx_stats.page_reuse_count++;
> +
> +       /* clear contents of buffer_info */
> +       rx_buffer->addr = NULL;
> +
> +       id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
> +                 I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
> +
> +       if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> +               i40e_fd_handle_status(rx_ring, rx_desc, id);
> +}
> +
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> +{
> +       unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> +       bool failure = false, xdp_xmit = false;
> +       struct sk_buff *skb;
> +       struct xdp_buff xdp;
> +
> +       xdp.rxq = &rx_ring->xdp_rxq;
> +
> +       while (likely(total_rx_packets < (unsigned int)budget)) {
> +               struct i40e_rx_buffer *rx_buffer;
> +               union i40e_rx_desc *rx_desc;
> +               unsigned int size;
> +               u16 vlan_tag;
> +               u8 rx_ptype;
> +               u64 qword;
> +               u32 ntc;
> +
> +               /* return some buffers to hardware, one at a time is too slow */
> +               if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> +                       failure = failure ||
> +                                 i40e_alloc_rx_buffers_zc(rx_ring,
> +                                                          cleaned_count);
> +                       cleaned_count = 0;
> +               }
> +
> +               rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
> +
> +               /* status_error_len will always be zero for unused descriptors
> +                * because it's cleared in cleanup, and overlaps with hdr_addr
> +                * which is always zero because packet split isn't used, if the
> +                * hardware wrote DD then the length will be non-zero
> +                */
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +
> +               /* This memory barrier is needed to keep us from reading
> +                * any other fields out of the rx_desc until we have
> +                * verified the descriptor has been written back.
> +                */
> +               dma_rmb();
> +
> +               if (unlikely(i40e_rx_is_programming_status(qword))) {
> +                       i40e_clean_programming_status_zc(rx_ring, rx_desc,
> +                                                        qword);
> +                       cleaned_count++;
> +                       continue;
> +               }
> +               size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> +                      I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
> +               if (!size)
> +                       break;
> +
> +               rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
> +
> +               /* retrieve a buffer from the ring */
> +               xdp.data = rx_buffer->addr;
> +               xdp_set_data_meta_invalid(&xdp);
> +               xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
> +               xdp.data_end = xdp.data + size;
> +               xdp.handle = rx_buffer->handle;
> +
> +               skb = i40e_run_xdp_zc(rx_ring, &xdp);
> +
> +               if (IS_ERR(skb)) {
> +                       if (PTR_ERR(skb) == -I40E_XDP_TX)
> +                               xdp_xmit = true;
> +                       else
> +                               i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +                       total_rx_bytes += size;
> +                       total_rx_packets++;
> +               } else {
> +                       skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
> +                       if (!skb) {
> +                               rx_ring->rx_stats.alloc_buff_failed++;
> +                               break;
> +                       }
> +               }
> +
> +               rx_buffer->addr = NULL;
> +               cleaned_count++;
> +
> +               /* don't care about non-EOP frames in XDP mode */
> +               ntc = rx_ring->next_to_clean + 1;
> +               ntc = (ntc < rx_ring->count) ? ntc : 0;
> +               rx_ring->next_to_clean = ntc;
> +               prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +               if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
> +                       skb = NULL;
> +                       continue;
> +               }
> +
> +               /* probably a little skewed due to removing CRC */
> +               total_rx_bytes += skb->len;
> +
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +               rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
> +                          I40E_RXD_QW1_PTYPE_SHIFT;
> +
> +               /* populate checksum, VLAN, and protocol */
> +               i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
> +
> +               vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
> +                          le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
> +
> +               i40e_receive_skb(rx_ring, skb, vlan_tag);
> +               skb = NULL;
> +
> +               /* update budget accounting */
> +               total_rx_packets++;
> +       }
> +
> +       if (xdp_xmit) {
> +               struct i40e_ring *xdp_ring =
> +                       rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +               xdp_do_flush_map();
> +       }
> +
> +       u64_stats_update_begin(&rx_ring->syncp);
> +       rx_ring->stats.packets += total_rx_packets;
> +       rx_ring->stats.bytes += total_rx_bytes;
> +       u64_stats_update_end(&rx_ring->syncp);
> +       rx_ring->q_vector->rx.total_packets += total_rx_packets;
> +       rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
> +
> +       /* guarantee a trip back through this routine if there was a failure */
> +       return failure ? budget : (int)total_rx_packets;
> +}
> +

You should really look at adding comments to the code you are adding.
From what I can tell almost all of the code comments were just copied
exactly from the original functions in the i40e_txrx.c file.
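
For example, a kernel-doc style header on the new Rx clean function
that spells out the zero-copy specific behaviour (the wording below is
only illustrative):

      /**
       * i40e_clean_rx_irq_zc - Consume Rx descriptors backed by a UMEM
       * @rx_ring: Rx ring with an attached xsk_umem
       * @budget: NAPI budget
       *
       * Unlike i40e_clean_rx_irq(), buffers are taken from the UMEM fill
       * queue, frames are passed to AF_XDP sockets via XDP_REDIRECT, and
       * XDP_PASS (copy to an skb) is not yet supported.
       *
       * Returns the number of packets processed, or the full budget if
       * buffer allocation failed, to guarantee another trip through NAPI.
       **/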

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> new file mode 100644
> index 000000000000..757ac5ca8511
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2018 Intel Corporation. */
> +
> +#ifndef _I40E_XSK_H_
> +#define _I40E_XSK_H_
> +
> +struct i40e_vsi;
> +struct xdp_umem;
> +struct zero_copy_allocator;
> +
> +int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                       u16 qid);
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +
> +#endif /* _I40E_XSK_H_ */
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 9fe472f2ac95..ec8fd3314097 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -94,6 +94,25 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
>  {
>         return false;
>  }
> +
> +static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
> +{
> +       return NULL;
> +}
> +
> +static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
> +{
> +}
>  #endif /* CONFIG_XDP_SOCKETS */
>
> +static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> +{
> +       return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> +}
> +
> +static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
> +{
> +       return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> +}
> +
>  #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index f11560334f88..c8be1ad3eb88 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -8,16 +8,6 @@
>
>  #include <net/xdp_sock.h>
>
> -static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> -{
> -       return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> -}
> -
> -static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
> -{
> -       return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> -}
> -
>  int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
>                         u32 queue_id, u16 flags);
>  bool xdp_umem_validate_queues(struct xdp_umem *umem);
> --
> 2.14.1
>


* Re: [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
  2018-06-04 12:06 ` [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx Björn Töpel
@ 2018-06-04 20:53   ` Alexander Duyck
  2018-06-05 12:43   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 22+ messages in thread
From: Alexander Duyck @ 2018-06-04 20:53 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, mykyta.iziumtsev, John Fastabend, Willem de Bruijn,
	Michael S. Tsirkin, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, qi.z.zhang, francois.ozog, ilias.apalodimas,
	brian.brooks, Andy Gospodarek, Michael Chan

On Mon, Jun 4, 2018 at 5:06 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This will result in other devices
> doing XDP_REDIRECT to an AF_XDP enabled queue having their packets
> dropped.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |  93 +++++++++++-------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  23 +++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 140 ++++++++++++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   2 +
>  include/net/xdp_sock.h                      |  14 +++
>  6 files changed, 242 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 8c602424d339..98c18c41809d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3073,8 +3073,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> -       if (ring_is_xdp(ring))
> +       ring->clean_tx_irq = i40e_clean_tx_irq;
> +       if (ring_is_xdp(ring)) {
>                 ring->xsk_umem = i40e_xsk_umem(ring);
> +               if (ring->xsk_umem)
> +                       ring->clean_tx_irq = i40e_clean_tx_irq_zc;

Again, I am worried about what the performance penalty of this will be
given the retpoline penalty for function pointers.

> +       }
>
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
> @@ -12162,6 +12166,7 @@ static const struct net_device_ops i40e_netdev_ops = {
>         .ndo_bpf                = i40e_xdp,
>         .ndo_xdp_xmit           = i40e_xdp_xmit,
>         .ndo_xdp_flush          = i40e_xdp_flush,
> +       .ndo_xsk_async_xmit     = i40e_xsk_async_xmit,
>  };
>
>  /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 6b1142fbc697..923bb84a93ab 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -10,16 +10,6 @@
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
>
> -static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> -                               u32 td_tag)
> -{
> -       return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> -                          ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> -                          ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> -                          ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> -                          ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> -}
> -
>  #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
>  /**
>   * i40e_fdir - Generate a Flow Director descriptor based on fdata
> @@ -649,9 +639,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
>         if (!tx_ring->tx_bi)
>                 return;
>
> -       /* Free all the Tx ring sk_buffs */
> -       for (i = 0; i < tx_ring->count; i++)
> -               i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
> +       /* Cleanup only needed for non XSK TX ZC rings */
> +       if (!tx_ring->xsk_umem) {
> +               /* Free all the Tx ring sk_buffs */
> +               for (i = 0; i < tx_ring->count; i++)
> +                       i40e_unmap_and_free_tx_resource(tx_ring,
> +                                                       &tx_ring->tx_bi[i]);
> +       }
>
>         bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
>         memset(tx_ring->tx_bi, 0, bi_size);
> @@ -768,8 +762,40 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
>         }
>  }
>
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> +                         unsigned int total_packets,
> +                         unsigned int total_bytes)
> +{
> +       u64_stats_update_begin(&tx_ring->syncp);
> +       tx_ring->stats.bytes += total_bytes;
> +       tx_ring->stats.packets += total_packets;
> +       u64_stats_update_end(&tx_ring->syncp);
> +       tx_ring->q_vector->tx.total_bytes += total_bytes;
> +       tx_ring->q_vector->tx.total_packets += total_packets;
> +}
> +
>  #define WB_STRIDE 4
>
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +                struct i40e_vsi *vsi,
> +                int budget)
> +{
> +       if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> +               /* check to see if there are < 4 descriptors
> +                * waiting to be written back, then kick the hardware to force
> +                * them to be written back in case we stay in NAPI.
> +                * In this mode on X722 we do not enable Interrupt.
> +                */
> +               unsigned int j = i40e_get_tx_pending(tx_ring, false);
> +
> +               if (budget &&
> +                   ((j / WB_STRIDE) == 0) && (j > 0) &&
> +                   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
> +                   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
> +                       tx_ring->arm_wb = true;
> +       }
> +}
> +
>  /**
>   * i40e_clean_tx_irq - Reclaim resources after transmit completes
>   * @vsi: the VSI we care about
> @@ -778,8 +804,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
>   *
>   * Returns true if there's any budget left (e.g. the clean is finished)
>   **/
> -static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> -                             struct i40e_ring *tx_ring, int napi_budget)
> +bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> +                      struct i40e_ring *tx_ring, int napi_budget)
>  {
>         u16 i = tx_ring->next_to_clean;
>         struct i40e_tx_buffer *tx_buf;
> @@ -874,27 +900,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
>
>         i += tx_ring->count;
>         tx_ring->next_to_clean = i;
> -       u64_stats_update_begin(&tx_ring->syncp);
> -       tx_ring->stats.bytes += total_bytes;
> -       tx_ring->stats.packets += total_packets;
> -       u64_stats_update_end(&tx_ring->syncp);
> -       tx_ring->q_vector->tx.total_bytes += total_bytes;
> -       tx_ring->q_vector->tx.total_packets += total_packets;
> -
> -       if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> -               /* check to see if there are < 4 descriptors
> -                * waiting to be written back, then kick the hardware to force
> -                * them to be written back in case we stay in NAPI.
> -                * In this mode on X722 we do not enable Interrupt.
> -                */
> -               unsigned int j = i40e_get_tx_pending(tx_ring, false);
>
> -               if (budget &&
> -                   ((j / WB_STRIDE) == 0) && (j > 0) &&
> -                   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
> -                   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
> -                       tx_ring->arm_wb = true;
> -       }
> +       i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
> +       i40e_arm_wb(tx_ring, vsi, budget);
>
>         if (ring_is_xdp(tx_ring))
>                 return !!budget;
> @@ -2467,10 +2475,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>          * budget and be more aggressive about cleaning up the Tx descriptors.
>          */
>         i40e_for_each_ring(ring, q_vector->tx) {
> -               if (!i40e_clean_tx_irq(vsi, ring, budget)) {
> +               if (!ring->clean_tx_irq(vsi, ring, budget)) {
>                         clean_complete = false;
>                         continue;
>                 }
> +
>                 arm_wb |= ring->arm_wb;
>                 ring->arm_wb = false;
>         }
> @@ -3595,6 +3604,12 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return -ENXIO;
>
> +       /* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
> +        * will drop incoming packets redirected by other devices!
> +        */
> +       if (vsi->xdp_rings[queue_index]->xsk_umem)
> +               return -ENXIO;
> +
>         if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>                 return -EINVAL;
>
> @@ -3633,5 +3648,11 @@ void i40e_xdp_flush(struct net_device *dev)
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return;
>
> +       /* NB! For now, AF_XDP zero-copy hijacks the XDP ring, and
> +        * will drop incoming packets redirected by other devices!
> +        */
> +       if (vsi->xdp_rings[queue_index]->xsk_umem)
> +               return;
> +
>         i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>  }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index cddb185cd2f8..b9c42c352a8d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -426,6 +426,8 @@ struct i40e_ring {
>
>         int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
>         bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
> +       bool (*clean_tx_irq)(struct i40e_vsi *vsi, struct i40e_ring *ring,
> +                            int budget);
>         struct xdp_umem *xsk_umem;
>
>         struct zero_copy_allocator zca; /* ZC allocator anchor */
> @@ -506,6 +508,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>                   u32 flags);
>  void i40e_xdp_flush(struct net_device *dev);
>  int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
> +bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> +                      struct i40e_ring *tx_ring, int napi_budget);
> +int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -687,6 +692,16 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>         writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
>  }
>
> +static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> +                               u32 td_tag)
> +{
> +       return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> +                          ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> +                          ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> +                          ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> +                          ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> +}
> +
>  void i40e_fd_handle_status(struct i40e_ring *rx_ring,
>                            union i40e_rx_desc *rx_desc, u8 prog_id);
>  int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> @@ -696,4 +711,12 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
>                              u8 rx_ptype);
>  void i40e_receive_skb(struct i40e_ring *rx_ring,
>                       struct sk_buff *skb, u16 vlan_tag);
> +
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> +                         unsigned int total_packets,
> +                         unsigned int total_bytes);
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +                struct i40e_vsi *vsi,
> +                int budget);
> +
>  #endif /* _I40E_TXRX_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9d16924415b9..021fec5b5799 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -535,3 +535,143 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>         return failure ? budget : (int)total_rx_packets;
>  }
>
> +/* Returns true if the work is finished */
> +static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> +{
> +       unsigned int total_packets = 0, total_bytes = 0;
> +       struct i40e_tx_buffer *tx_bi;
> +       struct i40e_tx_desc *tx_desc;
> +       bool work_done = true;
> +       dma_addr_t dma;
> +       u32 len;
> +
> +       while (budget-- > 0) {
> +               if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
> +                       xdp_ring->tx_stats.tx_busy++;
> +                       work_done = false;
> +                       break;
> +               }
> +
> +               if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
> +                       break;
> +
> +               dma_sync_single_for_device(xdp_ring->dev, dma, len,
> +                                          DMA_BIDIRECTIONAL);
> +
> +               tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
> +               tx_bi->bytecount = len;
> +               tx_bi->gso_segs = 1;
> +
> +               tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> +               tx_desc->buffer_addr = cpu_to_le64(dma);
> +               tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
> +                                                       | I40E_TX_DESC_CMD_EOP,
> +                                                         0, len, 0);
> +
> +               total_packets++;
> +               total_bytes += len;
> +
> +               xdp_ring->next_to_use++;
> +               if (xdp_ring->next_to_use == xdp_ring->count)
> +                       xdp_ring->next_to_use = 0;
> +       }
> +
> +       if (total_packets > 0) {
> +               /* Request an interrupt for the last frame and bump tail ptr. */
> +               tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
> +                                                I40E_TXD_QW1_CMD_SHIFT);
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +
> +               xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> +               i40e_update_tx_stats(xdp_ring, total_packets, total_bytes);
> +       }
> +

So this code is likely vulnerable to an issue we were seeing where the
Tx was stalling and surging when xmit_more was in use. We found that the
issue was that we were only setting the RS bit once per ring fill. As a
result the ring was either full or empty from the driver's perspective,
which leads to poor Tx performance when it occurs. As such you may want
to set the RS bit at least twice per fill: go through the lesser of half
the ring size or the budget, set the RS bit, and repeat with whatever
budget you have remaining. That way the ring should on average be 50%
utilized instead of either 100% or 0%.
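
Something along these lines might do it (completely untested;
i40e_xmit_zc_batched and rs_thresh are just illustrative names, reusing
the helpers already introduced in this patch):

static bool i40e_xmit_zc_batched(struct i40e_ring *xdp_ring,
                                 unsigned int budget)
{
        u32 rs_thresh = min_t(u32, xdp_ring->count / 2, budget);
        unsigned int total_packets = 0, total_bytes = 0;
        struct i40e_tx_desc *tx_desc = NULL;
        struct i40e_tx_buffer *tx_bi;
        bool work_done = true;
        u32 since_rs = 0;
        dma_addr_t dma;
        u32 len;

        while (budget-- > 0) {
                if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
                        xdp_ring->tx_stats.tx_busy++;
                        work_done = false;
                        break;
                }

                if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
                        break;

                dma_sync_single_for_device(xdp_ring->dev, dma, len,
                                           DMA_BIDIRECTIONAL);

                tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
                tx_bi->bytecount = len;
                tx_bi->gso_segs = 1;

                tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
                tx_desc->buffer_addr = cpu_to_le64(dma);
                tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
                                                          I40E_TX_DESC_CMD_EOP,
                                                          0, len, 0);

                /* Request a write-back mid-fill so the ring starts
                 * draining while we keep producing descriptors.
                 */
                if (++since_rs >= rs_thresh) {
                        tx_desc->cmd_type_offset_bsz |=
                                (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
                        since_rs = 0;
                }

                total_packets++;
                total_bytes += len;

                xdp_ring->next_to_use++;
                if (xdp_ring->next_to_use == xdp_ring->count)
                        xdp_ring->next_to_use = 0;
        }

        if (total_packets > 0) {
                /* RS on the very last frame as well, then bump the tail. */
                tx_desc->cmd_type_offset_bsz |=
                        (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
                i40e_xdp_ring_update_tail(xdp_ring);

                xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
                i40e_update_tx_stats(xdp_ring, total_packets, total_bytes);
        }

        return !!budget && work_done;
}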

> +       return !!budget && work_done;
> +}
> +
> +bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
> +                         struct i40e_ring *tx_ring, int napi_budget)
> +{
> +       struct xdp_umem *umem = tx_ring->xsk_umem;
> +       u32 head_idx = i40e_get_head(tx_ring);
> +       unsigned int budget = vsi->work_limit;
> +       bool work_done = true, xmit_done;
> +       u32 completed_frames;
> +       u32 frames_ready;
> +
> +       if (head_idx < tx_ring->next_to_clean)
> +               head_idx += tx_ring->count;
> +       frames_ready = head_idx - tx_ring->next_to_clean;
> +
> +       if (frames_ready == 0) {
> +               goto out_xmit;
> +       } else if (frames_ready > budget) {
> +               completed_frames = budget;
> +               work_done = false;
> +       } else {
> +               completed_frames = frames_ready;
> +       }
> +
> +       tx_ring->next_to_clean += completed_frames;
> +       if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
> +               tx_ring->next_to_clean -= tx_ring->count;
> +
> +       xsk_umem_complete_tx(umem, completed_frames);
> +
> +       i40e_arm_wb(tx_ring, vsi, budget);
> +
> +out_xmit:
> +       xmit_done = i40e_xmit_zc(tx_ring, budget);
> +
> +       return work_done && xmit_done;
> +}

I am not a fan of using head write-back. This code just seems shaky at
best to me. Am I understanding correctly that you are using the Tx
cleanup to transmit frames?

> +
> +/**
> + * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
> + * @n: napi context
> + *
> + * Returns true if NAPI is scheduled.
> + **/
> +static bool i40e_napi_is_scheduled(struct napi_struct *n)
> +{
> +       unsigned long val, new;
> +
> +       do {
> +               val = READ_ONCE(n->state);
> +               if (val & NAPIF_STATE_DISABLE)
> +                       return true;
> +
> +               if (!(val & NAPIF_STATE_SCHED))
> +                       return false;
> +
> +               new = val | NAPIF_STATE_MISSED;
> +       } while (cmpxchg(&n->state, val, new) != val);
> +

This code does not belong here. This is core kernel code, not anything
driver-specific. It is probably not needed if you drop the call below.

> +       return true;
> +}
> +
> +int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
> +{
> +       struct i40e_netdev_priv *np = netdev_priv(dev);
> +       struct i40e_vsi *vsi = np->vsi;
> +       struct i40e_ring *ring;
> +
> +       if (test_bit(__I40E_VSI_DOWN, vsi->state))
> +               return -ENETDOWN;
> +
> +       if (!i40e_enabled_xdp_vsi(vsi))
> +               return -ENXIO;
> +
> +       if (queue_id >= vsi->num_queue_pairs)
> +               return -ENXIO;
> +
> +       if (!vsi->xdp_rings[queue_id]->xsk_umem)
> +               return -ENXIO;
> +
> +       ring = vsi->xdp_rings[queue_id];
> +
> +       if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
> +               i40e_force_wb(vsi, ring->q_vector);

We really shouldn't have napi being scheduled by the Tx path.

> +
> +       return 0;
> +}

Again, more comments might be helpful here.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> index 757ac5ca8511..bd006f1a4397 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> @@ -13,5 +13,7 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
>  void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
>  bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
>  int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
> +                         struct i40e_ring *tx_ring, int napi_budget);
>
>  #endif /* _I40E_XSK_H_ */
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index ec8fd3314097..63aa05abf11d 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -103,6 +103,20 @@ static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>  static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
>  {
>  }
> +
> +static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
> +{
> +}
> +
> +static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
> +                                      u32 *len)
> +{
> +       return false;
> +}
> +
> +static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
> +{
> +}
>  #endif /* CONFIG_XDP_SOCKETS */
>
>  static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> --
> 2.14.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
  2018-06-04 12:06 ` [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx Björn Töpel
  2018-06-04 20:53   ` Alexander Duyck
@ 2018-06-05 12:43   ` Jesper Dangaard Brouer
  2018-06-05 13:07     ` Björn Töpel
  1 sibling, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2018-06-05 12:43 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, daniel, netdev, mykyta.iziumtsev,
	john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan, brouer

On Mon,  4 Jun 2018 14:06:00 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This means that other devices
> doing XDP_REDIRECT to an AF_XDP enabled queue will have their packets
> dropped.

This behavior is problematic, because XDP Tx rings are smp_processor_id
based, and several RX queues can (via proc smp_affinity settings) be
assigned to the same CPU. Thus, other RX queues (than the AF_XDP
enabled queue) can experience packet drops.  And other devices doing
redirect through i40e that happen to run on a CPU whose XDP Tx queue
is "hijacked" will see dropped packets.

Any plans to allocate/create a dedicated TX ring per AF_XDP socket?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx
  2018-06-05 12:43   ` Jesper Dangaard Brouer
@ 2018-06-05 13:07     ` Björn Töpel
  0 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-05 13:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann, Netdev,
	MykytaI Iziumtsev, John Fastabend, Willem de Bruijn,
	Michael S. Tsirkin, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, Francois Ozog, Ilias Apalodimas,
	Brian Brooks

Den tis 5 juni 2018 kl 14:44 skrev Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Mon,  4 Jun 2018 14:06:00 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
> > XDP Tx rings are used for zero-copy. This means that other devices
> > doing XDP_REDIRECT to an AF_XDP enabled queue will have their packets
> > dropped.
>
> This behavior is problematic, because XDP Tx rings are smp_processor_id
> based, and several RX queues can (via proc smp_affinity settings) be
> assigned to the same CPU. Thus, other RX queues (than the AF_XDP
> enabled queue) can experience packet drops.  And other devices doing
> redirect through i40e that happen to run on a CPU whose XDP Tx queue
> is "hijacked" will see dropped packets.
>
> Any plans to allocate/create a dedicated TX ring per AF_XDP socket?
>

Yes -- again, this was a shortcut, and it must be addressed (for all the
reasons above).

Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx
  2018-06-04 20:35   ` Alexander Duyck
@ 2018-06-07  7:40     ` Björn Töpel
  0 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2018-06-07  7:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, MykytaI Iziumtsev, Björn Töpel, John Fastabend,
	Willem de Bruijn, Michael S. Tsirkin, michael.lundkvist,
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z, Francois Ozog

Den mån 4 juni 2018 kl 22:35 skrev Alexander Duyck <alexander.duyck@gmail.com>:
>
> On Mon, Jun 4, 2018 at 5:05 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This commit adds initial AF_XDP zero-copy support for i40e-based
> > NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
> > XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
> > to pass a UMEM to the driver. The driver will then DMA map all the
> > frames in the UMEM. Next, the Rx code will allocate frames from the
> > UMEM fill queue, instead of the regular page allocator.
> >
> > Externally, for the rest of the XDP code, the driver internal UMEM
> > allocator will appear as a MEM_TYPE_ZERO_COPY.
> >
> > The commit also introduces completely new clean_rx_irq/allocator
> > functions for zero-copy, and means (function pointers) to set the
> > allocator and clean_rx functions.
> >
> > This first version does not support:
> > * passing frames to the stack via XDP_PASS (clone/copy to skb).
> > * doing XDP redirect to other than AF_XDP sockets
> >   (convert_to_xdp_frame does not clone the frame yet).
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
> >  drivers/net/ethernet/intel/i40e/i40e.h      |  23 ++
> >  drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++-------
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++++++-
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 ++++++++++++++++++++++++++++
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
> >  include/net/xdp_sock.h                      |  19 +
> >  net/xdp/xdp_umem.h                          |  10 -
> >  9 files changed, 789 insertions(+), 146 deletions(-)
> >  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
> >  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
> > index 14397e7e9925..50590e8d1fd1 100644
> > --- a/drivers/net/ethernet/intel/i40e/Makefile
> > +++ b/drivers/net/ethernet/intel/i40e/Makefile
> > @@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
> >         i40e_txrx.o     \
> >         i40e_ptp.o      \
> >         i40e_client.o   \
> > -       i40e_virtchnl_pf.o
> > +       i40e_virtchnl_pf.o \
> > +       i40e_xsk.o
> >
> >  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> > index 7a80652e2500..20955e5dce02 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> > @@ -786,6 +786,12 @@ struct i40e_vsi {
> >
> >         /* VSI specific handlers */
> >         irqreturn_t (*irq_handler)(int irq, void *data);
> > +
> > +       /* AF_XDP zero-copy */
> > +       struct xdp_umem **xsk_umems;
> > +       u16 num_xsk_umems_used;
> > +       u16 num_xsk_umems;
> > +
> >  } ____cacheline_internodealigned_in_smp;
> >
> >  struct i40e_netdev_priv {
> > @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
> >         return !!vsi->xdp_prog;
> >  }
> >
> > +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> > +{
> > +       bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> > +       int qid = ring->queue_index;
> > +
> > +       if (ring_is_xdp(ring))
> > +               qid -= ring->vsi->alloc_queue_pairs;
> > +
> > +       if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> > +               return NULL;
> > +
> > +       return ring->vsi->xsk_umems[qid];
> > +}
> > +
> >  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
> >  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
> >  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> > @@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> >  int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
> >                                       struct i40e_cloud_filter *filter,
> >                                       bool add);
> > +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
> > +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
> > +
> >  #endif /* _I40E_H_ */
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index 369a116edaa1..8c602424d339 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -5,6 +5,7 @@
> >  #include <linux/of_net.h>
> >  #include <linux/pci.h>
> >  #include <linux/bpf.h>
> > +#include <net/xdp_sock.h>
> >
> >  /* Local includes */
> >  #include "i40e.h"
> > @@ -16,6 +17,7 @@
> >   */
> >  #define CREATE_TRACE_POINTS
> >  #include "i40e_trace.h"
> > +#include "i40e_xsk.h"
> >
> >  const char i40e_driver_name[] = "i40e";
> >  static const char i40e_driver_string[] =
> > @@ -3071,6 +3073,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
> >         i40e_status err = 0;
> >         u32 qtx_ctl = 0;
> >
> > +       if (ring_is_xdp(ring))
> > +               ring->xsk_umem = i40e_xsk_umem(ring);
> > +
> >         /* some ATR related tx ring init */
> >         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
> >                 ring->atr_sample_rate = vsi->back->atr_sample_rate;
> > @@ -3180,13 +3185,30 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> >         struct i40e_hw *hw = &vsi->back->hw;
> >         struct i40e_hmc_obj_rxq rx_ctx;
> >         i40e_status err = 0;
> > +       int ret;
> >
> >         bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
> >
> >         /* clear the context structure first */
> >         memset(&rx_ctx, 0, sizeof(rx_ctx));
> >
> > -       ring->rx_buf_len = vsi->rx_buf_len;
> > +       ring->xsk_umem = i40e_xsk_umem(ring);
> > +       if (ring->xsk_umem) {
> > +               ring->clean_rx_irq = i40e_clean_rx_irq_zc;
> > +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
> > +               ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
> > +                                  XDP_PACKET_HEADROOM;
> > +               ring->zca.free = i40e_zca_free;
> > +               ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> > +                                                MEM_TYPE_ZERO_COPY,
> > +                                                &ring->zca);
> > +               if (ret)
> > +                       return ret;
> > +       } else {
> > +               ring->clean_rx_irq = i40e_clean_rx_irq;
> > +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
> > +               ring->rx_buf_len = vsi->rx_buf_len;
> > +       }
>
> With everything that is going on with retpoline overhead, I am really
> wary of this. We may want to find another way to do this so that we
> can avoid the extra function pointer overhead -- for example, using a
> flag instead of function pointers.
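>
> I.e. something along these lines (untested; __I40E_RING_STATE_ZC is a
> made-up flag name), e.g. in i40e_napi_poll():
>
>         if (test_bit(__I40E_RING_STATE_ZC, ring->state))
>                 cleaned = i40e_clean_rx_irq_zc(ring, budget_per_ring);
>         else
>                 cleaned = i40e_clean_rx_irq(ring, budget_per_ring);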
>
> >         rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
> >                                     BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> > @@ -3242,7 +3264,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> >         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
> >         writel(0, ring->tail);
> >
> > -       i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
> > +       ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
> >
> >         return 0;
> >  }
> > @@ -12022,7 +12044,7 @@ static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
> >   *
> >   * Returns 0 on success, <0 on failure.
> >   **/
> > -static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
> > +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
> >  {
> >         int err;
> >
> > @@ -12047,7 +12069,7 @@ static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
> >   *
> >   * Returns 0 on success, <0 on failure.
> >   **/
> > -static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
> > +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
> >  {
> >         int err;
> >
> > @@ -12095,6 +12117,11 @@ static int i40e_xdp(struct net_device *dev,
> >                 xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
> >                 xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
> >                 return 0;
> > +       case XDP_QUERY_XSK_UMEM:
> > +               return 0;
> > +       case XDP_SETUP_XSK_UMEM:
> > +               return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
> > +                                          xdp->xsk.queue_id);
> >         default:
> >                 return -EINVAL;
> >         }
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > index 5f01e4ce9c92..6b1142fbc697 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > @@ -5,6 +5,7 @@
> >  #include <net/busy_poll.h>
> >  #include <linux/bpf_trace.h>
> >  #include <net/xdp.h>
> > +#include <net/xdp_sock.h>
> >  #include "i40e.h"
> >  #include "i40e_trace.h"
> >  #include "i40e_prototype.h"
> > @@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
> >   * This is used to verify if the FD programming or invalidation
> >   * requested by SW to the HW is successful or not and take actions accordingly.
> >   **/
> > -static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> > -                                 union i40e_rx_desc *rx_desc, u8 prog_id)
> > +void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> > +                          union i40e_rx_desc *rx_desc, u8 prog_id)
> >  {
> >         struct i40e_pf *pf = rx_ring->vsi->back;
> >         struct pci_dev *pdev = pf->pdev;
> > @@ -1246,25 +1247,6 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
> >         new_buff->pagecnt_bias  = old_buff->pagecnt_bias;
> >  }
> >
> > -/**
> > - * i40e_rx_is_programming_status - check for programming status descriptor
> > - * @qw: qword representing status_error_len in CPU ordering
> > - *
> > - * The value of in the descriptor length field indicate if this
> > - * is a programming status descriptor for flow director or FCoE
> > - * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
> > - * it is a packet descriptor.
> > - **/
> > -static inline bool i40e_rx_is_programming_status(u64 qw)
> > -{
> > -       /* The Rx filter programming status and SPH bit occupy the same
> > -        * spot in the descriptor. Since we don't support packet split we
> > -        * can just reuse the bit as an indication that this is a
> > -        * programming status descriptor.
> > -        */
> > -       return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
> > -}
> > -
> >  /**
> >   * i40e_clean_programming_status - clean the programming status descriptor
> >   * @rx_ring: the rx ring that has this descriptor
> > @@ -1373,31 +1355,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
> >         }
> >
> >         /* Free all the Rx ring sk_buffs */
> > -       for (i = 0; i < rx_ring->count; i++) {
> > -               struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> > -
> > -               if (!rx_bi->page)
> > -                       continue;
> > +       if (!rx_ring->xsk_umem) {
>
> Instead of changing the indent on all this code, it would probably be
> easier to just add a goto and a label to skip it.
>
> > +               for (i = 0; i < rx_ring->count; i++) {
> > +                       struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> >
> > -               /* Invalidate cache lines that may have been written to by
> > -                * device so that we avoid corrupting memory.
> > -                */
> > -               dma_sync_single_range_for_cpu(rx_ring->dev,
> > -                                             rx_bi->dma,
> > -                                             rx_bi->page_offset,
> > -                                             rx_ring->rx_buf_len,
> > -                                             DMA_FROM_DEVICE);
> > -
> > -               /* free resources associated with mapping */
> > -               dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> > -                                    i40e_rx_pg_size(rx_ring),
> > -                                    DMA_FROM_DEVICE,
> > -                                    I40E_RX_DMA_ATTR);
> > -
> > -               __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
> > +                       if (!rx_bi->page)
> > +                               continue;
> >
> > -               rx_bi->page = NULL;
> > -               rx_bi->page_offset = 0;
> > +                       /* Invalidate cache lines that may have been
> > +                        * written to by device so that we avoid
> > +                        * corrupting memory.
> > +                        */
> > +                       dma_sync_single_range_for_cpu(rx_ring->dev,
> > +                                                     rx_bi->dma,
> > +                                                     rx_bi->page_offset,
> > +                                                     rx_ring->rx_buf_len,
> > +                                                     DMA_FROM_DEVICE);
> > +
> > +                       /* free resources associated with mapping */
> > +                       dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> > +                                            i40e_rx_pg_size(rx_ring),
> > +                                            DMA_FROM_DEVICE,
> > +                                            I40E_RX_DMA_ATTR);
> > +
> > +                       __page_frag_cache_drain(rx_bi->page,
> > +                                               rx_bi->pagecnt_bias);
> > +
> > +                       rx_bi->page = NULL;
> > +                       rx_bi->page_offset = 0;
> > +               }
> >         }
> >
> >         bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
> > @@ -1487,27 +1473,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
> >         return err;
> >  }
> >
> > -/**
> > - * i40e_release_rx_desc - Store the new tail and head values
> > - * @rx_ring: ring to bump
> > - * @val: new head index
> > - **/
> > -static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
> > -{
> > -       rx_ring->next_to_use = val;
> > -
> > -       /* update next to alloc since we have filled the ring */
> > -       rx_ring->next_to_alloc = val;
> > -
> > -       /* Force memory writes to complete before letting h/w
> > -        * know there are new descriptors to fetch.  (Only
> > -        * applicable for weak-ordered memory model archs,
> > -        * such as IA-64).
> > -        */
> > -       wmb();
> > -       writel(val, rx_ring->tail);
> > -}
> > -
> >  /**
> >   * i40e_rx_offset - Return expected offset into page to access data
> >   * @rx_ring: Ring we are requesting offset of
> > @@ -1576,8 +1541,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
> >   * @skb: packet to send up
> >   * @vlan_tag: vlan tag for packet
> >   **/
> > -static void i40e_receive_skb(struct i40e_ring *rx_ring,
> > -                            struct sk_buff *skb, u16 vlan_tag)
> > +void i40e_receive_skb(struct i40e_ring *rx_ring,
> > +                     struct sk_buff *skb, u16 vlan_tag)
> >  {
> >         struct i40e_q_vector *q_vector = rx_ring->q_vector;
> >
> > @@ -1804,7 +1769,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
> >   * order to populate the hash, checksum, VLAN, protocol, and
> >   * other fields within the skb.
> >   **/
> > -static inline
> >  void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> >                              union i40e_rx_desc *rx_desc, struct sk_buff *skb,
> >                              u8 rx_ptype)
> > @@ -1829,46 +1793,6 @@ void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> >         skb->protocol = eth_type_trans(skb, rx_ring->netdev);
> >  }
> >
> > -/**
> > - * i40e_cleanup_headers - Correct empty headers
> > - * @rx_ring: rx descriptor ring packet is being transacted on
> > - * @skb: pointer to current skb being fixed
> > - * @rx_desc: pointer to the EOP Rx descriptor
> > - *
> > - * Also address the case where we are pulling data in on pages only
> > - * and as such no data is present in the skb header.
> > - *
> > - * In addition if skb is not at least 60 bytes we need to pad it so that
> > - * it is large enough to qualify as a valid Ethernet frame.
> > - *
> > - * Returns true if an error was encountered and skb was freed.
> > - **/
> > -static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
> > -                                union i40e_rx_desc *rx_desc)
> > -
> > -{
> > -       /* XDP packets use error pointer so abort at this point */
> > -       if (IS_ERR(skb))
> > -               return true;
> > -
> > -       /* ERR_MASK will only have valid bits if EOP set, and
> > -        * what we are doing here is actually checking
> > -        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> > -        * the error field
> > -        */
> > -       if (unlikely(i40e_test_staterr(rx_desc,
> > -                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> > -               dev_kfree_skb_any(skb);
> > -               return true;
> > -       }
> > -
> > -       /* if eth_skb_pad returns an error the skb was freed */
> > -       if (eth_skb_pad(skb))
> > -               return true;
> > -
> > -       return false;
> > -}
> > -
> >  /**
> >   * i40e_page_is_reusable - check if any reuse is possible
> >   * @page: page struct to check
> > @@ -2177,15 +2101,11 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
> >         return true;
> >  }
> >
> > -#define I40E_XDP_PASS 0
> > -#define I40E_XDP_CONSUMED 1
> > -#define I40E_XDP_TX 2
> > -
> >  static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
> >                               struct i40e_ring *xdp_ring);
> >
> > -static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> > -                                struct i40e_ring *xdp_ring)
> > +int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> > +                         struct i40e_ring *xdp_ring)
> >  {
> >         struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
> >
> > @@ -2214,8 +2134,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
> >         if (!xdp_prog)
> >                 goto xdp_out;
> >
> > -       prefetchw(xdp->data_hard_start); /* xdp_frame write */
> > -
> >         act = bpf_prog_run_xdp(xdp_prog, xdp);
> >         switch (act) {
> >         case XDP_PASS:
> > @@ -2263,15 +2181,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
> >  #endif
> >  }
> >
> > -static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> > -{
> > -       /* Force memory writes to complete before letting h/w
> > -        * know there are new descriptors to fetch.
> > -        */
> > -       wmb();
> > -       writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
> > -}
> > -
> >  /**
> >   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
> >   * @rx_ring: rx descriptor ring to transact packets on
> > @@ -2284,7 +2193,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> >   *
> >   * Returns amount of work completed
> >   **/
> > -static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> > +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> >  {
> >         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> >         struct sk_buff *skb = rx_ring->skb;
> > @@ -2576,7 +2485,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
> >         budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
> >
> >         i40e_for_each_ring(ring, q_vector->rx) {
> > -               int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
> > +               int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
> >
> >                 work_done += cleaned;
> >                 /* if we clean as many as budgeted, we must not be done */
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index 820f76db251b..cddb185cd2f8 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -296,13 +296,22 @@ struct i40e_tx_buffer {
> >
> >  struct i40e_rx_buffer {
> >         dma_addr_t dma;
> > -       struct page *page;
> > +       union {
> > +               struct {
> > +                       struct page *page;
> >  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> > -       __u32 page_offset;
> > +                       __u32 page_offset;
> >  #else
> > -       __u16 page_offset;
> > +                       __u16 page_offset;
> >  #endif
> > -       __u16 pagecnt_bias;
> > +                       __u16 pagecnt_bias;
> > +               };
> > +               struct {
> > +                       /* for umem */
> > +                       void *addr;
> > +                       u64 handle;
> > +               };
>
> It might work better to just do this as a pair of unions: one for
> page/addr and another for handle, page_offset, and pagecnt_bias.
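>
> I.e. something along these lines (untested):
>
>         struct i40e_rx_buffer {
>                 dma_addr_t dma;
>                 union {
>                         struct page *page;
>                         void *addr;             /* for umem */
>                 };
>                 union {
>                         struct {
> #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
>                                 __u32 page_offset;
> #else
>                                 __u16 page_offset;
> #endif
>                                 __u16 pagecnt_bias;
>                         };
>                         u64 handle;             /* for umem */
>                 };
>         };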
>
> > +       };
> >  };
> >
> >  struct i40e_queue_stats {
> > @@ -414,6 +423,12 @@ struct i40e_ring {
> >
> >         struct i40e_channel *ch;
> >         struct xdp_rxq_info xdp_rxq;
> > +
> > +       int (*clean_rx_irq)(struct i40e_ring *ring, int budget);
> > +       bool (*alloc_rx_buffers)(struct i40e_ring *ring, u16 n);
> > +       struct xdp_umem *xsk_umem;
> > +
> > +       struct zero_copy_allocator zca; /* ZC allocator anchor */
> >  } ____cacheline_internodealigned_in_smp;
> >
> >  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> > @@ -490,6 +505,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb);
> >  int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> >                   u32 flags);
> >  void i40e_xdp_flush(struct net_device *dev);
> > +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
> >
> >  /**
> >   * i40e_get_head - Retrieve head from head writeback
> > @@ -576,4 +592,108 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
> >  {
> >         return netdev_get_tx_queue(ring->netdev, ring->queue_index);
> >  }
> > +
> > +#define I40E_XDP_PASS 0
> > +#define I40E_XDP_CONSUMED 1
> > +#define I40E_XDP_TX 2
> > +
> > +/**
> > + * i40e_release_rx_desc - Store the new tail and head values
> > + * @rx_ring: ring to bump
> > + * @val: new head index
> > + **/
> > +static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
> > +{
> > +       rx_ring->next_to_use = val;
> > +
> > +       /* update next to alloc since we have filled the ring */
> > +       rx_ring->next_to_alloc = val;
> > +
> > +       /* Force memory writes to complete before letting h/w
> > +        * know there are new descriptors to fetch.  (Only
> > +        * applicable for weak-ordered memory model archs,
> > +        * such as IA-64).
> > +        */
> > +       wmb();
> > +       writel(val, rx_ring->tail);
> > +}
> > +
> > +/**
> > + * i40e_rx_is_programming_status - check for programming status descriptor
> > + * @qw: qword representing status_error_len in CPU ordering
> > + *
> > + * The value of in the descriptor length field indicate if this
> > + * is a programming status descriptor for flow director or FCoE
> > + * by the value of I40E_RX_PROG_STATUS_DESC_LENGTH, otherwise
> > + * it is a packet descriptor.
> > + **/
> > +static inline bool i40e_rx_is_programming_status(u64 qw)
> > +{
> > +       /* The Rx filter programming status and SPH bit occupy the same
> > +        * spot in the descriptor. Since we don't support packet split we
> > +        * can just reuse the bit as an indication that this is a
> > +        * programming status descriptor.
> > +        */
> > +       return qw & I40E_RXD_QW1_LENGTH_SPH_MASK;
> > +}
> > +
> > +/**
> > + * i40e_cleanup_headers - Correct empty headers
> > + * @rx_ring: rx descriptor ring packet is being transacted on
> > + * @skb: pointer to current skb being fixed
> > + * @rx_desc: pointer to the EOP Rx descriptor
> > + *
> > + * Also address the case where we are pulling data in on pages only
> > + * and as such no data is present in the skb header.
> > + *
> > + * In addition if skb is not at least 60 bytes we need to pad it so that
> > + * it is large enough to qualify as a valid Ethernet frame.
> > + *
> > + * Returns true if an error was encountered and skb was freed.
> > + **/
> > +static inline bool i40e_cleanup_headers(struct i40e_ring *rx_ring,
> > +                                       struct sk_buff *skb,
> > +                                       union i40e_rx_desc *rx_desc)
> > +
> > +{
> > +       /* XDP packets use error pointer so abort at this point */
> > +       if (IS_ERR(skb))
> > +               return true;
> > +
> > +       /* ERR_MASK will only have valid bits if EOP set, and
> > +        * what we are doing here is actually checking
> > +        * I40E_RX_DESC_ERROR_RXE_SHIFT, since it is the zeroth bit in
> > +        * the error field
> > +        */
> > +       if (unlikely(i40e_test_staterr(rx_desc,
> > +                                      BIT(I40E_RXD_QW1_ERROR_SHIFT)))) {
> > +               dev_kfree_skb_any(skb);
> > +               return true;
> > +       }
> > +
> > +       /* if eth_skb_pad returns an error the skb was freed */
> > +       if (eth_skb_pad(skb))
> > +               return true;
> > +
> > +       return false;
> > +}
> > +
> > +static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
> > +{
> > +       /* Force memory writes to complete before letting h/w
> > +        * know there are new descriptors to fetch.
> > +        */
> > +       wmb();
> > +       writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
> > +}
> > +
> > +void i40e_fd_handle_status(struct i40e_ring *rx_ring,
> > +                          union i40e_rx_desc *rx_desc, u8 prog_id);
> > +int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
> > +                         struct i40e_ring *xdp_ring);
> > +void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> > +                            union i40e_rx_desc *rx_desc, struct sk_buff *skb,
> > +                            u8 rx_ptype);
> > +void i40e_receive_skb(struct i40e_ring *rx_ring,
> > +                     struct sk_buff *skb, u16 vlan_tag);
> >  #endif /* _I40E_TXRX_H_ */
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > new file mode 100644
> > index 000000000000..9d16924415b9
> > --- /dev/null
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -0,0 +1,537 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2018 Intel Corporation. */
> > +
> > +#include <linux/bpf_trace.h>
> > +#include <net/xdp_sock.h>
> > +#include <net/xdp.h>
> > +
> > +#include "i40e.h"
> > +#include "i40e_txrx.h"
> > +
> > +static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
> > +{
> > +       if (vsi->xsk_umems)
> > +               return 0;
> > +
> > +       vsi->num_xsk_umems_used = 0;
> > +       vsi->num_xsk_umems = vsi->alloc_queue_pairs;
> > +       vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
> > +                                GFP_KERNEL);
> > +       if (!vsi->xsk_umems) {
> > +               vsi->num_xsk_umems = 0;
> > +               return -ENOMEM;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
> > +                            u16 qid)
> > +{
> > +       int err;
> > +
> > +       err = i40e_alloc_xsk_umems(vsi);
> > +       if (err)
> > +               return err;
> > +
> > +       vsi->xsk_umems[qid] = umem;
> > +       vsi->num_xsk_umems_used++;
> > +
> > +       return 0;
> > +}
> > +
> > +static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
> > +{
> > +       vsi->xsk_umems[qid] = NULL;
> > +       vsi->num_xsk_umems_used--;
> > +
> > +       if (vsi->num_xsk_umems == 0) {
> > +               kfree(vsi->xsk_umems);
> > +               vsi->xsk_umems = NULL;
> > +               vsi->num_xsk_umems = 0;
> > +       }
> > +}
> > +
> > +static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
> > +{
> > +       struct i40e_pf *pf = vsi->back;
> > +       struct device *dev;
> > +       unsigned int i, j;
> > +       dma_addr_t dma;
> > +
> > +       dev = &pf->pdev->dev;
> > +       for (i = 0; i < umem->npgs; i++) {
> > +               dma = dma_map_page_attrs(dev, umem->pgs[i], 0, PAGE_SIZE,
> > +                                        DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> > +               if (dma_mapping_error(dev, dma))
> > +                       goto out_unmap;
> > +
> > +               umem->pages[i].dma = dma;
> > +       }
> > +
> > +       return 0;
> > +
> > +out_unmap:
> > +       for (j = 0; j < i; j++) {
> > +               dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
> > +                                    DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> > +               umem->pages[i].dma = 0;
> > +       }
> > +
> > +       return -1;
> > +}
> > +
> > +static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
> > +{
> > +       struct i40e_pf *pf = vsi->back;
> > +       struct device *dev;
> > +       unsigned int i;
> > +
> > +       dev = &pf->pdev->dev;
> > +
> > +       for (i = 0; i < umem->npgs; i++) {
> > +               dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
> > +                                    DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> > +
> > +               umem->pages[i].dma = 0;
> > +       }
> > +}
> > +
> > +static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
> > +                               u16 qid)
> > +{
> > +       bool if_running;
> > +       int err;
> > +
> > +       if (vsi->type != I40E_VSI_MAIN)
> > +               return -EINVAL;
> > +
> > +       if (qid >= vsi->num_queue_pairs)
> > +               return -EINVAL;
> > +
> > +       if (vsi->xsk_umems && vsi->xsk_umems[qid])
> > +               return -EBUSY;
> > +
> > +       err = i40e_xsk_umem_dma_map(vsi, umem);
> > +       if (err)
> > +               return err;
> > +
> > +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> > +
> > +       if (if_running) {
> > +               err = i40e_queue_pair_disable(vsi, qid);
> > +               if (err)
> > +                       return err;
> > +       }
> > +
> > +       err = i40e_add_xsk_umem(vsi, umem, qid);
> > +       if (err)
> > +               return err;
> > +
> > +       if (if_running) {
> > +               err = i40e_queue_pair_enable(vsi, qid);
> > +               if (err)
> > +                       return err;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
> > +{
> > +       bool if_running;
> > +       int err;
> > +
> > +       if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
> > +           !vsi->xsk_umems[qid])
> > +               return -EINVAL;
> > +
> > +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> > +
> > +       if (if_running) {
> > +               err = i40e_queue_pair_disable(vsi, qid);
> > +               if (err)
> > +                       return err;
> > +       }
> > +
> > +       i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
> > +       i40e_remove_xsk_umem(vsi, qid);
> > +
> > +       if (if_running) {
> > +               err = i40e_queue_pair_enable(vsi, qid);
> > +               if (err)
> > +                       return err;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> > +                       u16 qid)
> > +{
> > +       if (umem)
> > +               return i40e_xsk_umem_enable(vsi, umem, qid);
> > +
> > +       return i40e_xsk_umem_disable(vsi, qid);
> > +}
> > +
> > +static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
> > +                                      struct xdp_buff *xdp)
> > +{
> > +       int err, result = I40E_XDP_PASS;
> > +       struct i40e_ring *xdp_ring;
> > +       struct bpf_prog *xdp_prog;
> > +       u32 act;
> > +       u16 off;
> > +
> > +       rcu_read_lock();
> > +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> > +       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > +       off = xdp->data - xdp->data_hard_start;
> > +       xdp->handle += off;
> > +       switch (act) {
> > +       case XDP_PASS:
> > +               break;
> > +       case XDP_TX:
> > +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> > +               result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
> > +               break;
> > +       case XDP_REDIRECT:
> > +               err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> > +               result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> > +               break;
> > +       default:
> > +               bpf_warn_invalid_xdp_action(act);
> > +       case XDP_ABORTED:
> > +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
> > +               /* fallthrough -- handle aborts by dropping packet */
> > +       case XDP_DROP:
> > +               result = I40E_XDP_CONSUMED;
> > +               break;
> > +       }
> > +
> > +       rcu_read_unlock();
> > +       return ERR_PTR(-result);
> > +}
> > +
> > +static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
> > +                               struct i40e_rx_buffer *bi)
> > +{
> > +       struct xdp_umem *umem = rx_ring->xsk_umem;
> > +       void *addr = bi->addr;
> > +       u64 handle;
> > +
> > +       if (addr) {
> > +               rx_ring->rx_stats.page_reuse_count++;
> > +               return true;
> > +       }
> > +
> > +       if (!xsk_umem_peek_addr(umem, &handle)) {
> > +               rx_ring->rx_stats.alloc_page_failed++;
> > +               return false;
> > +       }
> > +
> > +       bi->dma = xdp_umem_get_dma(umem, handle);
> > +       bi->addr = xdp_umem_get_data(umem, handle);
> > +
> > +       bi->dma += umem->headroom + XDP_PACKET_HEADROOM;
> > +       bi->addr += umem->headroom + XDP_PACKET_HEADROOM;
> > +       bi->handle = handle + umem->headroom;
> > +
> > +       xsk_umem_discard_addr(umem);
> > +       return true;
> > +}
> > +
> > +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
> > +{
> > +       u16 ntu = rx_ring->next_to_use;
> > +       union i40e_rx_desc *rx_desc;
> > +       struct i40e_rx_buffer *bi;
> > +
> > +       rx_desc = I40E_RX_DESC(rx_ring, ntu);
> > +       bi = &rx_ring->rx_bi[ntu];
> > +
> > +       do {
> > +               if (!i40e_alloc_frame_zc(rx_ring, bi))
> > +                       goto no_buffers;
> > +
> > +               /* sync the buffer for use by the device */
> > +               dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
> > +                                                rx_ring->rx_buf_len,
> > +                                                DMA_BIDIRECTIONAL);
> > +
> > +               /* Refresh the desc even if buffer_addrs didn't change
> > +                * because each write-back erases this info.
> > +                */
> > +               rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
> > +
> > +               rx_desc++;
> > +               bi++;
> > +               ntu++;
> > +               if (unlikely(ntu == rx_ring->count)) {
> > +                       rx_desc = I40E_RX_DESC(rx_ring, 0);
> > +                       bi = rx_ring->rx_bi;
> > +                       ntu = 0;
> > +               }
> > +
> > +               /* clear the status bits for the next_to_use descriptor */
> > +               rx_desc->wb.qword1.status_error_len = 0;
> > +
> > +               cleaned_count--;
> > +       } while (cleaned_count);
> > +
> > +       if (rx_ring->next_to_use != ntu)
> > +               i40e_release_rx_desc(rx_ring, ntu);
> > +
> > +       return false;
> > +
> > +no_buffers:
> > +       if (rx_ring->next_to_use != ntu)
> > +               i40e_release_rx_desc(rx_ring, ntu);
> > +
> > +       /* make sure to come back via polling to try again after
> > +        * allocation failure
> > +        */
> > +       return true;
> > +}
> > +
> > +static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
> > +                                                   const unsigned int size)
> > +{
> > +       struct i40e_rx_buffer *rx_buffer;
> > +
> > +       rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> > +
> > +       /* we are reusing so sync this buffer for CPU use */
> > +       dma_sync_single_range_for_cpu(rx_ring->dev,
> > +                                     rx_buffer->dma, 0,
> > +                                     size,
> > +                                     DMA_BIDIRECTIONAL);
> > +
> > +       return rx_buffer;
> > +}
> > +
> > +static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
> > +                                   struct i40e_rx_buffer *old_buff)
> > +{
> > +       u64 mask = rx_ring->xsk_umem->props.chunk_mask;
> > +       u64 hr = rx_ring->xsk_umem->headroom;
> > +       u16 nta = rx_ring->next_to_alloc;
> > +       struct i40e_rx_buffer *new_buff;
> > +
> > +       new_buff = &rx_ring->rx_bi[nta];
> > +
> > +       /* update, and store next to alloc */
> > +       nta++;
> > +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> > +
> > +       /* transfer page from old buffer to new buffer */
> > +       new_buff->dma           = old_buff->dma & mask;
> > +       new_buff->addr          = (void *)((u64)old_buff->addr & mask);
> > +       new_buff->handle        = old_buff->handle & mask;
> > +
> > +       new_buff->dma += hr + XDP_PACKET_HEADROOM;
> > +       new_buff->addr += hr + XDP_PACKET_HEADROOM;
> > +       new_buff->handle += hr;
> > +}
> > +
> > +/* Called from the XDP return API in NAPI context. */
> > +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
> > +{
> > +       struct i40e_rx_buffer *new_buff;
> > +       struct i40e_ring *rx_ring;
> > +       u64 mask;
> > +       u16 nta;
> > +
> > +       rx_ring = container_of(alloc, struct i40e_ring, zca);
> > +       mask = rx_ring->xsk_umem->props.chunk_mask;
> > +
> > +       nta = rx_ring->next_to_alloc;
> > +
> > +       new_buff = &rx_ring->rx_bi[nta];
> > +
> > +       /* update, and store next to alloc */
> > +       nta++;
> > +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> > +
> > +       handle &= mask;
> > +
> > +       new_buff->dma           = xdp_umem_get_dma(rx_ring->xsk_umem, handle);
> > +       new_buff->addr          = xdp_umem_get_data(rx_ring->xsk_umem, handle);
> > +       new_buff->handle        = (u64)handle;
> > +
> > +       new_buff->dma += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> > +       new_buff->addr += rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> > +       new_buff->handle += rx_ring->xsk_umem->headroom;
> > +}
> > +
> > +static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
> > +                                           struct i40e_rx_buffer *rx_buffer,
> > +                                           struct xdp_buff *xdp)
> > +{
> > +       /* XXX implement alloc skb and copy */
> > +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> > +       return NULL;
> > +}
> > +
> > +static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
> > +                                            union i40e_rx_desc *rx_desc,
> > +                                            u64 qw)
> > +{
> > +       struct i40e_rx_buffer *rx_buffer;
> > +       u32 ntc = rx_ring->next_to_clean;
> > +       u8 id;
> > +
> > +       /* fetch, update, and store next to clean */
> > +       rx_buffer = &rx_ring->rx_bi[ntc++];
> > +       ntc = (ntc < rx_ring->count) ? ntc : 0;
> > +       rx_ring->next_to_clean = ntc;
> > +
> > +       prefetch(I40E_RX_DESC(rx_ring, ntc));
> > +
> > +       /* place unused page back on the ring */
> > +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> > +       rx_ring->rx_stats.page_reuse_count++;
> > +
> > +       /* clear contents of buffer_info */
> > +       rx_buffer->addr = NULL;
> > +
> > +       id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
> > +                 I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
> > +
> > +       if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> > +               i40e_fd_handle_status(rx_ring, rx_desc, id);
> > +}
> > +
> > +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> > +{
> > +       unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> > +       u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> > +       bool failure = false, xdp_xmit = false;
> > +       struct sk_buff *skb;
> > +       struct xdp_buff xdp;
> > +
> > +       xdp.rxq = &rx_ring->xdp_rxq;
> > +
> > +       while (likely(total_rx_packets < (unsigned int)budget)) {
> > +               struct i40e_rx_buffer *rx_buffer;
> > +               union i40e_rx_desc *rx_desc;
> > +               unsigned int size;
> > +               u16 vlan_tag;
> > +               u8 rx_ptype;
> > +               u64 qword;
> > +               u32 ntc;
> > +
> > +               /* return some buffers to hardware, one at a time is too slow */
> > +               if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> > +                       failure = failure ||
> > +                                 i40e_alloc_rx_buffers_zc(rx_ring,
> > +                                                          cleaned_count);
> > +                       cleaned_count = 0;
> > +               }
> > +
> > +               rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
> > +
> > +               /* status_error_len will always be zero for unused descriptors
> > +                * because it's cleared in cleanup, and overlaps with hdr_addr
> > +                * which is always zero because packet split isn't used, if the
> > +                * hardware wrote DD then the length will be non-zero
> > +                */
> > +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> > +
> > +               /* This memory barrier is needed to keep us from reading
> > +                * any other fields out of the rx_desc until we have
> > +                * verified the descriptor has been written back.
> > +                */
> > +               dma_rmb();
> > +
> > +               if (unlikely(i40e_rx_is_programming_status(qword))) {
> > +                       i40e_clean_programming_status_zc(rx_ring, rx_desc,
> > +                                                        qword);
> > +                       cleaned_count++;
> > +                       continue;
> > +               }
> > +               size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > +                      I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
> > +               if (!size)
> > +                       break;
> > +
> > +               rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
> > +
> > +               /* retrieve a buffer from the ring */
> > +               xdp.data = rx_buffer->addr;
> > +               xdp_set_data_meta_invalid(&xdp);
> > +               xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
> > +               xdp.data_end = xdp.data + size;
> > +               xdp.handle = rx_buffer->handle;
> > +
> > +               skb = i40e_run_xdp_zc(rx_ring, &xdp);
> > +
> > +               if (IS_ERR(skb)) {
> > +                       if (PTR_ERR(skb) == -I40E_XDP_TX)
> > +                               xdp_xmit = true;
> > +                       else
> > +                               i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> > +                       total_rx_bytes += size;
> > +                       total_rx_packets++;
> > +               } else {
> > +                       skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
> > +                       if (!skb) {
> > +                               rx_ring->rx_stats.alloc_buff_failed++;
> > +                               break;
> > +                       }
> > +               }
> > +
> > +               rx_buffer->addr = NULL;
> > +               cleaned_count++;
> > +
> > +               /* don't care about non-EOP frames in XDP mode */
> > +               ntc = rx_ring->next_to_clean + 1;
> > +               ntc = (ntc < rx_ring->count) ? ntc : 0;
> > +               rx_ring->next_to_clean = ntc;
> > +               prefetch(I40E_RX_DESC(rx_ring, ntc));
> > +
> > +               if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
> > +                       skb = NULL;
> > +                       continue;
> > +               }
> > +
> > +               /* probably a little skewed due to removing CRC */
> > +               total_rx_bytes += skb->len;
> > +
> > +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> > +               rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
> > +                          I40E_RXD_QW1_PTYPE_SHIFT;
> > +
> > +               /* populate checksum, VLAN, and protocol */
> > +               i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
> > +
> > +               vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
> > +                          le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
> > +
> > +               i40e_receive_skb(rx_ring, skb, vlan_tag);
> > +               skb = NULL;
> > +
> > +               /* update budget accounting */
> > +               total_rx_packets++;
> > +       }
> > +
> > +       if (xdp_xmit) {
> > +               struct i40e_ring *xdp_ring =
> > +                       rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> > +
> > +               i40e_xdp_ring_update_tail(xdp_ring);
> > +               xdp_do_flush_map();
> > +       }
> > +
> > +       u64_stats_update_begin(&rx_ring->syncp);
> > +       rx_ring->stats.packets += total_rx_packets;
> > +       rx_ring->stats.bytes += total_rx_bytes;
> > +       u64_stats_update_end(&rx_ring->syncp);
> > +       rx_ring->q_vector->rx.total_packets += total_rx_packets;
> > +       rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
> > +
> > +       /* guarantee a trip back through this routine if there was a failure */
> > +       return failure ? budget : (int)total_rx_packets;
> > +}
> > +
>
> You should really look at adding comments to the code you are adding.
> From what I can tell almost all of the code comments were just copied
> exactly from the original functions in the i40e_txrx.c file.
>
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> > new file mode 100644
> > index 000000000000..757ac5ca8511
> > --- /dev/null
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> > @@ -0,0 +1,17 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright(c) 2018 Intel Corporation. */
> > +
> > +#ifndef _I40E_XSK_H_
> > +#define _I40E_XSK_H_
> > +
> > +struct i40e_vsi;
> > +struct xdp_umem;
> > +struct zero_copy_allocator;
> > +
> > +int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> > +                       u16 qid);
> > +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
> > +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
> > +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> > +
> > +#endif /* _I40E_XSK_H_ */
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index 9fe472f2ac95..ec8fd3314097 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -94,6 +94,25 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
> >  {
> >         return false;
> >  }
> > +
> > +static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
> > +{
> > +       return NULL;
> > +}
> > +
> > +static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
> > +{
> > +}
> >  #endif /* CONFIG_XDP_SOCKETS */
> >
> > +static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> > +{
> > +       return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> > +}
> > +
> > +static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
> > +{
> > +       return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> > +}
> > +
> >  #endif /* _LINUX_XDP_SOCK_H */
> > diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> > index f11560334f88..c8be1ad3eb88 100644
> > --- a/net/xdp/xdp_umem.h
> > +++ b/net/xdp/xdp_umem.h
> > @@ -8,16 +8,6 @@
> >
> >  #include <net/xdp_sock.h>
> >
> > -static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> > -{
> > -       return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> > -}
> > -
> > -static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
> > -{
> > -       return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> > -}
> > -
> >  int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
> >                         u32 queue_id, u16 flags);
> >  bool xdp_umem_validate_queues(struct xdp_umem *umem);
> > --
> > 2.14.1
> >

Apologies for the late response, Alex.

We'll address all the items above, and also your Tx ZC related
comments. Thanks for the quick reply!


Björn

^ permalink raw reply	[flat|nested] 22+ messages in thread

* af_xdp zero copy ideas
  2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
                   ` (11 preceding siblings ...)
  2018-06-04 16:38 ` [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Alexei Starovoitov
@ 2018-11-14  8:10 ` Michael S. Tsirkin
  12 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2018-11-14  8:10 UTC (permalink / raw)
  To: bjorn.topel; +Cc: netdev

So as I mentioned during the presentation on the af_xdp zero copy, I
think it's pretty important to be able to close the device and get back
the affected memory. One way would be to unmap the DMA memory from
userspace and map in some other memory. It's tricky since you also need
to replace the mapping to the backing file, which could be hugetlbfs,
tmpfs, just a file ...

HTH,

-- 
MST

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [bpf-next,02/11] xsk: introduce xdp_umem_page
  2018-06-04 12:05 ` [PATCH bpf-next 02/11] xsk: introduce xdp_umem_page Björn Töpel
@ 2019-03-13  9:39   ` Jiri Slaby
  2019-03-13 11:23     ` Björn Töpel
  0 siblings, 1 reply; 22+ messages in thread
From: Jiri Slaby @ 2019-03-13  9:39 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, magnus.karlsson,
	alexander.h.duyck, alexander.duyck, ast, brouer, daniel, netdev,
	mykyta.iziumtsev
  Cc: Björn Töpel, john.fastabend, willemdebruijn.kernel,
	mst, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, francois.ozog, ilias.apalodimas, brian.brooks, andy,
	michael.chan, intel-wired-lan

On 04. 06. 18, 14:05, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The xdp_umem_page holds the address for a page. Trade memory for
> faster lookup. Later, we'll add DMA address here as well.
...
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
...
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -65,6 +65,9 @@ static void xdp_umem_release(struct xdp_umem *umem)
>  		goto out;
>  
>  	mmput(mm);
> +	kfree(umem->pages);
> +	umem->pages = NULL;
> +

Are you sure about the placement of kfree here? Why is it dependent on
task && mm above?

IMO the kfree should be below "out:":

>  	xdp_umem_unaccount_pages(umem);
>  out:
>  	kfree(umem);
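I.e. roughly like this (untested sketch, only to illustrate the
ordering I mean):

	mmput(mm);
	xdp_umem_unaccount_pages(umem);
out:
	kfree(umem->pages);	/* untested: freed even when task/mm are gone */
	umem->pages = NULL;
	kfree(umem);
}

That way umem->pages would be freed even when the goto-out path is
taken.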
Syzkaller reported this memleak:
r0 = socket$xdp(0x2c, 0x3, 0x0)
setsockopt$XDP_UMEM_REG(r0, 0x11b, 0x4,
&(0x7f0000000100)={&(0x7f0000000000)=""/210, 0x20000, 0x1000, 0x7}, 0x18)
BUG: memory leak
unreferenced object 0xffff88003648de68 (size 512):
  comm "syz-executor.3", pid 11688, jiffies 4295555546 (age 15.752s)
  hex dump (first 32 bytes):
    00 00 40 23 00 88 ff ff 00 00 00 00 00 00 00 00  ..@#............
    00 10 40 23 00 88 ff ff 00 00 00 00 00 00 00 00  ..@#............
  backtrace:
    [<ffffffffa9f8346c>] xsk_setsockopt+0x40c/0x510 net/xdp/xsk.c:539
    [<ffffffffa9935c41>] SyS_setsockopt+0x171/0x370 net/socket.c:1862
    [<ffffffffa800b28c>] do_syscall_64+0x26c/0x6e0
arch/x86/entry/common.c:284
    [<ffffffffaa00009a>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [<ffffffffffffffff>] 0xffffffffffffffff


Given the size of the leak, it looks like umem->pages is leaked:
mr->len/page_size*sizeof(*umem->pages)
0x20000/4096*16=512

So I added a check, and really, task is NULL in my testcase -- the
program is gone when the deferred work triggers. But umem->pages is not
freed.

Moving the free after "out:", no leaks happen anymore.

Easily reproducible with:
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <linux/if_xdp.h>

void fun()
{
        static char buffer[0x20000] __attribute__((aligned(4096)));
        struct xdp_umem_reg mr = {
                (unsigned long)buffer,
                0x20000,
                0x1000,
                0x7,
         //&(0x7f0000000100)={&(0x7f0000000000)=""/210, 0x20000, 0x1000,
0x7}
        };
        int r0;
        r0 = socket(AF_XDP, SOCK_RAW, 0);
        if (r0 < 0)
                err(1, "socket");
        if (setsockopt(r0, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) < 0)
                err(1, "setsockopt");
        close(r0);
}

int main()
{
        int a;
        while (1) {
                for (a = 0; a < 40; a++)
                        if (!fork()) {
                                fun();
                                exit(0);
                        }
                for (a = 0; a < 100; a++)
                        wait(NULL);
        }
}


thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [bpf-next,02/11] xsk: introduce xdp_umem_page
  2019-03-13  9:39   ` [bpf-next,02/11] " Jiri Slaby
@ 2019-03-13 11:23     ` Björn Töpel
  0 siblings, 0 replies; 22+ messages in thread
From: Björn Töpel @ 2019-03-13 11:23 UTC (permalink / raw)
  To: Jiri Slaby, Björn Töpel, magnus.karlsson,
	magnus.karlsson, alexander.h.duyck, alexander.duyck, ast, brouer,
	daniel, netdev, mykyta.iziumtsev
  Cc: john.fastabend, willemdebruijn.kernel, mst, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, francois.ozog,
	ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan

On 2019-03-13 10:39, Jiri Slaby wrote:
> On 04. 06. 18, 14:05, Björn Töpel wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>> 
>> The xdp_umem_page holds the address for a page. Trade memory for 
>> faster lookup. Later, we'll add DMA address here as well.
> ...
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
> ...
>> --- a/net/xdp/xdp_umem.c
>> +++ b/net/xdp/xdp_umem.c
>> @@ -65,6 +65,9 @@ static void xdp_umem_release(struct xdp_umem *umem)
>>  		goto out;
>>  
>>  	mmput(mm);
>> +	kfree(umem->pages);
>> +	umem->pages = NULL;
>> +
> 
> Are you sure about the placement of kfree here? Why is it dependent
> on task && mm above?
> 
> IMO the kfree should be below "out:":
> 
>>  	xdp_umem_unaccount_pages(umem);
>>  out:
>>  	kfree(umem);
> Syzkaller reported this memleak:
> r0 = socket$xdp(0x2c, 0x3, 0x0)
> setsockopt$XDP_UMEM_REG(r0, 0x11b, 0x4,
> &(0x7f0000000100)={&(0x7f0000000000)=""/210, 0x20000, 0x1000, 0x7}, 0x18)
> BUG: memory leak
> unreferenced object 0xffff88003648de68 (size 512):
>   comm "syz-executor.3", pid 11688, jiffies 4295555546 (age 15.752s)
>   hex dump (first 32 bytes):
>     00 00 40 23 00 88 ff ff 00 00 00 00 00 00 00 00  ..@#............
>     00 10 40 23 00 88 ff ff 00 00 00 00 00 00 00 00  ..@#............
>   backtrace:
>     [<ffffffffa9f8346c>] xsk_setsockopt+0x40c/0x510 net/xdp/xsk.c:539
>     [<ffffffffa9935c41>] SyS_setsockopt+0x171/0x370 net/socket.c:1862
>     [<ffffffffa800b28c>] do_syscall_64+0x26c/0x6e0 arch/x86/entry/common.c:284
>     [<ffffffffaa00009a>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>     [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Given the size of the leak, it looks like umem->pages is leaked:
> mr->len/page_size*sizeof(*umem->pages)
> 0x20000/4096*16=512
> 
> So I added a check, and really, task is NULL in my testcase -- the 
> program is gone when the deferred work triggers. But umem->pages is
> not freed.
> 
> Moving the free after "out:", no leaks happen anymore.
> 
> Easily reproducible with:
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> #include <sys/socket.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> 
> #include <linux/if_xdp.h>
> 
> void fun()
> {
>         static char buffer[0x20000] __attribute__((aligned(4096)));
>         struct xdp_umem_reg mr = {
>                 (unsigned long)buffer,
>                 0x20000,
>                 0x1000,
>                 0x7,
>          //&(0x7f0000000100)={&(0x7f0000000000)=""/210, 0x20000, 0x1000, 0x7}
>         };
>         int r0;
>         r0 = socket(AF_XDP, SOCK_RAW, 0);
>         if (r0 < 0)
>                 err(1, "socket");
>         if (setsockopt(r0, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) < 0)
>                 err(1, "setsockopt");
>         close(r0);
> }
> 
> int main()
> {
>         int a;
>         while (1) {
>                 for (a = 0; a < 40; a++)
>                         if (!fork()) {
>                                 fun();
>                                 exit(0);
>                         }
>                 for (a = 0; a < 100; a++)
>                         wait(NULL);
>         }
> }
> 
> 
> thanks,
> 

Nice catch, Jiri! Thank you!

It turns out that the whole task/pid dance is useless. It was a
left-over from the first AF_XDP RFC when we did per task accounting,
instead of per user accounting.

I will do some testing with the patch below, and then submit it as a
proper patch.


Cheers,
Björn

--

From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= <bjorn.topel@intel.com>
Date: Wed, 13 Mar 2019 12:00:51 +0100
Subject: [PATCH] xsk: fix umem memory leak on cleanup
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When the umem is cleaned up, the task that created it might already be
gone. If the task was gone, the xdp_umem_release function did not free
the pages member of struct xdp_umem.

It turned out that the task lookup was not needed at all; the code was
a left-over from when we moved from task accounting to user accounting [1].

This patch fixes the memory leak by removing the task lookup logic
completely.

[1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/

Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/
Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
  include/net/xdp_sock.h |  1 -
  net/xdp/xdp_umem.c     | 19 +------------------
  2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 61cf7dbb6782..d074b6d60f8a 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -36,7 +36,6 @@ struct xdp_umem {
  	u32 headroom;
  	u32 chunk_size_nohr;
  	struct user_struct *user;
-	struct pid *pid;
  	unsigned long address;
  	refcount_t users;
  	struct work_struct work;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 77520eacee8f..989e52386c35 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -193,9 +193,6 @@ static void xdp_umem_unaccount_pages(struct xdp_umem *umem)

  static void xdp_umem_release(struct xdp_umem *umem)
  {
-	struct task_struct *task;
-	struct mm_struct *mm;
-
  	xdp_umem_clear_dev(umem);

  	ida_simple_remove(&umem_ida, umem->id);
@@ -214,21 +211,10 @@ static void xdp_umem_release(struct xdp_umem *umem)

  	xdp_umem_unpin_pages(umem);

-	task = get_pid_task(umem->pid, PIDTYPE_PID);
-	put_pid(umem->pid);
-	if (!task)
-		goto out;
-	mm = get_task_mm(task);
-	put_task_struct(task);
-	if (!mm)
-		goto out;
-
-	mmput(mm);
  	kfree(umem->pages);
  	umem->pages = NULL;

  	xdp_umem_unaccount_pages(umem);
-out:
  	kfree(umem);
  }

@@ -357,7 +343,6 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
  	if (size_chk < 0)
  		return -EINVAL;

-	umem->pid = get_task_pid(current, PIDTYPE_PID);
  	umem->address = (unsigned long)addr;
  	umem->chunk_mask = ~((u64)chunk_size - 1);
  	umem->size = size;
@@ -373,7 +358,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)

  	err = xdp_umem_account_pages(umem);
  	if (err)
-		goto out;
+		return err;

  	err = xdp_umem_pin_pages(umem);
  	if (err)
@@ -392,8 +377,6 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)

  out_account:
  	xdp_umem_unaccount_pages(umem);
-out:
-	put_pid(umem->pid);
  	return err;
  }

-- 
2.19.1





^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-03-13 11:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-04 12:05 [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 01/11] xsk: moved struct xdp_umem definition Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 02/11] xsk: introduce xdp_umem_page Björn Töpel
2019-03-13  9:39   ` [bpf-next,02/11] " Jiri Slaby
2019-03-13 11:23     ` Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 03/11] net: xdp: added bpf_netdev_command XDP_{QUERY,SETUP}_XSK_UMEM Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 04/11] xdp: add MEM_TYPE_ZERO_COPY Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 05/11] xsk: add zero-copy support for Rx Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 06/11] net: added netdevice operation for Tx Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 07/11] xsk: wire upp Tx zero-copy functions Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 08/11] i40e: added queue pair disable/enable functions Björn Töpel
2018-06-04 12:05 ` [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx Björn Töpel
2018-06-04 20:35   ` Alexander Duyck
2018-06-07  7:40     ` Björn Töpel
2018-06-04 12:06 ` [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx Björn Töpel
2018-06-04 20:53   ` Alexander Duyck
2018-06-05 12:43   ` Jesper Dangaard Brouer
2018-06-05 13:07     ` Björn Töpel
2018-06-04 12:06 ` [PATCH bpf-next 11/11] samples/bpf: xdpsock: use skb Tx path for XDP_SKB Björn Töpel
2018-06-04 16:38 ` [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support Alexei Starovoitov
2018-06-04 20:29   ` [Intel-wired-lan] " Jeff Kirsher
2018-11-14  8:10 ` af_xdp zero copy ideas Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).