All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
@ 2018-08-28 12:44 Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY Björn Töpel
                   ` (14 more replies)
  0 siblings, 15 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch set introduces zero-copy AF_XDP support for Intel's i40e
driver. In the first preparatory patch we also add support for
XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
redirect them. This was a ToDo from the first AF_XDP zero-copy patch
set from early June. Special thanks to Alex Duyck and Jesper Dangaard
Brouer for reviewing earlier versions of this patch set.

The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
that in the interest of time, to get an AF_XDP zero-copy implementation
out there for people to try, some code paths have been copied from the
XDP path to the zero-copy path. It is out goal to merge the two paths
in later patch sets.

In contrast to the implementation from beginning of June, this patch
set does not require any extra HW queues for AF_XDP zero-copy
TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
AF_XDP zero-copy TX.

Jeff, given that most of changes are in i40e, it is up to you how you
would like to route these patches. The set is tagged bpf-next, but
if taking it via the Intel driver tree is easier, let us know.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for TR/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputing packets at full 40 Gbit/s line rate. The results are with
retpoline and all other spectre and meltdown fixes, so these results
are not comparable to the ones from the zero-copy patch set in June.

AF_XDP performance 64 byte packets.
Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
rxdrop       2.6        8.2         15.0
txpush       2.2        -           21.9
l2fwd        1.7        2.3         11.3

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
rxdrop       2.0        3.3         3.3
l2fwd        1.3        1.7         3.1

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      18.4M  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3M    0

The structure of the patch set is as follows:

Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
Patches 2-4: Preparatory patches to common xsk and net code
Patches 5-7: Preparatory patches to i40e driver code for RX
Patch 8: i40e zero-copy support for RX
Patch 9: Preparatory patch to i40e driver code for TX
Patch 10: i40e zero-copy support for TX
Patch 11: Add flags to sample application to force zero-copy/copy mode

We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")


Magnus & Björn

Björn Töpel (8):
  xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  xdp: export xdp_rxq_info_unreg_mem_model
  xsk: expose xdp_umem_get_{data,dma} to drivers
  i40e: added queue pair disable/enable functions
  i40e: refactor Rx path for re-use
  i40e: move common Rx functions to i40e_txrx_common.h
  i40e: add AF_XDP zero-copy Rx support
  samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock

Magnus Karlsson (3):
  net: add napi_if_scheduled_mark_missed
  i40e: move common Tx functions to i40e_txrx_common.h
  i40e: add AF_XDP zero-copy Tx support

 drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h        |  19 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   | 307 ++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 182 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  20 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h    |  90 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 834 ++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h    |  25 +
 include/linux/netdevice.h                     |  26 +
 include/net/xdp.h                             |   6 +-
 include/net/xdp_sock.h                        |  43 +
 net/core/xdp.c                                |  54 +-
 net/xdp/xdp_umem.h                            |  10 -
 samples/bpf/xdpsock_user.c                    |  12 +-
 14 files changed, 1523 insertions(+), 108 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

-- 
2.17.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 14:11   ` Jesper Dangaard Brouer
  2018-08-29 18:06   ` [bpf-next, " Maciek Fijalkowski
  2018-08-28 12:44 ` [PATCH bpf-next 02/11] xdp: export xdp_rxq_info_unreg_mem_model Björn Töpel
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit adds proper MEM_TYPE_ZERO_COPY support for
convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
make sense to implement a more sophisticated thread-safe alloc/free
scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
required in the fast-path.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h |  5 +++--
 net/core/xdp.c    | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 76b95256c266..0d5c6fb4b2e2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
 	frame->dev_rx = NULL;
 }
 
+struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+
 /* Convert xdp_buff to xdp_frame */
 static inline
 struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
@@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	int metasize;
 	int headroom;
 
-	/* TODO: implement clone, copy, use "native" MEM_TYPE */
 	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
-		return NULL;
+		return xdp_convert_zc_to_xdp_frame(xdp);
 
 	/* Assure headroom is available for storing info */
 	headroom = xdp->data - xdp->data_hard_start;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 89b6785cef2a..be6cb2f0e722 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 	info->flags = bpf->flags;
 }
 EXPORT_SYMBOL_GPL(xdp_attachment_setup);
+
+struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
+{
+	unsigned int metasize, headroom, totsize;
+	void *addr, *data_to_copy;
+	struct xdp_frame *xdpf;
+	struct page *page;
+
+	/* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
+	metasize = xdp_data_meta_unsupported(xdp) ? 0 :
+		   xdp->data - xdp->data_meta;
+	headroom = xdp->data - xdp->data_hard_start;
+	totsize = xdp->data_end - xdp->data + metasize;
+
+	if (sizeof(*xdpf) + totsize > PAGE_SIZE)
+		return NULL;
+
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+
+	addr = page_to_virt(page);
+	xdpf = addr;
+	memset(xdpf, 0, sizeof(*xdpf));
+
+	addr += sizeof(*xdpf);
+	data_to_copy = metasize ? xdp->data_meta : xdp->data;
+	memcpy(addr, data_to_copy, totsize);
+
+	xdpf->data = addr + metasize;
+	xdpf->len = totsize - metasize;
+	xdpf->headroom = 0;
+	xdpf->metasize = metasize;
+	xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+
+	xdp_return_buff(xdp);
+	return xdpf;
+}
+EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 02/11] xdp: export xdp_rxq_info_unreg_mem_model
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 03/11] xsk: expose xdp_umem_get_{data,dma} to drivers Björn Töpel
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Export __xdp_rxq_info_unreg_mem_model as xdp_rxq_info_unreg_mem_model,
so it can be used from netdev drivers. Also, add additional checks for
the memory type.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h |  1 +
 net/core/xdp.c    | 15 +++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0d5c6fb4b2e2..0f25b3675c5c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -136,6 +136,7 @@ void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum xdp_mem_type type, void *allocator);
+void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq);
 
 /* Drivers not supporting XDP metadata can use this helper, which
  * rejects any room expansion for metadata as a result.
diff --git a/net/core/xdp.c b/net/core/xdp.c
index be6cb2f0e722..654dbb19707e 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -94,11 +94,21 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	kfree(xa);
 }
 
-static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
 {
 	struct xdp_mem_allocator *xa;
 	int id = xdp_rxq->mem.id;
 
+	if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+		WARN(1, "Missing register, driver bug");
+		return;
+	}
+
+	if (xdp_rxq->mem.type != MEM_TYPE_PAGE_POOL &&
+	    xdp_rxq->mem.type != MEM_TYPE_ZERO_COPY) {
+		return;
+	}
+
 	if (id == 0)
 		return;
 
@@ -110,6 +120,7 @@ static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
 
 	mutex_unlock(&mem_id_lock);
 }
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg_mem_model);
 
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
@@ -119,7 +130,7 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
-	__xdp_rxq_info_unreg_mem_model(xdp_rxq);
+	xdp_rxq_info_unreg_mem_model(xdp_rxq);
 
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
 	xdp_rxq->dev = NULL;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 03/11] xsk: expose xdp_umem_get_{data,dma} to drivers
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 02/11] xdp: export xdp_rxq_info_unreg_mem_model Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 04/11] net: add napi_if_scheduled_mark_missed Björn Töpel
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Move the xdp_umem_get_{data,dma} functions to include/net/xdp_sock.h,
so that the upcoming zero-copy implementation in the Ethernet drivers
can utilize them.

Also, supply some dummy function implementations for
CONFIG_XDP_SOCKETS=n configs.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h | 43 ++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h     | 10 ----------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 7161856bcf9c..56994ad1ab40 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -79,6 +79,16 @@ void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
 bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
+
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
+}
+
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+}
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -98,6 +108,39 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
 	return false;
 }
+
+static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
+{
+	return NULL;
+}
+
+static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
+{
+}
+
+static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+}
+
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+				       u32 *len)
+{
+	return false;
+}
+
+static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
+{
+}
+
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
+{
+	return NULL;
+}
+
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+	return 0;
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index f11560334f88..c8be1ad3eb88 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -8,16 +8,6 @@
 
 #include <net/xdp_sock.h>
 
-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
-{
-	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
-}
-
-static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
-{
-	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
-}
-
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 			u32 queue_id, u16 flags);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 04/11] net: add napi_if_scheduled_mark_missed
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (2 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 03/11] xsk: expose xdp_umem_get_{data,dma} to drivers Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 05/11] i40e: added queue pair disable/enable functions Björn Töpel
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

The function napi_if_scheduled_mark_missed is used to check if the
NAPI context is scheduled, if so set NAPIF_STATE_MISSED and return
true. Used by the AF_XDP zero-copy i40e Tx code implementation in
order to make sure that irq affinity is honored by the napi context.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..4271f6b4e419 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -535,6 +535,32 @@ static inline void napi_synchronize(const struct napi_struct *n)
 		barrier();
 }
 
+/**
+ *	napi_if_scheduled_mark_missed - if napi is running, set the
+ *	NAPIF_STATE_MISSED
+ *	@n: NAPI context
+ *
+ * If napi is running, set the NAPIF_STATE_MISSED, and return true if
+ * NAPI is scheduled.
+ **/
+static inline bool napi_if_scheduled_mark_missed(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
 enum netdev_queue_state_t {
 	__QUEUE_STATE_DRV_XOFF,
 	__QUEUE_STATE_STACK_XOFF,
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 05/11] i40e: added queue pair disable/enable functions
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (3 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 04/11] net: add napi_if_scheduled_mark_missed Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 06/11] i40e: refactor Rx path for re-use Björn Töpel
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Add functions for queue pair enable/disable. Instead of resetting the
whole device, only the affected queue pair is disabled or enabled.

This plumbing is used in a later commit, when zero-copy AF_XDP support
is introduced.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 250 ++++++++++++++++++++
 1 file changed, 250 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ac685ad4d877..d8b5a6af72bd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11827,6 +11827,256 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_toggle_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_toggle_napi(struct i40e_vsi *vsi, int queue_pair,
+					bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_toggle_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_toggle_rings(struct i40e_vsi *vsi, int queue_pair,
+					bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_toggle_rings(vsi, queue_pair, false /* off */);
+	i40e_queue_pair_toggle_napi(vsi, queue_pair, false /* off */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_toggle_rings(vsi, queue_pair, true /* on */);
+	i40e_queue_pair_toggle_napi(vsi, queue_pair, true /* on */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 06/11] i40e: refactor Rx path for re-use
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (4 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 05/11] i40e: added queue pair disable/enable functions Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 07/11] i40e: move common Rx functions to i40e_txrx_common.h Björn Töpel
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

In this commit, the Rx path is refactored some, as a step torwards the
introduction AF_XDP Rx zero-copy.

The page re-use counter is moved into the i40e_reuse_rx_page, instead
of bumping the counter in many places. The Rx buffer page clearing is
moved for better readability. Lastely, functions to update statistics
and bump the XDP Tx ring are introduced.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 111 ++++++++++++++------
 1 file changed, 77 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b5042d1a63c0..b5a2cfeb68a5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1244,6 +1244,11 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
 	new_buff->page		= old_buff->page;
 	new_buff->page_offset	= old_buff->page_offset;
 	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
+
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	old_buff->page = NULL;
 }
 
 /**
@@ -1266,7 +1271,7 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
 }
 
 /**
- * i40e_clean_programming_status - clean the programming status descriptor
+ * i40e_clean_programming_status - try clean the programming status descriptor
  * @rx_ring: the rx ring that has this descriptor
  * @rx_desc: the rx descriptor written back by HW
  * @qw: qword representing status_error_len in CPU ordering
@@ -1275,15 +1280,22 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
  * status being successful or not and take actions accordingly. FCoE should
  * handle its context/filter programming/invalidation status and take actions.
  *
+ * Returns an i40e_rx_buffer to reuse if the cleanup occurred, otherwise NULL.
  **/
-static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
-					  union i40e_rx_desc *rx_desc,
-					  u64 qw)
+static struct i40e_rx_buffer *i40e_clean_programming_status(
+	struct i40e_ring *rx_ring,
+	union i40e_rx_desc *rx_desc,
+	u64 qw)
 {
 	struct i40e_rx_buffer *rx_buffer;
-	u32 ntc = rx_ring->next_to_clean;
+	u32 ntc;
 	u8 id;
 
+	if (!i40e_rx_is_programming_status(qw))
+		return NULL;
+
+	ntc = rx_ring->next_to_clean;
+
 	/* fetch, update, and store next to clean */
 	rx_buffer = &rx_ring->rx_bi[ntc++];
 	ntc = (ntc < rx_ring->count) ? ntc : 0;
@@ -1291,18 +1303,13 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
 
 	prefetch(I40E_RX_DESC(rx_ring, ntc));
 
-	/* place unused page back on the ring */
-	i40e_reuse_rx_page(rx_ring, rx_buffer);
-	rx_ring->rx_stats.page_reuse_count++;
-
-	/* clear contents of buffer_info */
-	rx_buffer->page = NULL;
-
 	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
 		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
 
 	if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
 		i40e_fd_handle_status(rx_ring, rx_desc, id);
+
+	return rx_buffer;
 }
 
 /**
@@ -2152,7 +2159,6 @@ static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
 	if (i40e_can_reuse_rx_page(rx_buffer)) {
 		/* hand second half of page back to the ring */
 		i40e_reuse_rx_page(rx_ring, rx_buffer);
-		rx_ring->rx_stats.page_reuse_count++;
 	} else {
 		/* we are not reusing the buffer so unmap it */
 		dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
@@ -2160,10 +2166,9 @@ static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
 				     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
 		__page_frag_cache_drain(rx_buffer->page,
 					rx_buffer->pagecnt_bias);
+		/* clear contents of buffer_info */
+		rx_buffer->page = NULL;
 	}
-
-	/* clear contents of buffer_info */
-	rx_buffer->page = NULL;
 }
 
 /**
@@ -2287,6 +2292,12 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
+/**
+ * i40e_xdp_ring_update_tail - Updates the XDP Tx ring tail register
+ * @xdp_ring: XDP Tx ring
+ *
+ * This function updates the XDP Tx ring tail register.
+ **/
 static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 {
 	/* Force memory writes to complete before letting h/w
@@ -2296,6 +2307,49 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
 }
 
+/**
+ * i40e_update_rx_stats - Update Rx ring statistics
+ * @rx_ring: rx descriptor ring
+ * @total_rx_bytes: number of bytes received
+ * @total_rx_packets: number of packets received
+ *
+ * This function updates the Rx ring statistics.
+ **/
+static void i40e_update_rx_stats(struct i40e_ring *rx_ring,
+				 unsigned int total_rx_bytes,
+				 unsigned int total_rx_packets)
+{
+	u64_stats_update_begin(&rx_ring->syncp);
+	rx_ring->stats.packets += total_rx_packets;
+	rx_ring->stats.bytes += total_rx_bytes;
+	u64_stats_update_end(&rx_ring->syncp);
+	rx_ring->q_vector->rx.total_packets += total_rx_packets;
+	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+}
+
+/**
+ * i40e_finalize_xdp_rx - Bump XDP Tx tail and/or flush redirect map
+ * @rx_ring: Rx ring
+ * @xdp_res: Result of the receive batch
+ *
+ * This function bumps XDP Tx tail and/or flush redirect map, and
+ * should be called when a batch of packets has been processed in the
+ * napi loop.
+ **/
+static void i40e_finalize_xdp_rx(struct i40e_ring *rx_ring,
+				 unsigned int xdp_res)
+{
+	if (xdp_res & I40E_XDP_REDIR)
+		xdp_do_flush_map();
+
+	if (xdp_res & I40E_XDP_TX) {
+		struct i40e_ring *xdp_ring =
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+
+		i40e_xdp_ring_update_tail(xdp_ring);
+	}
+}
+
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2349,11 +2403,14 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		 */
 		dma_rmb();
 
-		if (unlikely(i40e_rx_is_programming_status(qword))) {
-			i40e_clean_programming_status(rx_ring, rx_desc, qword);
+		rx_buffer = i40e_clean_programming_status(rx_ring, rx_desc,
+							  qword);
+		if (unlikely(rx_buffer)) {
+			i40e_reuse_rx_page(rx_ring, rx_buffer);
 			cleaned_count++;
 			continue;
 		}
+
 		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
 		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
 		if (!size)
@@ -2432,24 +2489,10 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		total_rx_packets++;
 	}
 
-	if (xdp_xmit & I40E_XDP_REDIR)
-		xdp_do_flush_map();
-
-	if (xdp_xmit & I40E_XDP_TX) {
-		struct i40e_ring *xdp_ring =
-			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-
-		i40e_xdp_ring_update_tail(xdp_ring);
-	}
-
+	i40e_finalize_xdp_rx(rx_ring, xdp_xmit);
 	rx_ring->skb = skb;
 
-	u64_stats_update_begin(&rx_ring->syncp);
-	rx_ring->stats.packets += total_rx_packets;
-	rx_ring->stats.bytes += total_rx_bytes;
-	u64_stats_update_end(&rx_ring->syncp);
-	rx_ring->q_vector->rx.total_packets += total_rx_packets;
-	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+	i40e_update_rx_stats(rx_ring, total_rx_bytes, total_rx_packets);
 
 	/* guarantee a trip back through this routine if there was a failure */
 	return failure ? budget : (int)total_rx_packets;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 07/11] i40e: move common Rx functions to i40e_txrx_common.h
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (5 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 06/11] i40e: refactor Rx path for re-use Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support Björn Töpel
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch prepares for the upcoming zero-copy Rx functionality, by
moving/changing linkage of common functions, used both by the regular
path and zero-copy path.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 33 ++++++++-----------
 .../ethernet/intel/i40e/i40e_txrx_common.h    | 31 +++++++++++++++++
 2 files changed, 44 insertions(+), 20 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b5a2cfeb68a5..878fb4b47484 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -8,6 +8,7 @@
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
+#include "i40e_txrx_common.h"
 
 static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
 				u32 td_tag)
@@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
  * This is used to verify if the FD programming or invalidation
  * requested by SW to the HW is successful or not and take actions accordingly.
  **/
-static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
-				  union i40e_rx_desc *rx_desc, u8 prog_id)
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+			   union i40e_rx_desc *rx_desc, u8 prog_id)
 {
 	struct i40e_pf *pf = rx_ring->vsi->back;
 	struct pci_dev *pdev = pf->pdev;
@@ -1282,7 +1283,7 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
  *
  * Returns an i40e_rx_buffer to reuse if the cleanup occurred, otherwise NULL.
  **/
-static struct i40e_rx_buffer *i40e_clean_programming_status(
+struct i40e_rx_buffer *i40e_clean_programming_status(
 	struct i40e_ring *rx_ring,
 	union i40e_rx_desc *rx_desc,
 	u64 qw)
@@ -1499,7 +1500,7 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
  * @rx_ring: ring to bump
  * @val: new head index
  **/
-static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
+void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 {
 	rx_ring->next_to_use = val;
 
@@ -1583,8 +1584,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
  * @skb: packet to send up
  * @vlan_tag: vlan tag for packet
  **/
-static void i40e_receive_skb(struct i40e_ring *rx_ring,
-			     struct sk_buff *skb, u16 vlan_tag)
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+		      struct sk_buff *skb, u16 vlan_tag)
 {
 	struct i40e_q_vector *q_vector = rx_ring->q_vector;
 
@@ -1811,7 +1812,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
  * order to populate the hash, checksum, VLAN, protocol, and
  * other fields within the skb.
  **/
-static inline
 void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 			     union i40e_rx_desc *rx_desc, struct sk_buff *skb,
 			     u8 rx_ptype)
@@ -2204,16 +2204,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 	return true;
 }
 
-#define I40E_XDP_PASS		0
-#define I40E_XDP_CONSUMED	BIT(0)
-#define I40E_XDP_TX		BIT(1)
-#define I40E_XDP_REDIR		BIT(2)
-
 static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring);
 
-static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
-				 struct i40e_ring *xdp_ring)
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)
 {
 	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
 
@@ -2298,7 +2292,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
  *
  * This function updates the XDP Tx ring tail register.
  **/
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 {
 	/* Force memory writes to complete before letting h/w
 	 * know there are new descriptors to fetch.
@@ -2315,9 +2309,9 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * This function updates the Rx ring statistics.
  **/
-static void i40e_update_rx_stats(struct i40e_ring *rx_ring,
-				 unsigned int total_rx_bytes,
-				 unsigned int total_rx_packets)
+void i40e_update_rx_stats(struct i40e_ring *rx_ring,
+			  unsigned int total_rx_bytes,
+			  unsigned int total_rx_packets)
 {
 	u64_stats_update_begin(&rx_ring->syncp);
 	rx_ring->stats.packets += total_rx_packets;
@@ -2336,8 +2330,7 @@ static void i40e_update_rx_stats(struct i40e_ring *rx_ring,
  * should be called when a batch of packets has been processed in the
  * napi loop.
  **/
-static void i40e_finalize_xdp_rx(struct i40e_ring *rx_ring,
-				 unsigned int xdp_res)
+void i40e_finalize_xdp_rx(struct i40e_ring *rx_ring, unsigned int xdp_res)
 {
 	if (xdp_res & I40E_XDP_REDIR)
 		xdp_do_flush_map();
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
new file mode 100644
index 000000000000..2bd5187fcd66
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2018 Intel Corporation. */
+
+#ifndef I40E_TXRX_COMMON_
+#define I40E_TXRX_COMMON_
+
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+			   union i40e_rx_desc *rx_desc, u8 prog_id);
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring);
+struct i40e_rx_buffer *i40e_clean_programming_status(
+	struct i40e_ring *rx_ring,
+	union i40e_rx_desc *rx_desc,
+	u64 qw);
+void i40e_process_skb_fields(struct i40e_ring *rx_ring,
+			     union i40e_rx_desc *rx_desc, struct sk_buff *skb,
+			     u8 rx_ptype);
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+		      struct sk_buff *skb, u16 vlan_tag);
+void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring);
+void i40e_update_rx_stats(struct i40e_ring *rx_ring,
+			  unsigned int total_rx_bytes,
+			  unsigned int total_rx_packets);
+void i40e_finalize_xdp_rx(struct i40e_ring *rx_ring, unsigned int xdp_res);
+void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val);
+
+#define I40E_XDP_PASS		0
+#define I40E_XDP_CONSUMED	BIT(0)
+#define I40E_XDP_TX		BIT(1)
+#define I40E_XDP_REDIR		BIT(2)
+
+#endif /* I40E_TXRX_COMMON_ */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (6 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 07/11] i40e: move common Rx functions to i40e_txrx_common.h Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-29 19:14   ` Jakub Kicinski
  2018-08-29 19:22   ` Alexei Starovoitov
  2018-08-28 12:44 ` [PATCH bpf-next 09/11] i40e: move common Tx functions to i40e_txrx_common.h Björn Töpel
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
queue.

All AF_XDP specific functions are added to a new file, i40e_xsk.c.

Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
will allocate a new buffer and copy the zero-copy frame prior passing
it to the kernel stack.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/Makefile    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h      |  19 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |  53 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  20 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 661 ++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  21 +
 7 files changed, 775 insertions(+), 11 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 14397e7e9925..50590e8d1fd1 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
 	i40e_txrx.o	\
 	i40e_ptp.o	\
 	i40e_client.o   \
-	i40e_virtchnl_pf.o
+	i40e_virtchnl_pf.o \
+	i40e_xsk.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..876cac317e79 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,11 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	/* AF_XDP zero-copy */
+	struct xdp_umem **xsk_umems;
+	u16 num_xsk_umems_used;
+	u16 num_xsk_umems;
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1095,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 	return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+	int qid = ring->queue_index;
+
+	if (ring_is_xdp(ring))
+		qid -= ring->vsi->alloc_queue_pairs;
+
+	if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+		return NULL;
+
+	return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d8b5a6af72bd..848eea7c84db 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9,7 +9,9 @@
 /* Local includes */
 #include "i40e.h"
 #include "i40e_diag.h"
+#include "i40e_xsk.h"
 #include <net/udp_tunnel.h>
+#include <net/xdp_sock.h>
 /* All i40e tracepoints are defined by the include below, which
  * must be included exactly once across the whole kernel with
  * CREATE_TRACE_POINTS defined
@@ -3181,13 +3183,46 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	i40e_status err = 0;
+	bool ok;
+	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	if (ring->vsi->type == I40E_VSI_MAIN)
+		xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
+
+	ring->xsk_umem = i40e_xsk_umem(ring);
+	if (ring->xsk_umem) {
+		ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
+				   XDP_PACKET_HEADROOM;
+		/* For AF_XDP ZC, we disallow packets to span on
+		 * multiple buffers, thus letting us skip that
+		 * handling in the fast-path.
+		 */
+		chain_len = 1;
+		ring->zca.free = i40e_zca_free;
+		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_ZERO_COPY,
+						 &ring->zca);
+		if (ret)
+			return ret;
+		dev_info(&vsi->back->pdev->dev,
+			 "Registered XDP mem model MEM_TYPE_ZERO_COPY on Rx ring %d\n",
+			 ring->queue_index);
+
+	} else {
+		ring->rx_buf_len = vsi->rx_buf_len;
+		if (ring->vsi->type == I40E_VSI_MAIN) {
+			ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+							 MEM_TYPE_PAGE_SHARED,
+							 NULL);
+			if (ret)
+				return ret;
+		}
+	}
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3243,7 +3278,15 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ok = ring->xsk_umem ?
+	     i40e_alloc_rx_buffers_zc(ring, I40E_DESC_UNUSED(ring)) :
+	     !i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	if (!ok) {
+		dev_info(&vsi->back->pdev->dev,
+			 "Failed allocate some buffers on %sRx ring %d (pf_q %d)\n",
+			 ring->xsk_umem ? "UMEM enabled " : "",
+			 ring->queue_index, pf_q);
+	}
 
 	return 0;
 }
@@ -12097,6 +12140,12 @@ static int i40e_xdp(struct net_device *dev,
 	case XDP_QUERY_PROG:
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_QUERY_XSK_UMEM:
+		return i40e_xsk_umem_query(vsi, &xdp->xsk.umem,
+					   xdp->xsk.queue_id);
+	case XDP_SETUP_XSK_UMEM:
+		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 878fb4b47484..2c4d179ffebf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -9,6 +9,7 @@
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
 #include "i40e_txrx_common.h"
+#include "i40e_xsk.h"
 
 static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
 				u32 td_tag)
@@ -1380,6 +1381,9 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 		rx_ring->skb = NULL;
 	}
 
+	if (rx_ring->xsk_umem)
+		goto skip_free;
+
 	/* Free all the Rx ring sk_buffs */
 	for (i = 0; i < rx_ring->count; i++) {
 		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
@@ -1408,6 +1412,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 		rx_bi->page_offset = 0;
 	}
 
+skip_free:
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
 	memset(rx_ring->rx_bi, 0, bi_size);
 
@@ -2641,7 +2646,9 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->xsk_umem ?
+			      i40e_clean_rx_irq_zc(ring, budget_per_ring) :
+			      i40e_clean_rx_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index bb04f6a731fe..100e92d2982f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -296,13 +296,17 @@ struct i40e_tx_buffer {
 
 struct i40e_rx_buffer {
 	dma_addr_t dma;
-	struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
-#else
-	__u16 page_offset;
-#endif
-	__u16 pagecnt_bias;
+	union {
+		struct {
+			struct page *page;
+			__u32 page_offset;
+			__u16 pagecnt_bias;
+		};
+		struct {
+			void *addr;
+			u64 handle;
+		};
+	};
 };
 
 struct i40e_queue_stats {
@@ -414,6 +418,8 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+	struct xdp_umem *xsk_umem;
+	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
new file mode 100644
index 000000000000..bf502f2307c2
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -0,0 +1,661 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Intel Corporation. */
+
+#include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
+#include <net/xdp.h>
+
+#include "i40e.h"
+#include "i40e_txrx_common.h"
+#include "i40e_xsk.h"
+
+/**
+ * i40e_alloc_xsk_umems - Allocate an array to store per ring UMEMs
+ * @vsi: Current VSI
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_umems)
+		return 0;
+
+	vsi->num_xsk_umems_used = 0;
+	vsi->num_xsk_umems = vsi->alloc_queue_pairs;
+	vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
+				 GFP_KERNEL);
+	if (!vsi->xsk_umems) {
+		vsi->num_xsk_umems = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_add_xsk_umem - Store an UMEM for a certain ring/qid
+ * @vsi: Current VSI
+ * @umem: UMEM to store
+ * @qid: Ring/qid to associate with the UMEM
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			     u16 qid)
+{
+	int err;
+
+	err = i40e_alloc_xsk_umems(vsi);
+	if (err)
+		return err;
+
+	vsi->xsk_umems[qid] = umem;
+	vsi->num_xsk_umems_used++;
+
+	return 0;
+}
+
+/**
+ * i40e_remove_xsk_umem - Remove an UMEM for a certain ring/qid
+ * @vsi: Current VSI
+ * @qid: Ring/qid associated with the UMEM
+ **/
+static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
+{
+	vsi->xsk_umems[qid] = NULL;
+	vsi->num_xsk_umems_used--;
+
+	if (vsi->num_xsk_umems == 0) {
+		kfree(vsi->xsk_umems);
+		vsi->xsk_umems = NULL;
+		vsi->num_xsk_umems = 0;
+	}
+}
+
+/**
+ * i40e_xsk_umem_dma_map - DMA maps all UMEM memory for the netdev
+ * @vsi: Current VSI
+ * @umem: UMEM to DMA map
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	dev = &pf->pdev->dev;
+	for (i = 0; i < umem->npgs; i++) {
+		dma = dma_map_page_attrs(dev, umem->pgs[i], 0, PAGE_SIZE,
+					 DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		umem->pages[i].dma = dma;
+	}
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		umem->pages[i].dma = 0;
+	}
+
+	return -1;
+}
+
+/**
+ * i40e_xsk_umem_dma_unmap - DMA unmaps all UMEM memory for the netdev
+ * @vsi: Current VSI
+ * @umem: UMEM to DMA map
+ **/
+static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_unmap_page_attrs(dev, umem->pages[i].dma, PAGE_SIZE,
+				     DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+
+		umem->pages[i].dma = 0;
+	}
+}
+
+/**
+ * i40e_xsk_umem_enable - Enable/associate an UMEM to a certain ring/qid
+ * @vsi: Current VSI
+ * @umem: UMEM
+ * @qid: Rx ring to associate UMEM to
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
+				u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems) {
+		if (qid >= vsi->num_xsk_umems)
+			return -EINVAL;
+		if (vsi->xsk_umems[qid])
+			return -EBUSY;
+	}
+
+	err = i40e_xsk_umem_dma_map(vsi, umem);
+	if (err)
+		return err;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_umem(vsi, umem, qid);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_xsk_umem_disable - Diassociate an UMEM from a certain ring/qid
+ * @vsi: Current VSI
+ * @qid: Rx ring to associate UMEM to
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
+	    !vsi->xsk_umems[qid])
+		return -EINVAL;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
+	i40e_remove_xsk_umem(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_xsk_umem_query - Queries a certain ring/qid for its UMEM
+ * @vsi: Current VSI
+ * @umem: UMEM associated to the ring, if any
+ * @qid: Rx ring to associate UMEM to
+ *
+ * This function will store, if any, the UMEM associated to certain ring.
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+int i40e_xsk_umem_query(struct i40e_vsi *vsi, struct xdp_umem **umem,
+			u16 qid)
+{
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems) {
+		if (qid >= vsi->num_xsk_umems)
+			return -EINVAL;
+		*umem = vsi->xsk_umems[qid];
+		return 0;
+	}
+
+	*umem = NULL;
+	return 0;
+}
+
+/**
+ * i40e_xsk_umem_query - Queries a certain ring/qid for its UMEM
+ * @vsi: Current VSI
+ * @umem: UMEM to enable/associate to a ring, or NULL to disable
+ * @qid: Rx ring to (dis)associate UMEM (from)to
+ *
+ * This function enables or disables an UMEM to a certain ring.
+ *
+ * Returns 0 on success, <0 on failure
+ **/
+int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			u16 qid)
+{
+	return umem ? i40e_xsk_umem_enable(vsi, umem, qid) :
+		i40e_xsk_umem_disable(vsi, qid);
+}
+
+/**
+ * i40e_run_xdp_zc - Executes an XDP program on an xdp_buff
+ * @rx_ring: Rx ring
+ * @xdp: xdp_buff used as input to the XDP program
+ *
+ * This function enables or disables an UMEM to a certain ring.
+ *
+ * Returns any of I40E_XDP_{PASS, CONSUMED, TX, REDIR}
+ **/
+static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
+{
+	int err, result = I40E_XDP_PASS;
+	struct i40e_ring *xdp_ring;
+	struct bpf_prog *xdp_prog;
+	u32 act;
+
+	rcu_read_lock();
+	/* NB! xdp_prog will always be !NULL, due to the fact that
+	 * this path is enabled by setting an XDP program.
+	 */
+	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	xdp->handle += xdp->data - xdp->data_hard_start;
+	switch (act) {
+	case XDP_PASS:
+		break;
+	case XDP_TX:
+		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		result = I40E_XDP_CONSUMED;
+		break;
+	}
+	rcu_read_unlock();
+	return result;
+}
+
+/**
+ * i40e_alloc_buffer_zc - Allocates an i40e_rx_buffer
+ * @rx_ring: Rx ring
+ * @bi: Rx buffer to populate
+ *
+ * This function allocates an Rx buffer. The buffer can come from fill
+ * queue, or via the recycle queue (next_to_alloc).
+ *
+ * Returns true for a successful allocation, false otherwise
+ **/
+static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
+				 struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	void *addr = bi->addr;
+	u64 handle, hr;
+
+	if (addr) {
+		rx_ring->rx_stats.page_reuse_count++;
+		return true;
+	}
+
+	if (!xsk_umem_peek_addr(umem, &handle)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	hr = umem->headroom + XDP_PACKET_HEADROOM;
+
+	bi->dma = xdp_umem_get_dma(umem, handle);
+	bi->dma += hr;
+
+	bi->addr = xdp_umem_get_data(umem, handle);
+	bi->addr += hr;
+
+	bi->handle = handle + umem->headroom;
+
+	xsk_umem_discard_addr(umem);
+	return true;
+}
+
+/**
+ * i40e_alloc_rx_buffers_zc - Allocates a number of Rx buffers
+ * @rx_ring: Rx ring
+ * @count: The number of buffers to allocate
+ *
+ * This function allocates a number of Rx buffers and places them on
+ * the Rx ring.
+ *
+ * Returns true for a successful allocation, false otherwise
+ **/
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
+{
+	u16 ntu = rx_ring->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct i40e_rx_buffer *bi;
+	bool ok = true;
+
+	rx_desc = I40E_RX_DESC(rx_ring, ntu);
+	bi = &rx_ring->rx_bi[ntu];
+	do {
+		if (!i40e_alloc_buffer_zc(rx_ring, bi)) {
+			ok = false;
+			goto no_buffers;
+		}
+
+		dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
+						 rx_ring->rx_buf_len,
+						 DMA_BIDIRECTIONAL);
+
+		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
+
+		rx_desc++;
+		bi++;
+		ntu++;
+
+		if (unlikely(ntu == rx_ring->count)) {
+			rx_desc = I40E_RX_DESC(rx_ring, 0);
+			bi = rx_ring->rx_bi;
+			ntu = 0;
+		}
+
+		rx_desc->wb.qword1.status_error_len = 0;
+		count--;
+	} while (count);
+
+no_buffers:
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	return ok;
+}
+
+/**
+ * i40e_get_rx_buffer_zc - Return the current Rx buffer
+ * @rx_ring: Rx ring
+ * @size: The size of the rx buffer (read from descriptor)
+ *
+ * This function returns the current, received Rx buffer, and also
+ * does DMA synchronization.  the Rx ring.
+ *
+ * Returns the received Rx buffer
+ **/
+static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
+						    const unsigned int size)
+{
+	struct i40e_rx_buffer *bi;
+
+	bi = &rx_ring->rx_bi[rx_ring->next_to_clean];
+
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      bi->dma, 0,
+				      size,
+				      DMA_BIDIRECTIONAL);
+
+	return bi;
+}
+
+/**
+ * i40e_reuse_rx_buffer_zc - Recycle an Rx buffer
+ * @rx_ring: Rx ring
+ * @old_bi: The Rx buffer to recycle
+ *
+ * This function recycles a finished Rx buffer, and places it on the
+ * recycle queue (next_to_alloc).
+ **/
+static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
+				    struct i40e_rx_buffer *old_bi)
+{
+	struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
+	unsigned long mask = (unsigned long)rx_ring->xsk_umem->props.chunk_mask;
+	u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+	u16 nta = rx_ring->next_to_alloc;
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_bi->dma = old_bi->dma & mask;
+	new_bi->dma += hr;
+
+	new_bi->addr = (void *)((unsigned long)old_bi->addr & mask);
+	new_bi->addr += hr;
+
+	new_bi->handle = old_bi->handle & mask;
+	new_bi->handle += rx_ring->xsk_umem->headroom;
+
+	old_bi->addr = NULL;
+}
+
+/**
+ * i40e_zca_free - Free callback for MEM_TYPE_ZERO_COPY allocations
+ * @alloc: Zero-copy allocator
+ * @handle: Buffer handle
+ **/
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
+{
+	struct i40e_rx_buffer *bi;
+	struct i40e_ring *rx_ring;
+	u64 hr, mask;
+	u16 nta;
+
+	rx_ring = container_of(alloc, struct i40e_ring, zca);
+	hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+	mask = rx_ring->xsk_umem->props.chunk_mask;
+
+	nta = rx_ring->next_to_alloc;
+	bi = &rx_ring->rx_bi[nta];
+
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	handle &= mask;
+
+	bi->dma = xdp_umem_get_dma(rx_ring->xsk_umem, handle);
+	bi->dma += hr;
+
+	bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
+	bi->addr += hr;
+
+	bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+}
+
+/**
+ * i40e_construct_skb_zc - Create skbufff from zero-copy Rx buffer
+ * @rx_ring: Rx ring
+ * @bi: Rx buffer
+ * @xdp: xdp_buff
+ *
+ * This functions allocates a new skb from a zero-copy Rx buffer.
+ *
+ * Returns the skb, or NULL on failure.
+ **/
+static struct sk_buff *i40e_construct_skb_zc(struct i40e_ring *rx_ring,
+					     struct i40e_rx_buffer *bi,
+					     struct xdp_buff *xdp)
+{
+	unsigned int metasize = xdp->data - xdp->data_meta;
+	unsigned int datasize = xdp->data_end - xdp->data;
+	struct sk_buff *skb;
+
+	/* allocate a skb to store the frags */
+	skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
+			       xdp->data_end - xdp->data_hard_start,
+			       GFP_ATOMIC | __GFP_NOWARN);
+	if (unlikely(!skb))
+		return NULL;
+
+	skb_reserve(skb, xdp->data - xdp->data_hard_start);
+	memcpy(__skb_put(skb, datasize), xdp->data, datasize);
+	if (metasize)
+		skb_metadata_set(skb, metasize);
+
+	i40e_reuse_rx_buffer_zc(rx_ring, bi);
+	return skb;
+}
+
+/**
+ * i40e_inc_ntc: Advance the next_to_clean index
+ * @rx_ring: Rx ring
+ **/
+static void i40e_inc_ntc(struct i40e_ring *rx_ring)
+{
+	u32 ntc = rx_ring->next_to_clean + 1;
+
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+}
+
+/**
+ * i40e_clean_rx_irq_zc - Consumes Rx packets from the hardware ring
+ * @rx_ring: Rx ring
+ * @budget: NAPI budget
+ *
+ * Returns amount of work completed
+ **/
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
+{
+	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	unsigned int xdp_res, xdp_xmit = 0;
+	bool failure = false;
+	struct sk_buff *skb;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	while (likely(total_rx_packets < (unsigned int)budget)) {
+		struct i40e_rx_buffer *bi;
+		union i40e_rx_desc *rx_desc;
+		unsigned int size;
+		u16 vlan_tag;
+		u8 rx_ptype;
+		u64 qword;
+
+		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
+			failure = failure ||
+				  !i40e_alloc_rx_buffers_zc(rx_ring,
+							    cleaned_count);
+			cleaned_count = 0;
+		}
+
+		rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+
+		/* This memory barrier is needed to keep us from reading
+		 * any other fields out of the rx_desc until we have
+		 * verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		bi = i40e_clean_programming_status(rx_ring, rx_desc,
+						   qword);
+		if (unlikely(bi)) {
+			i40e_reuse_rx_buffer_zc(rx_ring, bi);
+			cleaned_count++;
+			continue;
+		}
+
+		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+		if (!size)
+			break;
+
+		bi = i40e_get_rx_buffer_zc(rx_ring, size);
+		xdp.data = bi->addr;
+		xdp.data_meta = xdp.data;
+		xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+		xdp.data_end = xdp.data + size;
+		xdp.handle = bi->handle;
+
+		xdp_res = i40e_run_xdp_zc(rx_ring, &xdp);
+		if (xdp_res) {
+			if (xdp_res & (I40E_XDP_TX | I40E_XDP_REDIR)) {
+				xdp_xmit |= xdp_res;
+				bi->addr = NULL;
+			} else {
+				i40e_reuse_rx_buffer_zc(rx_ring, bi);
+			}
+
+			total_rx_bytes += size;
+			total_rx_packets++;
+
+			cleaned_count++;
+			i40e_inc_ntc(rx_ring);
+			continue;
+		}
+
+		/* XDP_PASS path */
+
+		/* NB! We are not checking for errors using
+		 * i40e_test_staterr with
+		 * BIT(I40E_RXD_QW1_ERROR_SHIFT). This is due to that
+		 * SBP is *not* set in PRT_SBPVSI (default not set).
+		 */
+		skb = i40e_construct_skb_zc(rx_ring, bi, &xdp);
+		if (!skb) {
+			rx_ring->rx_stats.alloc_buff_failed++;
+			break;
+		}
+
+		cleaned_count++;
+		i40e_inc_ntc(rx_ring);
+
+		if (eth_skb_pad(skb))
+			continue;
+
+		total_rx_bytes += skb->len;
+		total_rx_packets++;
+
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+		rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+			   I40E_RXD_QW1_PTYPE_SHIFT;
+		i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+		vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+			   le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+		i40e_receive_skb(rx_ring, skb, vlan_tag);
+	}
+
+	i40e_finalize_xdp_rx(rx_ring, xdp_xmit);
+	i40e_update_rx_stats(rx_ring, total_rx_bytes, total_rx_packets);
+	return failure ? budget : (int)total_rx_packets;
+}
+
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
new file mode 100644
index 000000000000..427a844f78a7
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2018 Intel Corporation. */
+
+#ifndef _I40E_XSK_H_
+#define _I40E_XSK_H_
+
+struct i40e_vsi;
+struct xdp_umem;
+struct zero_copy_allocator;
+
+int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
+int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
+int i40e_xsk_umem_query(struct i40e_vsi *vsi, struct xdp_umem **umem,
+			u16 qid);
+int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			u16 qid);
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+
+#endif /* _I40E_XSK_H_ */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 09/11] i40e: move common Tx functions to i40e_txrx_common.h
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (7 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 10/11] i40e: add AF_XDP zero-copy Tx support Björn Töpel
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

This patch prepares for the upcoming zero-copy Tx functionality, by
moving common functions and refactor chunks of code into re-usable
functions, used both by the regular path and zero-copy path.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 35 +----------
 .../ethernet/intel/i40e/i40e_txrx_common.h    | 59 +++++++++++++++++++
 2 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 2c4d179ffebf..11e201fcb57a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -11,16 +11,6 @@
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
 
-static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
-				u32 td_tag)
-{
-	return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
-			   ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
-			   ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
-			   ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
-			   ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
-}
-
 #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
 /**
  * i40e_fdir - Generate a Flow Director descriptor based on fdata
@@ -769,8 +759,6 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
-#define WB_STRIDE 4
-
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -875,27 +863,8 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
-
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring, false);
-
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
+	i40e_arm_wb(tx_ring, vsi, budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 2bd5187fcd66..b5afd479a9c5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -28,4 +28,63 @@ void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val);
 #define I40E_XDP_TX		BIT(1)
 #define I40E_XDP_REDIR		BIT(2)
 
+/**
+ * build_ctob - Builds the Tx descriptor (cmd, offset and type) qword
+ **/
+static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
+				u32 td_tag)
+{
+	return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
+			   ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
+			   ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
+			   ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
+			   ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
+}
+
+/**
+ * i40e_update_tx_stats - Update the egress statistics for the Tx ring
+ * @tx_ring: Tx ring to update
+ * @total_packet: total packets sent
+ * @total_bytes: total bytes sent
+ **/
+static inline void i40e_update_tx_stats(struct i40e_ring *tx_ring,
+					unsigned int total_packets,
+					unsigned int total_bytes)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+}
+
+#define WB_STRIDE 4
+
+/**
+ * i40e_arm_wb - (Possibly) arms Tx write-back
+ * @tx_ring: Tx ring to update
+ * @vsi: the VSI
+ * @budget: the NAPI budget left
+ **/
+static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
+			       struct i40e_vsi *vsi,
+			       int budget)
+{
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && j > 0 &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
 #endif /* I40E_TXRX_COMMON_ */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 10/11] i40e: add AF_XDP zero-copy Tx support
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (8 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 09/11] i40e: move common Tx functions to i40e_txrx_common.h Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-28 12:44 ` [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock Björn Töpel
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

This patch adds zero-copy Tx support for AF_XDP sockets. It implements
the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
NAPI context. This means pulling egress packets from the Tx ring,
placing the frames on the NIC HW descriptor ring and completing sent
frames back to the application via the completion ring.

The regular XDP Tx ring is used for AF_XDP as well. This rationale for
this is as follows: XDP_REDIRECT guarantees mutual exclusion between
different NAPI contexts based on CPU id. In other words, a netdev can
XDP_REDIRECT to another netdev with a different NAPI context, since
the operation is bound to a specific core and each core has its own
hardware ring.

As the AF_XDP Tx action is running in the same NAPI context and using
the same ring, it will also be protected from XDP_REDIRECT actions
with the exact same mechanism.

As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
i40e_xsk.c.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   4 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   6 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 173 ++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   4 +
 4 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 848eea7c84db..5da7eb0fe4ae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3074,6 +3074,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
+	if (ring_is_xdp(ring))
+		ring->xsk_umem = i40e_xsk_umem(ring);
+
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
 		ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -12185,6 +12188,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bridge_setlink	= i40e_ndo_bridge_setlink,
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
+	.ndo_xsk_async_xmit	= i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 11e201fcb57a..37bd4e50ccde 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2597,7 +2597,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		bool wd = ring->xsk_umem ?
+			  i40e_clean_xdp_tx_irq(vsi, ring, budget) :
+			  i40e_clean_tx_irq(vsi, ring, budget);
+
+		if (!wd) {
 			clean_complete = false;
 			continue;
 		}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index bf502f2307c2..94947a826bc3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -659,3 +659,176 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+/**
+ * i40e_xmit_zc - Performs zero-copy Tx AF_XDP
+ * @xdp_ring: XDP Tx ring
+ * @budget: NAPI budget
+ *
+ * Returns true if the work is finished.
+ **/
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+	unsigned int total_packets = 0;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	bool work_done = true;
+	dma_addr_t dma;
+	u32 len;
+
+	while (budget-- > 0) {
+		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			work_done = false;
+			break;
+		}
+
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+			break;
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_BIDIRECTIONAL);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz =
+			build_ctob(I40E_TX_DESC_CMD_ICRC
+				   | I40E_TX_DESC_CMD_EOP,
+				   0, len, 0);
+		total_packets++;
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+	}
+
+	if (total_packets > 0) {
+		/* Request an interrupt for the last frame and bump tail ptr. */
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+		i40e_xdp_ring_update_tail(xdp_ring);
+
+		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
+	}
+
+	return !!budget && work_done;
+}
+
+/**
+ * i40e_clean_xdp_tx_buffer - Frees and unmaps an XDP Tx entry
+ * @tx_ring: XDP Tx ring
+ * @tx_bi: Tx buffer info to clean
+ **/
+static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
+				     struct i40e_tx_buffer *tx_bi)
+{
+	xdp_return_frame(tx_bi->xdpf);
+	dma_unmap_single(tx_ring->dev,
+			 dma_unmap_addr(tx_bi, dma),
+			 dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
+	dma_unmap_len_set(tx_bi, len, 0);
+}
+
+/**
+ * i40e_clean_xdp_tx_irq - Completes AF_XDP entries, and cleans XDP entries
+ * @tx_ring: XDP Tx ring
+ * @tx_bi: Tx buffer info to clean
+ *
+ * Returns true if cleanup/tranmission is done.
+ **/
+bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
+			   struct i40e_ring *tx_ring, int napi_budget)
+{
+	unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
+	u32 i, completed_frames, frames_ready, xsk_frames = 0;
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 head_idx = i40e_get_head(tx_ring);
+	bool work_done = true, xmit_done;
+	struct i40e_tx_buffer *tx_bi;
+
+	if (head_idx < tx_ring->next_to_clean)
+		head_idx += tx_ring->count;
+	frames_ready = head_idx - tx_ring->next_to_clean;
+
+	if (frames_ready == 0) {
+		goto out_xmit;
+	} else if (frames_ready > budget) {
+		completed_frames = budget;
+		work_done = false;
+	} else {
+		completed_frames = frames_ready;
+	}
+
+	ntc = tx_ring->next_to_clean;
+
+	for (i = 0; i < completed_frames; i++) {
+		tx_bi = &tx_ring->tx_bi[ntc];
+
+		if (tx_bi->xdpf)
+			i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
+		else
+			xsk_frames++;
+
+		tx_bi->xdpf = NULL;
+		total_bytes += tx_bi->bytecount;
+
+		if (++ntc >= tx_ring->count)
+			ntc = 0;
+	}
+
+	tx_ring->next_to_clean += completed_frames;
+	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
+		tx_ring->next_to_clean -= tx_ring->count;
+
+	if (xsk_frames)
+		xsk_umem_complete_tx(umem, xsk_frames);
+
+	i40e_arm_wb(tx_ring, vsi, budget);
+	i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
+
+out_xmit:
+	xmit_done = i40e_xmit_zc(tx_ring, budget);
+
+	return work_done && xmit_done;
+}
+
+/**
+ * i40e_xsk_async_xmit - Implements the ndo_xsk_async_xmit
+ * @dev: the netdevice
+ * @queue_id: queue id to wake up
+ *
+ * Returns <0 for errors, 0 otherwise.
+ **/
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -ENETDOWN;
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return -ENXIO;
+
+	if (queue_id >= vsi->num_queue_pairs)
+		return -ENXIO;
+
+	if (!vsi->xdp_rings[queue_id]->xsk_umem)
+		return -ENXIO;
+
+	ring = vsi->xdp_rings[queue_id];
+
+	/* The idea here is that if NAPI is running, mark a miss, so
+	 * it will run again. If not, trigger an interrupt and
+	 * schedule the NAPI from interrupt context. If NAPI would be
+	 * scheduled here, the interrupt affinity would not be
+	 * honored.
+	 */
+	if (!napi_if_scheduled_mark_missed(&ring->q_vector->napi))
+		i40e_force_wb(vsi, ring->q_vector);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
index 427a844f78a7..9038c5d5cf08 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -18,4 +18,8 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
 int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
 
+bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
+			   struct i40e_ring *tx_ring, int napi_budget);
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
+
 #endif /* _I40E_XSK_H_ */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (9 preceding siblings ...)
  2018-08-28 12:44 ` [PATCH bpf-next 10/11] i40e: add AF_XDP zero-copy Tx support Björn Töpel
@ 2018-08-28 12:44 ` Björn Töpel
  2018-08-29 12:44   ` Jesper Dangaard Brouer
  2018-08-28 12:50   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:44 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy
mode.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 samples/bpf/xdpsock_user.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 4914788b6727..b3906111bedb 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -649,6 +649,8 @@ static struct option long_options[] = {
 	{"xdp-skb", no_argument, 0, 'S'},
 	{"xdp-native", no_argument, 0, 'N'},
 	{"interval", required_argument, 0, 'n'},
+	{"zero-copy", no_argument, 0, 'z'},
+	{"copy", no_argument, 0, 'c'},
 	{0, 0, 0, 0}
 };
 
@@ -667,6 +669,8 @@ static void usage(const char *prog)
 		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
 		"  -N, --xdp-native=n	Enfore XDP native mode\n"
 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
+		"  -z, --zero-copy      Force zero-copy mode.\n"
+		"  -c, --copy           Force copy mode.\n"
 		"\n";
 	fprintf(stderr, str, prog);
 	exit(EXIT_FAILURE);
@@ -679,7 +683,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
+		c = getopt_long(argc, argv, "rtli:q:psSNn:cz", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -716,6 +720,12 @@ static void parse_command_line(int argc, char **argv)
 		case 'n':
 			opt_interval = atoi(optarg);
 			break;
+		case 'z':
+			opt_xdp_bind_flags |= XDP_ZEROCOPY;
+			break;
+		case 'c':
+			opt_xdp_bind_flags |= XDP_COPY;
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
@ 2018-08-28 12:50   ` =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
  2018-08-28 12:44 ` [PATCH bpf-next 02/11] xdp: export xdp_rxq_info_unreg_mem_model Björn Töpel
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 12:50 UTC (permalink / raw)
  To: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, ast, Jesper Dangaard Brouer, Daniel Borkmann,
	Netdev, Brandeburg, Jesse, Singhai, Anjali, peter.waskiewicz.jr,
	intel-wired-lan, Jeff Kirsher
  Cc: Björn Töpel, michael.lundkvist, Willem de Bruijn,
	John Fastabend, Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev,
	Francois Ozog, Ilias Apalodimas, Brian Brooks, William Tu, pavel,
	Zhang, Qi Z

Den tis 28 aug. 2018 kl 14:47 skrev Björn Töpel <bjorn.topel@gmail.com>:
>
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
>
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is out goal to merge the two paths
> in later patch sets.
>
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
>
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
>
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.6        8.2         15.0
> txpush       2.2        -           21.9
> l2fwd        1.7        2.3         11.3
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.0        3.3         3.3
> l2fwd        1.3        1.7         3.1
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      18.4M  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
>
> The structure of the patch set is as follows:
>
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
>
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>
>
> Magnus & Björn
>
> Björn Töpel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
>
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support
>
>  drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h        |  19 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   | 307 ++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 182 ++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  20 +-
>  .../ethernet/intel/i40e/i40e_txrx_common.h    |  90 ++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 834 ++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h    |  25 +
>  include/linux/netdevice.h                     |  26 +
>  include/net/xdp.h                             |   6 +-
>  include/net/xdp_sock.h                        |  43 +
>  net/core/xdp.c                                |  54 +-
>  net/xdp/xdp_umem.h                            |  10 -
>  samples/bpf/xdpsock_user.c                    |  12 +-
>  14 files changed, 1523 insertions(+), 108 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
>
> --
> 2.17.1
>

I was too quick on the trigger. Adding intel-wired-lan and JeffK.

@Jeff Apologies. :-(

Björn

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Intel-wired-lan] [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
@ 2018-08-28 12:50   ` =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
  0 siblings, 0 replies; 29+ messages in thread
From: =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= @ 2018-08-28 12:50 UTC (permalink / raw)
  To: intel-wired-lan

Den tis 28 aug. 2018 kl 14:47 skrev Bj?rn T?pel <bjorn.topel@gmail.com>:
>
> From: Bj?rn T?pel <bjorn.topel@intel.com>
>
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
>
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is out goal to merge the two paths
> in later patch sets.
>
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
>
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
>
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.6        8.2         15.0
> txpush       2.2        -           21.9
> l2fwd        1.7        2.3         11.3
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.0        3.3         3.3
> l2fwd        1.3        1.7         3.1
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      18.4M  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
>
> The structure of the patch set is as follows:
>
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
>
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>
>
> Magnus & Bj?rn
>
> Bj?rn T?pel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
>
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support
>
>  drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h        |  19 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   | 307 ++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 182 ++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  20 +-
>  .../ethernet/intel/i40e/i40e_txrx_common.h    |  90 ++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 834 ++++++++++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h    |  25 +
>  include/linux/netdevice.h                     |  26 +
>  include/net/xdp.h                             |   6 +-
>  include/net/xdp_sock.h                        |  43 +
>  net/core/xdp.c                                |  54 +-
>  net/xdp/xdp_umem.h                            |  10 -
>  samples/bpf/xdpsock_user.c                    |  12 +-
>  14 files changed, 1523 insertions(+), 108 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
>
> --
> 2.17.1
>

I was too quick on the trigger. Adding intel-wired-lan and JeffK.

@Jeff Apologies. :-(

Bj?rn

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  2018-08-28 12:44 ` [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY Björn Töpel
@ 2018-08-28 14:11   ` Jesper Dangaard Brouer
  2018-08-28 17:42     ` Björn Töpel
  2018-08-29 18:06   ` [bpf-next, " Maciek Fijalkowski
  1 sibling, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2018-08-28 14:11 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang,
	brouer

On Tue, 28 Aug 2018 14:44:25 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This commit adds proper MEM_TYPE_ZERO_COPY support for
> convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
> xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
> MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
> make sense to implement a more sophisticated thread-safe alloc/free
> scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
> required in the fast-path.

This is going to be slow. Especially the dev_alloc_page() call, which
for small frames is likely going to be slower than the data copy.
I guess this is a good first step, but I do hope we will circle back and
optimize this later.  (It would also be quite easy to use
MEM_TYPE_PAGE_POOL instead to get page recycling in devmap redirect case).

I would have liked the MEM_TYPE_ZERO_COPY frame to travel one level
deeper into the redirect-core code.  Allowing devmap to send these
frame without copy, and allow cpumap to do the dev_alloc_page() call
(+copy) on the remote CPU.


> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/net/xdp.h |  5 +++--
>  net/core/xdp.c    | 39 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 76b95256c266..0d5c6fb4b2e2 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>  	frame->dev_rx = NULL;
>  }
>  
> +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +
>  /* Convert xdp_buff to xdp_frame */
>  static inline
>  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>  	int metasize;
>  	int headroom;
>  
> -	/* TODO: implement clone, copy, use "native" MEM_TYPE */
>  	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> -		return NULL;
> +		return xdp_convert_zc_to_xdp_frame(xdp);
>  
>  	/* Assure headroom is available for storing info */
>  	headroom = xdp->data - xdp->data_hard_start;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 89b6785cef2a..be6cb2f0e722 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>  	info->flags = bpf->flags;
>  }
>  EXPORT_SYMBOL_GPL(xdp_attachment_setup);
> +
> +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
> +{
> +	unsigned int metasize, headroom, totsize;
> +	void *addr, *data_to_copy;
> +	struct xdp_frame *xdpf;
> +	struct page *page;
> +
> +	/* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
> +	metasize = xdp_data_meta_unsupported(xdp) ? 0 :
> +		   xdp->data - xdp->data_meta;
> +	headroom = xdp->data - xdp->data_hard_start;
> +	totsize = xdp->data_end - xdp->data + metasize;
> +
> +	if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> +		return NULL;
> +
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +
> +	addr = page_to_virt(page);
> +	xdpf = addr;
> +	memset(xdpf, 0, sizeof(*xdpf));
> +
> +	addr += sizeof(*xdpf);
> +	data_to_copy = metasize ? xdp->data_meta : xdp->data;
> +	memcpy(addr, data_to_copy, totsize);
> +
> +	xdpf->data = addr + metasize;
> +	xdpf->len = totsize - metasize;
> +	xdpf->headroom = 0;
> +	xdpf->metasize = metasize;
> +	xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> +
> +	xdp_return_buff(xdp);
> +	return xdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  2018-08-28 14:11   ` Jesper Dangaard Brouer
@ 2018-08-28 17:42     ` Björn Töpel
  0 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-28 17:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, ast, Daniel Borkmann, Netdev, Brandeburg, Jesse,
	Singhai, Anjali, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, Willem de Bruijn, John Fastabend,
	Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev, Francois Ozog

Den tis 28 aug. 2018 kl 16:11 skrev Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Tue, 28 Aug 2018 14:44:25 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This commit adds proper MEM_TYPE_ZERO_COPY support for
> > convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
> > xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
> > MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
> > make sense to implement a more sophisticated thread-safe alloc/free
> > scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
> > required in the fast-path.
>
> This is going to be slow. Especially the dev_alloc_page() call, which
> for small frames is likely going to be slower than the data copy.
> I guess this is a good first step, but I do hope we will circle back and
> optimize this later.  (It would also be quite easy to use
> MEM_TYPE_PAGE_POOL instead to get page recycling in devmap redirect case).
>

Yes, slow. :-( Still, I think this is a good starting point, and then
introduce a page pool in later performance oriented series to make XDP
faster for the AF_XDP scenario.

But I'm definitely on your side here; This need to be addressed -- but
not now IMO.


And thanks for spending time on the series!
Björn

> I would have liked the MEM_TYPE_ZERO_COPY frame to travel one level
> deeper into the redirect-core code.  Allowing devmap to send these
> frame without copy, and allow cpumap to do the dev_alloc_page() call
> (+copy) on the remote CPU.
>
>
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> > ---
> >  include/net/xdp.h |  5 +++--
> >  net/core/xdp.c    | 39 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 76b95256c266..0d5c6fb4b2e2 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
> >       frame->dev_rx = NULL;
> >  }
> >
> > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> > +
> >  /* Convert xdp_buff to xdp_frame */
> >  static inline
> >  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> > @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> >       int metasize;
> >       int headroom;
> >
> > -     /* TODO: implement clone, copy, use "native" MEM_TYPE */
> >       if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> > -             return NULL;
> > +             return xdp_convert_zc_to_xdp_frame(xdp);
> >
> >       /* Assure headroom is available for storing info */
> >       headroom = xdp->data - xdp->data_hard_start;
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 89b6785cef2a..be6cb2f0e722 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >       info->flags = bpf->flags;
> >  }
> >  EXPORT_SYMBOL_GPL(xdp_attachment_setup);
> > +
> > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
> > +{
> > +     unsigned int metasize, headroom, totsize;
> > +     void *addr, *data_to_copy;
> > +     struct xdp_frame *xdpf;
> > +     struct page *page;
> > +
> > +     /* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
> > +     metasize = xdp_data_meta_unsupported(xdp) ? 0 :
> > +                xdp->data - xdp->data_meta;
> > +     headroom = xdp->data - xdp->data_hard_start;
> > +     totsize = xdp->data_end - xdp->data + metasize;
> > +
> > +     if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> > +             return NULL;
> > +
> > +     page = dev_alloc_page();
> > +     if (!page)
> > +             return NULL;
> > +
> > +     addr = page_to_virt(page);
> > +     xdpf = addr;
> > +     memset(xdpf, 0, sizeof(*xdpf));
> > +
> > +     addr += sizeof(*xdpf);
> > +     data_to_copy = metasize ? xdp->data_meta : xdp->data;
> > +     memcpy(addr, data_to_copy, totsize);
> > +
> > +     xdpf->data = addr + metasize;
> > +     xdpf->len = totsize - metasize;
> > +     xdpf->headroom = 0;
> > +     xdpf->metasize = metasize;
> > +     xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> > +
> > +     xdp_return_buff(xdp);
> > +     return xdpf;
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
  2018-08-28 12:44 ` [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock Björn Töpel
@ 2018-08-29 12:44   ` Jesper Dangaard Brouer
  2018-08-30 10:21     ` Björn Töpel
  0 siblings, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2018-08-29 12:44 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang,
	brouer

On Tue, 28 Aug 2018 14:44:35 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy
> mode.

Nice, thanks for adding this.  It allows me to quickly test the
difference between normal-copy vs zero-copy modes.
(Kernel bpf-next without RETPOLINE).

AF_XDP RX-drop:
 Normal-copy mode: rx 13,070,318 pps - 76.5 ns
 Zero-copy   mode: rx 26,132,328 pps - 38.3 ns

Compare to XDP_DROP:  34,251,464 pps - 29.2 ns
   XDP_DROP + read :  30,756,664 pps - 32.5 ns

The normal-copy mode is surprisingly fast (and it works for every
driver implemeting the regular XDP_REDIRECT action).  It is still
faster to do in-kernel XDP_DROP than AF_XDP zero-copy mode dropping,
which was expected given frames travel to a remote CPU before returned
(don't think remote CPU reads payload?).  The gap in nanosec is
actually quite small, thus I'm impressed by the SPSC-queue
implementation working across these CPUs.


AF_XDP layer2-fwd:
 Normal-copy mode: rx  3,200,885   tx  3,200,892
 Zero-copy   mode: rx 17,026,300   tx 17,026,269

Compare to XDP_TX: rx 14,529,079   tx 14,529,850  - 68.82 ns
     XDP_REDIRECT: rx 13,235,785   tx 13,235,784  - 75.55 ns

The copy-mode is slow because it allocates SKBs internally (I do
wonder if we could speed it up by using ndo_xdp_xmit + disable-BH).
More intersting is that the zero-copy is faster than XDP_TX and
XDP_REDIRECT. I think the speedup comes from avoiding some DMA mapping
calls with ZC.

Side-note: XDP_TX vs. REDIRECT: 75.55 - 68.82 = 6.73 ns.  The cost of
going through the xdp_do_redirect_map core is actually quite small :-)
(I have some micro optimizations that should help ~2ns).


AF_XDP TX-only:
 Normal-copy mode: tx  2,853,461 pps
 Zero-copy   mode: tx 22,255,311 pps

(There is not XDP mode that does TX to compare against)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (11 preceding siblings ...)
  2018-08-28 12:50   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
@ 2018-08-29 16:12 ` Daniel Borkmann
  2018-08-30  0:10   ` William Tu
  2018-08-30  9:05   ` Björn Töpel
  2018-08-29 19:19 ` [RFC] net: xsk: add a simple buffer reuse queue Jakub Kicinski
  2018-08-29 19:39 ` [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Alexei Starovoitov
  14 siblings, 2 replies; 29+ messages in thread
From: Daniel Borkmann @ 2018-08-29 16:12 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, magnus.karlsson,
	alexander.h.duyck, alexander.duyck, ast, brouer, netdev,
	jesse.brandeburg, anjali.singhai, peter.waskiewicz.jr
  Cc: Björn Töpel, michael.lundkvist, willemdebruijn.kernel,
	john.fastabend, jakub.kicinski, neerav.parikh, mykyta.iziumtsev,
	francois.ozog, ilias.apalodimas, brian.brooks, u9012063, pavel,
	qi.z.zhang

On 08/28/2018 02:44 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
> 
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is out goal to merge the two paths
> in later patch sets.
> 
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
> 
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
> 
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.6        8.2         15.0
> txpush       2.2        -           21.9
> l2fwd        1.7        2.3         11.3
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.0        3.3         3.3
> l2fwd        1.3        1.7         3.1
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      18.4M  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
> 
> The structure of the patch set is as follows:
> 
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
> 
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> 
> 
> Magnus & Björn
> 
> Björn Töpel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> 
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support

Thanks for working on this, LGTM! Are you also planning to get ixgbe
out after that?

For the series:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [bpf-next, 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  2018-08-28 12:44 ` [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY Björn Töpel
  2018-08-28 14:11   ` Jesper Dangaard Brouer
@ 2018-08-29 18:06   ` Maciek Fijalkowski
  1 sibling, 0 replies; 29+ messages in thread
From: Maciek Fijalkowski @ 2018-08-29 18:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr
  Cc: michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang

From: Maciej Fijalkowski <macfij7@wp.pl>

> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This commit adds proper MEM_TYPE_ZERO_COPY support for
> convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
> xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
> MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
> make sense to implement a more sophisticated thread-safe alloc/free
> scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
> required in the fast-path.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/net/xdp.h |  5 +++--
>  net/core/xdp.c    | 39 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 76b95256c266..0d5c6fb4b2e2 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>  	frame->dev_rx = NULL;
>  }
>  
> +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +
>  /* Convert xdp_buff to xdp_frame */
>  static inline
>  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>  	int metasize;
>  	int headroom;
>  
> -	/* TODO: implement clone, copy, use "native" MEM_TYPE */
>  	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> -		return NULL;
> +		return xdp_convert_zc_to_xdp_frame(xdp);
>  
>  	/* Assure headroom is available for storing info */
>  	headroom = xdp->data - xdp->data_hard_start;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 89b6785cef2a..be6cb2f0e722 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>  	info->flags = bpf->flags;
>  }
>  EXPORT_SYMBOL_GPL(xdp_attachment_setup);
> +
> +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
> +{
> +	unsigned int metasize, headroom, totsize;
> +	void *addr, *data_to_copy;
> +	struct xdp_frame *xdpf;
> +	struct page *page;
> +
> +	/* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
> +	metasize = xdp_data_meta_unsupported(xdp) ? 0 :
> +		   xdp->data - xdp->data_meta;
> +	headroom = xdp->data - xdp->data_hard_start;

You actually don't use the headroom calculated here;
xdpf->headroom is assigned to 0 - was this your intention?
If so, the local variable can be removed.

> +	totsize = xdp->data_end - xdp->data + metasize;
> +
> +	if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> +		return NULL;
> +
> +	page = dev_alloc_page();
> +	if (!page)
> +		return NULL;
> +
> +	addr = page_to_virt(page);
> +	xdpf = addr;
> +	memset(xdpf, 0, sizeof(*xdpf));
> +
> +	addr += sizeof(*xdpf);
> +	data_to_copy = metasize ? xdp->data_meta : xdp->data;
> +	memcpy(addr, data_to_copy, totsize);
> +
> +	xdpf->data = addr + metasize;
> +	xdpf->len = totsize - metasize;
> +	xdpf->headroom = 0;
> +	xdpf->metasize = metasize;
> +	xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> +
> +	xdp_return_buff(xdp);
> +	return xdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
  2018-08-28 12:44 ` [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support Björn Töpel
@ 2018-08-29 19:14   ` Jakub Kicinski
  2018-08-30 12:06     ` Björn Töpel
  2018-08-29 19:22   ` Alexei Starovoitov
  1 sibling, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2018-08-29 19:14 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	neerav.parikh, mykyta.iziumtsev, francois.ozog, ilias.apalodimas,
	brian.brooks, u9012063, pavel, qi.z.zhang

On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
> queue.
> 
> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
> 
> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> will allocate a new buffer and copy the zero-copy frame prior passing
> it to the kernel stack.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Mm.. I'm surprised you don't run into buffer reuse issues that I had
when playing with AF_XDP.  What happens in i40e if someone downs the
interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?

I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
FWIW.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC] net: xsk: add a simple buffer reuse queue
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (12 preceding siblings ...)
  2018-08-29 16:12 ` Daniel Borkmann
@ 2018-08-29 19:19 ` Jakub Kicinski
  2018-08-31  8:34   ` Björn Töpel
  2018-08-29 19:39 ` [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Alexei Starovoitov
  14 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2018-08-29 19:19 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	neerav.parikh, mykyta.iziumtsev, francois.ozog, ilias.apalodimas,
	brian.brooks, u9012063, pavel, qi.z.zhang

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/xdp_sock.h | 44 +++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.c     |  2 ++
 net/xdp/xsk_queue.c    | 56 ++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h    |  3 +++
 4 files changed, 105 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 6871e4755975..108c1c100de4 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,7 @@
 #include <net/sock.h>
 
 struct net_device;
+struct xdp_umem_fq_reuse;
 struct xsk_queue;
 
 struct xdp_umem_props {
@@ -41,6 +42,7 @@ struct xdp_umem {
 	struct page **pgs;
 	u32 npgs;
 	struct net_device *dev;
+	struct xdp_umem_fq_reuse *fq_reuse;
 	u16 queue_id;
 	bool zc;
 	spinlock_t xsk_list_lock;
@@ -110,4 +112,46 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
 }
 
+struct xdp_umem_fq_reuse {
+	u32 nentries;
+	u32 length;
+	u64 handles[];
+};
+
+/* Following functions are not thread-safe in any way */
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+					  struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (!rq->length) {
+		return xsk_umem_peek_addr(umem, addr);
+	} else {
+		*addr = rq->handles[rq->length - 1];
+		return addr;
+	}
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (!rq->length)
+		xsk_umem_discard_addr(umem);
+	else
+		rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	rq->handles[rq->length++] = addr;
+}
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index e762310c9bee..40303e24c954 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -170,6 +170,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->cq = NULL;
 	}
 
+	xsk_reuseq_destroy(umem);
+
 	xdp_umem_unpin_pages(umem);
 
 	task = get_pid_task(umem->pid, PIDTYPE_PID);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 6c32e92e98fc..f9ee40a13a9a 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
  * Copyright(c) 2018 Intel Corporation.
  */
 
+#include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/overflow.h>
 
 #include "xsk_queue.h"
 
@@ -61,3 +63,57 @@ void xskq_destroy(struct xsk_queue *q)
 	page_frag_free(q->ring);
 	kfree(q);
 }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+	struct xdp_umem_fq_reuse *newq;
+
+	/* Check for overflow */
+	if (nentries > (u32)roundup_pow_of_two(nentries))
+		return NULL;
+	nentries = roundup_pow_of_two(nentries);
+
+	newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+	if (!newq)
+		return NULL;
+	memset(newq, 0, offsetof(typeof(*newq), handles));
+
+	newq->nentries = nentries;
+	return newq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
+
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+					  struct xdp_umem_fq_reuse *newq)
+{
+	struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
+
+	if (!oldq) {
+		umem->fq_reuse = newq;
+		return NULL;
+	}
+
+	if (newq->nentries < oldq->length)
+		return newq;
+
+
+	memcpy(newq->handles, oldq->handles,
+	       array_size(oldq->length, sizeof(u64)));
+	newq->length = oldq->length;
+
+	umem->fq_reuse = newq;
+	return oldq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
+
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+	kvfree(rq);
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_free);
+
+void xsk_reuseq_destroy(struct xdp_umem *umem)
+{
+	xsk_reuseq_free(umem->fq_reuse);
+	umem->fq_reuse = NULL;
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 8a64b150be54..7a480e3eb35d 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -257,4 +257,7 @@ void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q_ops);
 
+/* Executed by the core when the entire UMEM gets freed */
+void xsk_reuseq_destroy(struct xdp_umem *umem);
+
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
  2018-08-28 12:44 ` [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support Björn Töpel
  2018-08-29 19:14   ` Jakub Kicinski
@ 2018-08-29 19:22   ` Alexei Starovoitov
  1 sibling, 0 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2018-08-29 19:22 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang

On Tue, Aug 28, 2018 at 02:44:32PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
> queue.
> 
> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
> 
> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> will allocate a new buffer and copy the zero-copy frame prior passing
> it to the kernel stack.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
...
> +/**
> + * i40e_construct_skb_zc - Create skbufff from zero-copy Rx buffer

typo sk_buff

overall the set looks great to me. The driver side changes look
particularly clean and independent from skb path of the driver.
I think the issues brought up by other reviews are not blockers
for the set, but certainly should be addressed in the follow up patches.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
  2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
                   ` (13 preceding siblings ...)
  2018-08-29 19:19 ` [RFC] net: xsk: add a simple buffer reuse queue Jakub Kicinski
@ 2018-08-29 19:39 ` Alexei Starovoitov
  14 siblings, 0 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2018-08-29 19:39 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	jakub.kicinski, neerav.parikh, mykyta.iziumtsev, francois.ozog,
	ilias.apalodimas, brian.brooks, u9012063, pavel, qi.z.zhang

On Tue, Aug 28, 2018 at 02:44:24PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
> 
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is out goal to merge the two paths
> in later patch sets.
> 
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
> 
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
> 
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.6        8.2         15.0
> txpush       2.2        -           21.9
> l2fwd        1.7        2.3         11.3
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.0        3.3         3.3
> l2fwd        1.3        1.7         3.1
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      18.4M  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
> 
> The structure of the patch set is as follows:
> 
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
> 
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> 
> 
> Magnus & Björn
> 
> Björn Töpel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> 
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support

Applied to bpf-next.
Thanks a lot to the authors, code reviewers and testers.

I hope all the issues brought up during code review can be quickly
addressed in the follow up patches and zerocopy feature in i40e
and other drivers will be mainline ready for the next merge window.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
  2018-08-29 16:12 ` Daniel Borkmann
@ 2018-08-30  0:10   ` William Tu
  2018-08-30  9:05   ` Björn Töpel
  1 sibling, 0 replies; 29+ messages in thread
From: William Tu @ 2018-08-30  0:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Björn Töpel, Karlsson, Magnus, magnus.karlsson,
	Alexander Duyck, Alexander Duyck, Alexei Starovoitov,
	Jesper Dangaard Brouer, Linux Kernel Network Developers,
	Brandeburg, Jesse, Anjali Singhai Jain, peter.waskiewicz.jr,
	Björn Töpel, michael.lundkvist, Willem de Bruijn,
	John Fastabend, Jakub Kicinski, neerav.parikh, mykyta.iziumtsev

> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>

I currently don't have i40e nic to test, so
I'm also looking forward to the ixgbe patch!

Thank you
William

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
  2018-08-29 16:12 ` Daniel Borkmann
  2018-08-30  0:10   ` William Tu
@ 2018-08-30  9:05   ` Björn Töpel
  1 sibling, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-30  9:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, ast, Jesper Dangaard Brouer, Netdev, Brandeburg,
	Jesse, Singhai, Anjali, peter.waskiewicz.jr,
	Björn Töpel, michael.lundkvist, Willem de Bruijn,
	John Fastabend, Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev,
	Francois Ozog

Den ons 29 aug. 2018 kl 18:12 skrev Daniel Borkmann <daniel@iogearbox.net>:
>
> On 08/28/2018 02:44 PM, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This patch set introduces zero-copy AF_XDP support for Intel's i40e
> > driver. In the first preparatory patch we also add support for
> > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> > redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> > set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> > Brouer for reviewing earlier versions of this patch set.
> >
> > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> > that in the interest of time, to get an AF_XDP zero-copy implementation
> > out there for people to try, some code paths have been copied from the
> > XDP path to the zero-copy path. It is out goal to merge the two paths
> > in later patch sets.
> >
> > In contrast to the implementation from beginning of June, this patch
> > set does not require any extra HW queues for AF_XDP zero-copy
> > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> > AF_XDP zero-copy TX.
> >
> > Jeff, given that most of changes are in i40e, it is up to you how you
> > would like to route these patches. The set is tagged bpf-next, but
> > if taking it via the Intel driver tree is easier, let us know.
> >
> > We have run some benchmarks on a dual socket system with two Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> > cores which gives a total of 28, but only two cores are used in these
> > experiments. One for TR/RX and one for the user space application. The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> > NIC is Intel I40E 40Gbit/s using the i40e driver.
> >
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > and 1500 byte packets, generated by a commercial packet generator HW
> > outputing packets at full 40 Gbit/s line rate. The results are with
> > retpoline and all other spectre and meltdown fixes, so these results
> > are not comparable to the ones from the zero-copy patch set in June.
> >
> > AF_XDP performance 64 byte packets.
> > Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> > rxdrop       2.6        8.2         15.0
> > txpush       2.2        -           21.9
> > l2fwd        1.7        2.3         11.3
> >
> > AF_XDP performance 1500 byte packets:
> > Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> > rxdrop       2.0        3.3         3.3
> > l2fwd        1.3        1.7         3.1
> >
> > XDP performance on our system as a base line:
> >
> > 64 byte packets:
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      16      18.4M  0
> >
> > 1500 byte packets:
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      16      3.3M    0
> >
> > The structure of the patch set is as follows:
> >
> > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> > Patches 2-4: Preparatory patches to common xsk and net code
> > Patches 5-7: Preparatory patches to i40e driver code for RX
> > Patch 8: i40e zero-copy support for RX
> > Patch 9: Preparatory patch to i40e driver code for TX
> > Patch 10: i40e zero-copy support for TX
> > Patch 11: Add flags to sample application to force zero-copy/copy mode
> >
> > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >
> >
> > Magnus & Björn
> >
> > Björn Töpel (8):
> >   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
> >   xdp: export xdp_rxq_info_unreg_mem_model
> >   xsk: expose xdp_umem_get_{data,dma} to drivers
> >   i40e: added queue pair disable/enable functions
> >   i40e: refactor Rx path for re-use
> >   i40e: move common Rx functions to i40e_txrx_common.h
> >   i40e: add AF_XDP zero-copy Rx support
> >   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> >
> > Magnus Karlsson (3):
> >   net: add napi_if_scheduled_mark_missed
> >   i40e: move common Tx functions to i40e_txrx_common.h
> >   i40e: add AF_XDP zero-copy Tx support
>
> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>

Yes, the plan is to get ixgbe out as the next driver, but we'll focus
on the existing i40e issues first.


Thanks,
Björn

> For the series:
>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
  2018-08-29 12:44   ` Jesper Dangaard Brouer
@ 2018-08-30 10:21     ` Björn Töpel
  0 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-30 10:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, ast, Daniel Borkmann, Netdev, Brandeburg, Jesse,
	Singhai, Anjali, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, Willem de Bruijn, John Fastabend,
	Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev, Francois Ozog

Den ons 29 aug. 2018 kl 14:44 skrev Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Tue, 28 Aug 2018 14:44:35 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy
> > mode.
>
> Nice, thanks for adding this.  It allows me to quickly test the
> difference between normal-copy vs zero-copy modes.
> (Kernel bpf-next without RETPOLINE).
>
> AF_XDP RX-drop:
>  Normal-copy mode: rx 13,070,318 pps - 76.5 ns
>  Zero-copy   mode: rx 26,132,328 pps - 38.3 ns
>
> Compare to XDP_DROP:  34,251,464 pps - 29.2 ns
>    XDP_DROP + read :  30,756,664 pps - 32.5 ns
>
> The normal-copy mode is surprisingly fast (and it works for every
> driver implemeting the regular XDP_REDIRECT action).  It is still
> faster to do in-kernel XDP_DROP than AF_XDP zero-copy mode dropping,
> which was expected given frames travel to a remote CPU before returned
> (don't think remote CPU reads payload?).  The gap in nanosec is
> actually quite small, thus I'm impressed by the SPSC-queue
> implementation working across these CPUs.
>
>
> AF_XDP layer2-fwd:
>  Normal-copy mode: rx  3,200,885   tx  3,200,892
>  Zero-copy   mode: rx 17,026,300   tx 17,026,269
>
> Compare to XDP_TX: rx 14,529,079   tx 14,529,850  - 68.82 ns
>      XDP_REDIRECT: rx 13,235,785   tx 13,235,784  - 75.55 ns
>
> The copy-mode is slow because it allocates SKBs internally (I do
> wonder if we could speed it up by using ndo_xdp_xmit + disable-BH).
> More intersting is that the zero-copy is faster than XDP_TX and
> XDP_REDIRECT. I think the speedup comes from avoiding some DMA mapping
> calls with ZC.
>
> Side-note: XDP_TX vs. REDIRECT: 75.55 - 68.82 = 6.73 ns.  The cost of
> going through the xdp_do_redirect_map core is actually quite small :-)
> (I have some micro optimizations that should help ~2ns).
>
>
> AF_XDP TX-only:
>  Normal-copy mode: tx  2,853,461 pps
>  Zero-copy   mode: tx 22,255,311 pps
>
> (There is not XDP mode that does TX to compare against)
>

Kudos for doing the in-depth benchmarking!


Thanks!
Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
  2018-08-29 19:14   ` Jakub Kicinski
@ 2018-08-30 12:06     ` Björn Töpel
  2018-08-31  7:55       ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Björn Töpel @ 2018-08-30 12:06 UTC (permalink / raw)
  To: Jakub Kicinski, Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, michael.lundkvist,
	willemdebruijn.kernel, john.fastabend, neerav.parikh,
	mykyta.iziumtsev, francois.ozog, ilias.apalodimas, brian.brooks,
	u9012063, pavel, qi.z.zhang

On 2018-08-29 21:14, Jakub Kicinski wrote:
 > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
 >> From: Björn Töpel <bjorn.topel@intel.com>
 >>
 >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
 >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
 >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
 >> queue.
 >>
 >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
 >>
 >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
 >> will allocate a new buffer and copy the zero-copy frame prior passing
 >> it to the kernel stack.
 >>
 >> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
 >
 > Mm.. I'm surprised you don't run into buffer reuse issues that I had
 > when playing with AF_XDP.  What happens in i40e if someone downs the
 > interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
 >

The UMEM will linger in the driver until the sockets are dead.

 > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
 > FWIW.
 >

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring to be completed at a later point when DMA is
complete. Analogous for the Tx side; The driver picks buffers off the
Tx ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), what should be
done with the Rx and Tx buffers that has been given to the driver?

So, to frame the problem: What should a driver do when this happens,
so that buffers aren't leaked?

Note that when the UMEM is going down, there's no need to complete
anything, since the sockets are dying/dead already.

This is, as you state, a missing piece in the implementation and needs
to be fixed.

Now on to possible solutions:

1. Complete the buffers back to the user. For Tx, this is probably the
    best way -- just place the buffers onto the completion ring.

    For Rx, we can give buffers back to user space by setting the
    length in the Rx descriptor to zero And putting them on the Rx
    ring. However, one complication here is that we do not have any
    back-pressure mechanism for the Rx side like we have on Tx. If the
    Rx ring(s) is (are) full the kernel will have to leak them or
    implement a retry mechanism (ugly and should be avoided).

    Another option to solve this without needing any retry or leaking
    for Rx is to implement the same back-pressure mechanism that we
    have on the Tx path in the Rx path. In the Tx path, the driver will
    only get a Tx packet to send if there is space for it in the
    completion ring. On Rx, this would be that the driver would only
    get a buffer from the fill ring if there is space for it to put it
    on the Rx ring. The drawback of this is that it would likely impact
    performance negatively since the Rx ring would have to be touch one
    more time (in the Tx path, it increased performance since it made
    it possible to implement the Tx path without any buffering), but it
    would guarantee that all buffers can always be returned to user
    space making solution this a viable option.

2. Store the buffers internally in the driver, and make sure that they
    are inserted into the "normal flow" again. For Rx that would be
    putting the buffers back into the allocation scheme that the driver
    is using. For Tx, placing the buffers back onto the Tx HW ring
    (plus all the logic for making sure that all corner cases work).

3. Mark the socket(s) as in error state, en require the user to redo
    the setup. This is bit harsh...

For i40e I think #2 for Rx (buffers reside in kernel, return to
allocator) and #1 for Tx (complete to userland).

Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
of extending the umem with the "reuse queue". This decision is really
up the driver. Some driver might prefer another scheme, or simply
prefer storing the buffers in another manner.

Looking forward, as both you and Jesper has alluded to, we need a
proper allocator for zero-copy. Then it would be a matter of injecting
the Rx buffers back to the allocator.

I'll send out a patch, so that we don't leak the buffers, but I'd like
to hear your thoughts on what the behavior should be.

And what should the behavior be when the netdev is removed from the
kernel?

And thanks for looking into the code!


Björn

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
  2018-08-30 12:06     ` Björn Töpel
@ 2018-08-31  7:55       ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2018-08-31  7:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson,
	alexander.h.duyck, alexander.duyck, ast, brouer, daniel, netdev,
	jesse.brandeburg, anjali.singhai, peter.waskiewicz.jr,
	michael.lundkvist, willemdebruijn.kernel, john.fastabend,
	neerav.parikh, mykyta.iziumtsev, francois.ozog, ilias.apalodimas,
	brian.brooks, u9012063, pavel, qi.z.zhang

On Thu, 30 Aug 2018 14:06:24 +0200, Björn Töpel wrote:
> On 2018-08-29 21:14, Jakub Kicinski wrote:
>  > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:  
>  >> From: Björn Töpel <bjorn.topel@intel.com>
>  >>
>  >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
>  >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
>  >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
>  >> queue.
>  >>
>  >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
>  >>
>  >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
>  >> will allocate a new buffer and copy the zero-copy frame prior passing
>  >> it to the kernel stack.
>  >>
>  >> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>  
>  >
>  > Mm.. I'm surprised you don't run into buffer reuse issues that I had
>  > when playing with AF_XDP.  What happens in i40e if someone downs the
>  > interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
>  >  
> 
> The UMEM will linger in the driver until the sockets are dead.
> 
>  > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
>  > FWIW.
>  >  
> 
> Some background for folks that don't know the details: A zero-copy
> capable driver picks buffers off the fill ring and places them on the
> hardware Rx ring to be completed at a later point when DMA is
> complete. Analogous for the Tx side; The driver picks buffers off the
> Tx ring and places them on the Tx hardware ring.
> 
> In the typical flow, the Rx buffer will be placed onto an Rx ring
> (completed to the user), and the Tx buffer will be placed on the
> completion ring to notify the user that the transfer is done.
> 
> However, if the driver needs to tear down the hardware rings for some
> reason (interface goes down, reconfiguration and such), what should be
> done with the Rx and Tx buffers that has been given to the driver?
> 
> So, to frame the problem: What should a driver do when this happens,
> so that buffers aren't leaked?
> 
> Note that when the UMEM is going down, there's no need to complete
> anything, since the sockets are dying/dead already.
> 
> This is, as you state, a missing piece in the implementation and needs
> to be fixed.
> 
> Now on to possible solutions:
> 
> 1. Complete the buffers back to the user. For Tx, this is probably the
>     best way -- just place the buffers onto the completion ring.
> 
>     For Rx, we can give buffers back to user space by setting the
>     length in the Rx descriptor to zero And putting them on the Rx
>     ring. However, one complication here is that we do not have any
>     back-pressure mechanism for the Rx side like we have on Tx. If the
>     Rx ring(s) is (are) full the kernel will have to leak them or
>     implement a retry mechanism (ugly and should be avoided).
> 
>     Another option to solve this without needing any retry or leaking
>     for Rx is to implement the same back-pressure mechanism that we
>     have on the Tx path in the Rx path. In the Tx path, the driver will
>     only get a Tx packet to send if there is space for it in the
>     completion ring. On Rx, this would be that the driver would only
>     get a buffer from the fill ring if there is space for it to put it
>     on the Rx ring. The drawback of this is that it would likely impact
>     performance negatively since the Rx ring would have to be touch one
>     more time (in the Tx path, it increased performance since it made
>     it possible to implement the Tx path without any buffering), but it
>     would guarantee that all buffers can always be returned to user
>     space making solution this a viable option.
> 
> 2. Store the buffers internally in the driver, and make sure that they
>     are inserted into the "normal flow" again. For Rx that would be
>     putting the buffers back into the allocation scheme that the driver
>     is using. For Tx, placing the buffers back onto the Tx HW ring
>     (plus all the logic for making sure that all corner cases work).
> 
> 3. Mark the socket(s) as in error state, en require the user to redo
>     the setup. This is bit harsh...

That's a good summary, one more (bad) option:

4. Disallow any operations on the device which could lead to RX buffer
   discards if any UMEMs are attached.

> For i40e I think #2 for Rx (buffers reside in kernel, return to
> allocator) and #1 for Tx (complete to userland).
> 
> Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
> of extending the umem with the "reuse queue". This decision is really
> up the driver. Some driver might prefer another scheme, or simply
> prefer storing the buffers in another manner.

The only performance cost is the extra pointer in xdp_umem.  Drivers
have to opt-in by using the *_rq() helpers.  We can move the extra
pointer to driver structs, and have them pass it to the helpers, so that
would be zero extra cost, and we can reuse the implementation.

> Looking forward, as both you and Jesper has alluded to, we need a
> proper allocator for zero-copy. Then it would be a matter of injecting
> the Rx buffers back to the allocator.
> 
> I'll send out a patch, so that we don't leak the buffers, but I'd like
> to hear your thoughts on what the behavior should be.
> 
> And what should the behavior be when the netdev is removed from the
> kernel?

What would AF_PACKET do, return ENXIO?

> And thanks for looking into the code!
> 
> Björn

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] net: xsk: add a simple buffer reuse queue
  2018-08-29 19:19 ` [RFC] net: xsk: add a simple buffer reuse queue Jakub Kicinski
@ 2018-08-31  8:34   ` Björn Töpel
  0 siblings, 0 replies; 29+ messages in thread
From: Björn Töpel @ 2018-08-31  8:34 UTC (permalink / raw)
  To: Jakub Kicinski, Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, michael.lundkvist,
	willemdebruijn.kernel, john.fastabend, neerav.parikh,
	mykyta.iziumtsev, francois.ozog, ilias.apalodimas, brian.brooks,
	u9012063, pavel, qi.z.zhang

On 2018-08-29 21:19, Jakub Kicinski wrote:
> XSK UMEM is strongly single producer single consumer so reuse of
> frames is challenging.  Add a simple "stash" of FILL packets to
> reuse for drivers to optionally make use of.  This is useful
> when driver has to free (ndo_stop) or resize a ring with an active
> AF_XDP ZC socket.
>

I'll take a stab using this in i40e. I have a couple of
comments/thoughts on this RFC, but let me get back when I have an actual
patch in place. :-)


Thanks!
Björn


> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>   include/net/xdp_sock.h | 44 +++++++++++++++++++++++++++++++++
>   net/xdp/xdp_umem.c     |  2 ++
>   net/xdp/xsk_queue.c    | 56 ++++++++++++++++++++++++++++++++++++++++++
>   net/xdp/xsk_queue.h    |  3 +++
>   4 files changed, 105 insertions(+)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 6871e4755975..108c1c100de4 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -14,6 +14,7 @@
>   #include <net/sock.h>
>   
>   struct net_device;
> +struct xdp_umem_fq_reuse;
>   struct xsk_queue;
>   
>   struct xdp_umem_props {
> @@ -41,6 +42,7 @@ struct xdp_umem {
>   	struct page **pgs;
>   	u32 npgs;
>   	struct net_device *dev;
> +	struct xdp_umem_fq_reuse *fq_reuse;
>   	u16 queue_id;
>   	bool zc;
>   	spinlock_t xsk_list_lock;
> @@ -110,4 +112,46 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
>   	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
>   }
>   
> +struct xdp_umem_fq_reuse {
> +	u32 nentries;
> +	u32 length;
> +	u64 handles[];
> +};
> +
> +/* Following functions are not thread-safe in any way */
> +struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
> +struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
> +					  struct xdp_umem_fq_reuse *newq);
> +void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
> +
> +/* Reuse-queue aware version of FILL queue helpers */
> +static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
> +{
> +	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
> +
> +	if (!rq->length) {
> +		return xsk_umem_peek_addr(umem, addr);
> +	} else {
> +		*addr = rq->handles[rq->length - 1];
> +		return addr;
> +	}
> +}
> +
> +static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
> +{
> +	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
> +
> +	if (!rq->length)
> +		xsk_umem_discard_addr(umem);
> +	else
> +		rq->length--;
> +}
> +
> +static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
> +{
> +	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
> +
> +	rq->handles[rq->length++] = addr;
> +}
> +
>   #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index e762310c9bee..40303e24c954 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -170,6 +170,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
>   		umem->cq = NULL;
>   	}
>   
> +	xsk_reuseq_destroy(umem);
> +
>   	xdp_umem_unpin_pages(umem);
>   
>   	task = get_pid_task(umem->pid, PIDTYPE_PID);
> diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
> index 6c32e92e98fc..f9ee40a13a9a 100644
> --- a/net/xdp/xsk_queue.c
> +++ b/net/xdp/xsk_queue.c
> @@ -3,7 +3,9 @@
>    * Copyright(c) 2018 Intel Corporation.
>    */
>   
> +#include <linux/log2.h>
>   #include <linux/slab.h>
> +#include <linux/overflow.h>
>   
>   #include "xsk_queue.h"
>   
> @@ -61,3 +63,57 @@ void xskq_destroy(struct xsk_queue *q)
>   	page_frag_free(q->ring);
>   	kfree(q);
>   }
> +
> +struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
> +{
> +	struct xdp_umem_fq_reuse *newq;
> +
> +	/* Check for overflow */
> +	if (nentries > (u32)roundup_pow_of_two(nentries))
> +		return NULL;
> +	nentries = roundup_pow_of_two(nentries);
> +
> +	newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
> +	if (!newq)
> +		return NULL;
> +	memset(newq, 0, offsetof(typeof(*newq), handles));
> +
> +	newq->nentries = nentries;
> +	return newq;
> +}
> +EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
> +
> +struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
> +					  struct xdp_umem_fq_reuse *newq)
> +{
> +	struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
> +
> +	if (!oldq) {
> +		umem->fq_reuse = newq;
> +		return NULL;
> +	}
> +
> +	if (newq->nentries < oldq->length)
> +		return newq;
> +
> +
> +	memcpy(newq->handles, oldq->handles,
> +	       array_size(oldq->length, sizeof(u64)));
> +	newq->length = oldq->length;
> +
> +	umem->fq_reuse = newq;
> +	return oldq;
> +}
> +EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
> +
> +void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
> +{
> +	kvfree(rq);
> +}
> +EXPORT_SYMBOL_GPL(xsk_reuseq_free);
> +
> +void xsk_reuseq_destroy(struct xdp_umem *umem)
> +{
> +	xsk_reuseq_free(umem->fq_reuse);
> +	umem->fq_reuse = NULL;
> +}
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 8a64b150be54..7a480e3eb35d 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -257,4 +257,7 @@ void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
>   struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
>   void xskq_destroy(struct xsk_queue *q_ops);
>   
> +/* Executed by the core when the entire UMEM gets freed */
> +void xsk_reuseq_destroy(struct xdp_umem *umem);
> +
>   #endif /* _LINUX_XSK_QUEUE_H */
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-08-31 12:41 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-28 12:44 [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY Björn Töpel
2018-08-28 14:11   ` Jesper Dangaard Brouer
2018-08-28 17:42     ` Björn Töpel
2018-08-29 18:06   ` [bpf-next, " Maciek Fijalkowski
2018-08-28 12:44 ` [PATCH bpf-next 02/11] xdp: export xdp_rxq_info_unreg_mem_model Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 03/11] xsk: expose xdp_umem_get_{data,dma} to drivers Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 04/11] net: add napi_if_scheduled_mark_missed Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 05/11] i40e: added queue pair disable/enable functions Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 06/11] i40e: refactor Rx path for re-use Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 07/11] i40e: move common Rx functions to i40e_txrx_common.h Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support Björn Töpel
2018-08-29 19:14   ` Jakub Kicinski
2018-08-30 12:06     ` Björn Töpel
2018-08-31  7:55       ` Jakub Kicinski
2018-08-29 19:22   ` Alexei Starovoitov
2018-08-28 12:44 ` [PATCH bpf-next 09/11] i40e: move common Tx functions to i40e_txrx_common.h Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 10/11] i40e: add AF_XDP zero-copy Tx support Björn Töpel
2018-08-28 12:44 ` [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock Björn Töpel
2018-08-29 12:44   ` Jesper Dangaard Brouer
2018-08-30 10:21     ` Björn Töpel
2018-08-28 12:50 ` [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Björn Töpel
2018-08-28 12:50   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2018-08-29 16:12 ` Daniel Borkmann
2018-08-30  0:10   ` William Tu
2018-08-30  9:05   ` Björn Töpel
2018-08-29 19:19 ` [RFC] net: xsk: add a simple buffer reuse queue Jakub Kicinski
2018-08-31  8:34   ` Björn Töpel
2018-08-29 19:39 ` [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.