* [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
@ 2018-05-15 19:06 ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This RFC introduces zerocopy (ZC) support for AF_XDP. Programs using
AF_XDP sockets will now receive RX packets without any copies and can
also transmit packets without incurring any copies. No modifications
to the application are needed, but the NIC driver needs to be modified
to support ZC. If ZC is not supported by the driver, the modes
introduced in the AF_XDP patch will be used. Using ZC in our
micro benchmarks results in significantly improved performance as can
be seen in the performance section later in this cover letter.

Note that we did not post this as a proper patch set, as suggested by
Alexei, mainly for one reason. The i40e modifications need to be
fully and properly implemented (we need support for dynamically
creating and removing queues in the driver), split up into multiple
patches, and then reviewed and QA'd by the Intel NIC team before they
can become a proper patch set. We simply did not have time to finish
all of this in this merge window.

Alexei had two concerns in conjunction with adding ZC support to
AF_XDP: show that the user interface holds up and can deliver good
performance for ZC, and that the driver interfaces for ZC are good. We
think that this patch set shows that we have addressed the first
issue: performance is good and there is no change to the uapi. But
please take a look at the code and see if you like the ZC interfaces,
which was the second concern.

Note that for an untrusted application, HW packet steering to a
specific queue pair (the one associated with the application) is a
requirement when using ZC, as the application would otherwise be able
to see other user space processes' packets. If the HW cannot support
the required packet steering, you need to use the XDP_SKB mode or the
XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
patch set can be used to do load balancing in that case.
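
For reference, redirecting into AF_XDP sockets is done with an XDP
program and an XSKMAP; a load balancer would compute the map key
(e.g. from a flow hash) instead of the Rx queue index used in this
minimal sketch. The map name xsks_map and its size are assumptions
here, not part of this patch set:

/* Sketch of an XDP program steering packets into AF_XDP sockets via
 * an XSKMAP keyed by Rx queue index. Assumes the sample helpers from
 * samples/bpf.
 */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsks_map = {
	.type = BPF_MAP_TYPE_XSKMAP,
	.key_size = sizeof(int),
	.value_size = sizeof(int),
	.max_entries = 64, /* one slot per Rx queue */
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* Redirect to the AF_XDP socket bound to this Rx queue, if any. */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";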

For benchmarking, you can use the xdpsock application from the AF_XDP
patch set without any modifications. Say that you would like your UDP
traffic from port 4242 to end up in queue 16, the queue we will enable
AF_XDP on. Here, we use ethtool for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then be
done using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is 8192 MB, and with 8
of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The NIC is an
Intel I40E 40 Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are without
retpoline so that we can compare against previous numbers.

AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
set are also reported for ease of reference.

Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
rxdrop       2.9*       9.6*       21.5
txpush       2.6*       -          21.6
l2fwd        1.9*       2.5*       15.0

* From AF_XDP V3 patch set and cover letter.

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
rxdrop       2.1        3.3       3.3
l2fwd        1.4        1.8       3.1

So why do we not get higher values for RX, similar to the 34 Mpps we
had in AF_PACKET V4? We ran an experiment with the rxdrop
benchmark without using the xdp_do_redirect/flush infrastructure and
without an XDP program (all traffic on a queue goes to one
socket). Instead, the driver acts directly on the AF_XDP socket. With
this we got 36.9 Mpps, a significant improvement without any change to
the uapi. So not forcing users to have an XDP program if they do not
need one might be a good idea. This measurement is actually higher
than what we got with AF_PACKET V4.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3M       0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3M        0

The structure of the patch set is as follows:

Patch 1: Removes rebind support. Complicated to support for ZC,
         so it will not be supported for AF_XDP in any mode at this
         point. Will be a follow-up patch for the AF_XDP patch set.
Patches 2-4: Plumbing for AF_XDP ZC support
Patches 5-6: AF_XDP ZC for RX
Patches 7-8: AF_XDP ZC for TX
Patch 9: Minor performance fix for the sample application. ZC will
         work with nearly as good performance without this.
Patches 10-12: ZC support for i40e. Should be broken out into smaller
               pieces as pre-patches.

We based this patch set on bpf-next commit f2467c2dbc01
("selftests/bpf: make sure build-id is on")

To do for this RFC to become a patch set:

* Implement dynamic creation and deletion of queues in the i40e driver

* Properly split up the i40e changes

* Have the Intel NIC team review the i40e changes from at least an
  architecture point of view

* Implement a more fair scheduling policy for multiple XSKs that share
  an umem for TX. This can be combined with a batching API for
  xsk_umem_consume_tx.

We are planning on joining the iovisor call on Wednesday if you would
like to have a chat with us about this.

Thanks: Björn and Magnus

Björn Töpel (8):
  xsk: remove rebind support
  xsk: moved struct xdp_umem definition
  xsk: introduce xdp_umem_frame
  net: xdp: added bpf_netdev_command XDP_SETUP_XSK_UMEM
  xdp: add MEM_TYPE_ZERO_COPY
  xsk: add zero-copy support for Rx
  i40e: added queue pair disable/enable functions
  i40e: implement AF_XDP zero-copy support for Rx

Magnus Karlsson (4):
  net: added netdevice operation for Tx
  xsk: wire upp Tx zero-copy functions
  samples/bpf: minor *_nb_free performance fix
  i40e: implement Tx zero-copy

 drivers/net/ethernet/intel/i40e/i40e.h      |  20 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 458 +++++++++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 635 +++++++++++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  36 +-
 include/linux/netdevice.h                   |  13 +
 include/net/xdp.h                           |  10 +
 include/net/xdp_sock.h                      |  45 +-
 net/core/xdp.c                              |  47 +-
 net/xdp/xdp_umem.c                          | 112 ++++-
 net/xdp/xdp_umem.h                          |  42 +-
 net/xdp/xdp_umem_props.h                    |  23 -
 net/xdp/xsk.c                               | 162 +++++--
 net/xdp/xsk_queue.h                         |  35 +-
 samples/bpf/xdpsock_user.c                  |   8 +-
 14 files changed, 1458 insertions(+), 188 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

-- 
2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 01/12] xsk: remove rebind support
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Supporting rebind, i.e. allowing a process to call bind again after a
successful bind without closing the socket, makes the setup state
machine more complex. Let us constrain the state space by not
supporting rebind.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 30 +++++++++---------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 009c5af5bba5..e59ca8e2618d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -236,14 +236,6 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 	return 0;
 }
 
-static void __xsk_release(struct xdp_sock *xs)
-{
-	/* Wait for driver to stop using the xdp socket. */
-	synchronize_net();
-
-	dev_put(xs->dev);
-}
-
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -260,7 +252,9 @@ static int xsk_release(struct socket *sock)
 	local_bh_enable();
 
 	if (xs->dev) {
-		__xsk_release(xs);
+		/* Wait for driver to stop using the xdp socket. */
+		synchronize_net();
+		dev_put(xs->dev);
 		xs->dev = NULL;
 	}
 
@@ -294,9 +288,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 {
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
 	struct sock *sk = sock->sk;
-	struct net_device *dev, *dev_curr;
 	struct xdp_sock *xs = xdp_sk(sk);
-	struct xdp_umem *old_umem = NULL;
+	struct net_device *dev;
 	int err = 0;
 
 	if (addr_len < sizeof(struct sockaddr_xdp))
@@ -305,7 +298,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		return -EINVAL;
 
 	mutex_lock(&xs->mutex);
-	dev_curr = xs->dev;
+	if (xs->dev) {
+		err = -EBUSY;
+		goto out_release;
+	}
+
 	dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex);
 	if (!dev) {
 		err = -ENODEV;
@@ -352,7 +349,6 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		}
 
 		xdp_get_umem(umem_xs->umem);
-		old_umem = xs->umem;
 		xs->umem = umem_xs->umem;
 		sockfd_put(sock);
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
@@ -364,14 +360,6 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
-	/* Rebind? */
-	if (dev_curr && (dev_curr != dev ||
-			 xs->queue_id != sxdp->sxdp_queue_id)) {
-		__xsk_release(xs);
-		if (old_umem)
-			xdp_put_umem(old_umem);
-	}
-
 	xs->dev = dev;
 	xs->queue_id = sxdp->sxdp_queue_id;
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 02/12] xsk: moved struct xdp_umem definition
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Move struct xdp_umem to xdp_sock.h in order to prepare for zero-copy
support.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h   | 27 ++++++++++++++++++++++++++-
 net/xdp/xdp_umem.c       |  1 +
 net/xdp/xdp_umem.h       | 25 +------------------------
 net/xdp/xdp_umem_props.h | 23 -----------------------
 net/xdp/xsk_queue.h      |  3 +--
 5 files changed, 29 insertions(+), 50 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 185f4928fbda..c959aa43fb01 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -15,12 +15,37 @@
 #ifndef _LINUX_XDP_SOCK_H
 #define _LINUX_XDP_SOCK_H
 
+#include <linux/workqueue.h>
+#include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/mm.h>
 #include <net/sock.h>
 
 struct net_device;
 struct xsk_queue;
-struct xdp_umem;
+
+struct xdp_umem_props {
+	u32 frame_size;
+	u32 nframes;
+};
+
+struct xdp_umem {
+	struct xsk_queue *fq;
+	struct xsk_queue *cq;
+	struct page **pgs;
+	struct xdp_umem_props props;
+	u32 npgs;
+	u32 frame_headroom;
+	u32 nfpp_mask;
+	u32 nfpplog2;
+	u32 frame_size_log2;
+	struct user_struct *user;
+	struct pid *pid;
+	unsigned long address;
+	size_t size;
+	atomic_t users;
+	struct work_struct work;
+};
 
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 2b47a1dd7c6c..7cc162799744 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 
 #include "xdp_umem.h"
+#include "xsk_queue.h"
 
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 7e0b2fab8522..32ad59b7322f 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -15,30 +15,7 @@
 #ifndef XDP_UMEM_H_
 #define XDP_UMEM_H_
 
-#include <linux/mm.h>
-#include <linux/if_xdp.h>
-#include <linux/workqueue.h>
-
-#include "xsk_queue.h"
-#include "xdp_umem_props.h"
-
-struct xdp_umem {
-	struct xsk_queue *fq;
-	struct xsk_queue *cq;
-	struct page **pgs;
-	struct xdp_umem_props props;
-	u32 npgs;
-	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
-	struct user_struct *user;
-	struct pid *pid;
-	unsigned long address;
-	size_t size;
-	atomic_t users;
-	struct work_struct work;
-};
+#include <net/xdp_sock.h>
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 {
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
deleted file mode 100644
index 77fb5daf29f3..000000000000
--- a/net/xdp/xdp_umem_props.h
+++ /dev/null
@@ -1,23 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0
- * XDP user-space packet buffer
- * Copyright(c) 2018 Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
- */
-
-#ifndef XDP_UMEM_PROPS_H_
-#define XDP_UMEM_PROPS_H_
-
-struct xdp_umem_props {
-	u32 frame_size;
-	u32 nframes;
-};
-
-#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 7aa9a535db0e..599a8d43c69a 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -17,8 +17,7 @@
 
 #include <linux/types.h>
 #include <linux/if_xdp.h>
-
-#include "xdp_umem_props.h"
+#include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 03/12] xsk: introduce xdp_umem_frame
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

The xdp_umem_frame holds the address for a frame. Trade memory for
faster lookup. Later, we'll add the DMA address here as well.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  8 +++++---
 net/xdp/xdp_umem.c     | 29 +++++++++++++++++++++++++----
 net/xdp/xdp_umem.h     |  9 +--------
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index c959aa43fb01..09068c4f068e 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -29,16 +29,18 @@ struct xdp_umem_props {
 	u32 nframes;
 };
 
+struct xdp_umem_frame {
+	void *addr;
+};
+
 struct xdp_umem {
 	struct xsk_queue *fq;
 	struct xsk_queue *cq;
+	struct xdp_umem_frame *frames;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
 	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
 	struct user_struct *user;
 	struct pid *pid;
 	unsigned long address;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 7cc162799744..b426cbe3151a 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -92,6 +92,9 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->pgs = NULL;
 	}
 
+	kfree(umem->frames);
+	umem->frames = NULL;
+
 	xdp_umem_unaccount_pages(umem);
 out:
 	kfree(umem);
@@ -181,7 +184,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 {
 	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
 	u64 addr = mr->addr, size = mr->len;
-	unsigned int nframes, nfpp;
+	u32 nfpplog2, frame_size_log2;
+	unsigned int nframes, nfpp, i;
 	int size_chk, err;
 
 	if (!umem)
@@ -234,9 +238,6 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->pgs = NULL;
 	umem->user = NULL;
 
-	umem->frame_size_log2 = ilog2(frame_size);
-	umem->nfpp_mask = nfpp - 1;
-	umem->nfpplog2 = ilog2(nfpp);
 	atomic_set(&umem->users, 1);
 
 	err = xdp_umem_account_pages(umem);
@@ -246,6 +247,26 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	err = xdp_umem_pin_pages(umem);
 	if (err)
 		goto out_account;
+
+	umem->frames = kcalloc(nframes, sizeof(*umem->frames), GFP_KERNEL);
+	if (!umem->frames) {
+		err = -ENOMEM;
+		goto out_account;
+	}
+
+	frame_size_log2 = ilog2(frame_size);
+	nfpplog2 = ilog2(nfpp);
+	for (i = 0; i < nframes; i++) {
+		u64 pg, off;
+		char *data;
+
+		pg = i >> nfpplog2;
+		off = (i & (nfpp - 1)) << frame_size_log2;
+
+		data = page_address(umem->pgs[pg]);
+		umem->frames[i].addr = data + off;
+	}
+
 	return 0;
 
 out_account:
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 32ad59b7322f..0a969384af93 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -19,14 +19,7 @@
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 {
-	u64 pg, off;
-	char *data;
-
-	pg = idx >> umem->nfpplog2;
-	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
-
-	data = page_address(umem->pgs[pg]);
-	return data + off;
+	return umem->frames[idx].addr;
 }
 
 static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 04/12] net: xdp: added bpf_netdev_command XDP_SETUP_XSK_UMEM
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Extend ndo_bpf with a new command used for registering a UMEM to a
queue_id of a netdev.
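
To illustrate how a driver would consume the new command, here is a
rough sketch of an ndo_bpf dispatcher; the my_drv_* helpers are
hypothetical and not part of this patch:

/* Sketch only: my_drv_xsk_umem_setup() enables (umem != NULL) or
 * disables (umem == NULL) zero-copy on the given queue.
 */
static int my_drv_ndo_bpf(struct net_device *dev, struct netdev_bpf *bpf)
{
	switch (bpf->command) {
	case XDP_SETUP_PROG:
		return my_drv_xdp_setup(dev, bpf->prog);
	case XDP_SETUP_XSK_UMEM:
		return my_drv_xsk_umem_setup(dev, bpf->xsk.umem,
					     bpf->xsk.queue_id);
	default:
		return -EINVAL;
	}
}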

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/netdevice.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492c4e14..2084536ad4af 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -817,10 +817,12 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_DESTROY,
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
+	XDP_SETUP_XSK_UMEM,
 };
 
 struct bpf_prog_offload_ops;
 struct netlink_ext_ack;
+struct xdp_umem;
 
 struct netdev_bpf {
 	enum bpf_netdev_command command;
@@ -851,6 +853,11 @@ struct netdev_bpf {
 		struct {
 			struct bpf_offloaded_map *offmap;
 		};
+		/* XDP_SETUP_XSK_UMEM */
+		struct {
+			struct xdp_umem *umem;
+			u16 queue_id;
+		} xsk;
 	};
 };
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Here, a new type of allocator support is added to the XDP return
API. A zero-copy allocated xdp_buff cannot be converted to an
xdp_frame. Instead, the buff has to be copied. This is not supported
at all in this commit.

Also, an opaque "handle" is added to xdp_buff. This can be used as a
context for the zero-copy allocator implementation.
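
As an illustration (a sketch under assumed driver structures, not part
of this patch), a driver providing zero-copy buffers would embed a
zero_copy_allocator in its Rx ring, implement the free callback, and
register it for the Rx queue; my_drv_rx_ring and my_drv_recycle_frame()
are hypothetical:

struct my_drv_rx_ring {
	struct zero_copy_allocator zca;
	struct xdp_rxq_info xdp_rxq;
	/* ... driver ring state ... */
};

static void my_drv_zca_free(struct zero_copy_allocator *alloc,
			    unsigned long handle)
{
	struct my_drv_rx_ring *ring =
		container_of(alloc, struct my_drv_rx_ring, zca);

	/* Recycle the umem frame identified by 'handle' back to HW. */
	my_drv_recycle_frame(ring, handle);
}

static int my_drv_setup_zca(struct my_drv_rx_ring *ring)
{
	ring->zca.free = my_drv_zca_free;
	return xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
					  MEM_TYPE_ZERO_COPY,
					  &ring->zca);
}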

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h | 10 ++++++++++
 net/core/xdp.c    | 47 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0b689cf561c7..e9eee37cddd6 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -37,6 +37,7 @@ enum xdp_mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
 	MEM_TYPE_PAGE_POOL,
+	MEM_TYPE_ZERO_COPY,
 	MEM_TYPE_MAX,
 };
 
@@ -47,6 +48,10 @@ struct xdp_mem_info {
 
 struct page_pool;
 
+struct zero_copy_allocator {
+	void (*free)(struct zero_copy_allocator *, unsigned long);
+};
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
@@ -59,6 +64,7 @@ struct xdp_buff {
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
+	unsigned long handle;
 	struct xdp_rxq_info *rxq;
 };
 
@@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	int metasize;
 	int headroom;
 
+	// XXX implement clone, copy, use "native" MEM_TYPE
+	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
+		return NULL;
+
 	/* Assure headroom is available for storing info */
 	headroom = xdp->data - xdp->data_hard_start;
 	metasize = xdp->data - xdp->data_meta;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index bf6758f74339..4e11895b8cd9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -31,6 +31,7 @@ struct xdp_mem_allocator {
 	union {
 		void *allocator;
 		struct page_pool *page_pool;
+		struct zero_copy_allocator *zc_alloc;
 	};
 	struct rhash_head node;
 	struct rcu_head rcu;
@@ -261,7 +262,7 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 	xdp_rxq->mem.type = type;
 
 	if (!allocator) {
-		if (type == MEM_TYPE_PAGE_POOL)
+		if (type == MEM_TYPE_PAGE_POOL || type == MEM_TYPE_ZERO_COPY)
 			return -EINVAL; /* Setup time check page_pool req */
 		return 0;
 	}
@@ -308,9 +309,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-static void xdp_return(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
+	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {
@@ -336,16 +339,46 @@ static void xdp_return(void *data, struct xdp_mem_info *mem)
 		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
 		break;
 	}
-}
 
-void xdp_return_frame(struct xdp_frame *xdpf)
-{
-	xdp_return(xdpf->data, &xdpf->mem);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
 void xdp_return_buff(struct xdp_buff *xdp)
 {
-	xdp_return(xdp->data, &xdp->rxq->mem);
+	struct xdp_mem_info *mem = &xdp->rxq->mem;
+	struct xdp_mem_allocator *xa;
+	void *data = xdp->data;
+	struct page *page;
+
+	switch (mem->type) {
+	case MEM_TYPE_ZERO_COPY:
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		xa->zc_alloc->free(xa->zc_alloc, xdp->handle);
+		rcu_read_unlock();
+		break;
+	case MEM_TYPE_PAGE_POOL:
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		page = virt_to_head_page(data);
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		rcu_read_unlock();
+		break;
+	case MEM_TYPE_PAGE_SHARED:
+		page_frag_free(data);
+		break;
+	case MEM_TYPE_PAGE_ORDER0:
+		page = virt_to_page(data); /* Assumes order0 page*/
+		put_page(page);
+		break;
+	default:
+		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
+		break;
+	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_buff);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 06/12] xsk: add zero-copy support for Rx
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Extend xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and
wire up the ndo_bpf call in bind.
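
For context, a zero-copy capable driver pulls frame ids from the umem
fill queue with the two new helpers when refilling its Rx ring. A
rough sketch follows; struct my_drv_rx_ring and my_drv_post_rx_buffer()
are hypothetical stand-ins for driver state and descriptor handling:

/* Sketch only: peek an id from the fill queue, hand the corresponding
 * frame to hardware, then consume the id.
 */
static bool my_drv_alloc_rx_buffer(struct my_drv_rx_ring *ring)
{
	u32 *id;

	id = xsk_umem_peek_id(ring->umem);
	if (!id)
		return false; /* fill queue empty, retry later */

	my_drv_post_rx_buffer(ring, *id); /* program HW descriptor */

	xsk_umem_discard_id(ring->umem);
	return true;
}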

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  7 +++++
 net/xdp/xdp_umem.c     | 60 +++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h     |  3 +++
 net/xdp/xsk.c          | 69 ++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 09068c4f068e..644684eb2caf 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -31,6 +31,7 @@ struct xdp_umem_props {
 
 struct xdp_umem_frame {
 	void *addr;
+	dma_addr_t dma;
 };
 
 struct xdp_umem {
@@ -47,6 +48,8 @@ struct xdp_umem {
 	size_t size;
 	atomic_t users;
 	struct work_struct work;
+	struct net_device *dev;
+	u16 queue_id;
 };
 
 struct xdp_sock {
@@ -69,6 +72,10 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+
+u32 *xsk_umem_peek_id(struct xdp_umem *umem);
+void xsk_umem_discard_id(struct xdp_umem *umem);
+
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b426cbe3151a..f70cdaa2ef4d 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -26,6 +26,64 @@
 
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u16 queue_id)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		if (dev != umem->dev || queue_id != umem->queue_id)
+			return -EBUSY;
+		return 0;
+	}
+
+	dev_hold(dev);
+	if (dev->netdev_ops->ndo_bpf) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = umem;
+		bpf.xsk.queue_id = queue_id;
+
+		rtnl_lock();
+		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+		rtnl_unlock();
+
+		if (err) {
+			dev_put(dev);
+			return 0;
+		}
+
+		umem->dev = dev;
+		umem->queue_id = queue_id;
+		return 0;
+	}
+
+	dev_put(dev);
+	return 0;
+}
+
+void xdp_umem_clear_dev(struct xdp_umem *umem)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = NULL;
+		bpf.xsk.queue_id = umem->queue_id;
+
+		rtnl_lock();
+		err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
+		rtnl_unlock();
+
+		if (err)
+			WARN(1, "failed to disable umem!\n");
+
+		dev_put(umem->dev);
+		umem->dev = NULL;
+	}
+}
+
 int xdp_umem_create(struct xdp_umem **umem)
 {
 	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
@@ -66,6 +124,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct task_struct *task;
 	struct mm_struct *mm;
 
+	xdp_umem_clear_dev(umem);
+
 	if (umem->fq) {
 		xskq_destroy(umem->fq);
 		umem->fq = NULL;
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 0a969384af93..3bb96d156b40 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -34,4 +34,7 @@ void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
 int xdp_umem_create(struct xdp_umem **umem);
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u16 queue_id);
+
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e59ca8e2618d..a0cf9c042ed2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -43,6 +43,18 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+u32 *xsk_umem_peek_id(struct xdp_umem *umem)
+{
+	return xskq_peek_id(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_peek_id);
+
+void xsk_umem_discard_id(struct xdp_umem *umem)
+{
+	xskq_discard_id(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_discard_id);
+
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
 	return !!xs->rx;
@@ -50,40 +62,54 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 *id, len = xdp->data_end - xdp->data;
+	u32 *id, len;
 	void *buffer;
 	int err = 0;
 
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-		return -EINVAL;
-
 	id = xskq_peek_id(xs->umem->fq);
 	if (!id)
 		return -ENOSPC;
 
 	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	len = xdp->data_end - xdp->data;
 	memcpy(buffer, xdp->data, len);
 	err = xskq_produce_batch_desc(xs->rx, *id, len,
 				      xs->umem->frame_headroom);
-	if (!err)
+	if (!err) {
 		xskq_discard_id(xs->umem->fq);
+		xdp_return_buff(xdp);
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
-int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	u16 off = xdp->data - xdp->data_hard_start;
+	u32 len = xdp->data_end - xdp->data;
 	int err;
 
-	err = __xsk_rcv(xs, xdp);
-	if (likely(!err))
+	err = xskq_produce_batch_desc(xs->rx, (u32)xdp->handle, len,
+				      xs->umem->frame_headroom + off);
+	if (err) {
 		xdp_return_buff(xdp);
-	else
 		xs->rx_dropped++;
+	}
 
 	return err;
 }
 
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
+		__xsk_rcv_zc(xs, xdp) : __xsk_rcv(xs, xdp);
+}
+
 void xsk_flush(struct xdp_sock *xs)
 {
 	xskq_produce_flush_desc(xs->rx);
@@ -92,14 +118,26 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	int err;
+	u32 *id, len;
+	void *buffer;
+	int err = 0;
 
-	err = __xsk_rcv(xs, xdp);
-	if (!err)
+	id = xskq_peek_id(xs->umem->fq);
+	if (!id)
+		return -ENOSPC;
+
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	len = xdp->data_end - xdp->data;
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, *id, len,
+				      xs->umem->frame_headroom);
+	if (!err) {
+		xskq_discard_id(xs->umem->fq);
 		xsk_flush(xs);
-	else
-		xs->rx_dropped++;
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
@@ -362,6 +400,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 
 	xs->dev = dev;
 	xs->queue_id = sxdp->sxdp_queue_id;
+	err = xdp_umem_assign_dev(xs->umem, dev, xs->queue_id);
+	if (err)
+		goto out_unlock;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Intel-wired-lan] [RFC PATCH bpf-next 06/12] xsk: add zero-copy support for Rx
@ 2018-05-15 19:06   ` =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
  0 siblings, 0 replies; 54+ messages in thread
From: =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Bj?rn T?pel <bjorn.topel@intel.com>

Extend xsk_rcv() to support the new MEM_TYPE_ZERO_COPY memory type, and
wire up the ndo_bpf call in bind.
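
For orientation, a driver is expected to pick up the new XDP_SETUP_XSK_UMEM
command in its ndo_bpf roughly as in the hedged sketch below; the foo_*
names are placeholders and not part of this patch:

static int foo_bpf(struct net_device *dev, struct netdev_bpf *bpf)
{
	struct foo_priv *priv = netdev_priv(dev);	/* hypothetical */

	switch (bpf->command) {
	case XDP_SETUP_XSK_UMEM:
		/* A NULL umem tears zero-copy down for that queue */
		if (!bpf->xsk.umem)
			return foo_xsk_umem_disable(priv, bpf->xsk.queue_id);
		return foo_xsk_umem_enable(priv, bpf->xsk.umem,
					   bpf->xsk.queue_id);
	default:
		return -EINVAL;
	}
}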

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  7 +++++
 net/xdp/xdp_umem.c     | 60 +++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h     |  3 +++
 net/xdp/xsk.c          | 69 ++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 09068c4f068e..644684eb2caf 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -31,6 +31,7 @@ struct xdp_umem_props {
 
 struct xdp_umem_frame {
 	void *addr;
+	dma_addr_t dma;
 };
 
 struct xdp_umem {
@@ -47,6 +48,8 @@ struct xdp_umem {
 	size_t size;
 	atomic_t users;
 	struct work_struct work;
+	struct net_device *dev;
+	u16 queue_id;
 };
 
 struct xdp_sock {
@@ -69,6 +72,10 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+
+u32 *xsk_umem_peek_id(struct xdp_umem *umem);
+void xsk_umem_discard_id(struct xdp_umem *umem);
+
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b426cbe3151a..f70cdaa2ef4d 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -26,6 +26,64 @@
 
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u16 queue_id)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		if (dev != umem->dev || queue_id != umem->queue_id)
+			return -EBUSY;
+		return 0;
+	}
+
+	dev_hold(dev);
+	if (dev->netdev_ops->ndo_bpf) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = umem;
+		bpf.xsk.queue_id = queue_id;
+
+		rtnl_lock();
+		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+		rtnl_unlock();
+
+		if (err) {
+			dev_put(dev);
+			return 0;
+		}
+
+		umem->dev = dev;
+		umem->queue_id = queue_id;
+		return 0;
+	}
+
+	dev_put(dev);
+	return 0;
+}
+
+void xdp_umem_clear_dev(struct xdp_umem *umem)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = NULL;
+		bpf.xsk.queue_id = umem->queue_id;
+
+		rtnl_lock();
+		err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
+		rtnl_unlock();
+
+		if (err)
+			WARN(1, "failed to disable umem!\n");
+
+		dev_put(umem->dev);
+		umem->dev = NULL;
+	}
+}
+
 int xdp_umem_create(struct xdp_umem **umem)
 {
 	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
@@ -66,6 +124,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct task_struct *task;
 	struct mm_struct *mm;
 
+	xdp_umem_clear_dev(umem);
+
 	if (umem->fq) {
 		xskq_destroy(umem->fq);
 		umem->fq = NULL;
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 0a969384af93..3bb96d156b40 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -34,4 +34,7 @@ void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
 int xdp_umem_create(struct xdp_umem **umem);
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u16 queue_id);
+
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e59ca8e2618d..a0cf9c042ed2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -43,6 +43,18 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+u32 *xsk_umem_peek_id(struct xdp_umem *umem)
+{
+	return xskq_peek_id(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_peek_id);
+
+void xsk_umem_discard_id(struct xdp_umem *umem)
+{
+	xskq_discard_id(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_discard_id);
+
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
 	return !!xs->rx;
@@ -50,40 +62,54 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 *id, len = xdp->data_end - xdp->data;
+	u32 *id, len;
 	void *buffer;
 	int err = 0;
 
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-		return -EINVAL;
-
 	id = xskq_peek_id(xs->umem->fq);
 	if (!id)
 		return -ENOSPC;
 
 	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	len = xdp->data_end - xdp->data;
 	memcpy(buffer, xdp->data, len);
 	err = xskq_produce_batch_desc(xs->rx, *id, len,
 				      xs->umem->frame_headroom);
-	if (!err)
+	if (!err) {
 		xskq_discard_id(xs->umem->fq);
+		xdp_return_buff(xdp);
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
-int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	u16 off = xdp->data - xdp->data_hard_start;
+	u32 len = xdp->data_end - xdp->data;
 	int err;
 
-	err = __xsk_rcv(xs, xdp);
-	if (likely(!err))
+	err = xskq_produce_batch_desc(xs->rx, (u32)xdp->handle, len,
+				      xs->umem->frame_headroom + off);
+	if (err) {
 		xdp_return_buff(xdp);
-	else
 		xs->rx_dropped++;
+	}
 
 	return err;
 }
 
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
+		__xsk_rcv_zc(xs, xdp) : __xsk_rcv(xs, xdp);
+}
+
 void xsk_flush(struct xdp_sock *xs)
 {
 	xskq_produce_flush_desc(xs->rx);
@@ -92,14 +118,26 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	int err;
+	u32 *id, len;
+	void *buffer;
+	int err = 0;
 
-	err = __xsk_rcv(xs, xdp);
-	if (!err)
+	id = xskq_peek_id(xs->umem->fq);
+	if (!id)
+		return -ENOSPC;
+
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	len = xdp->data_end - xdp->data;
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, *id, len,
+				      xs->umem->frame_headroom);
+	if (!err) {
+		xskq_discard_id(xs->umem->fq);
 		xsk_flush(xs);
-	else
-		xs->rx_dropped++;
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
@@ -362,6 +400,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 
 	xs->dev = dev;
 	xs->queue_id = sxdp->sxdp_queue_id;
+	err = xdp_umem_assign_dev(xs->umem, dev, xs->queue_id);
+	if (err)
+		goto out_unlock;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
-- 
2.14.1


* [RFC PATCH bpf-next 07/12] net: added netdevice operation for Tx
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start pulling
userland Tx frames from a NAPI context.
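
As a hedged sketch (the foo_* names and per-queue vector layout are
assumptions, not part of this patch), an implementation can be as small as
scheduling the right NAPI context:

static int foo_xsk_async_xmit(struct net_device *dev, u32 queue_id)
{
	struct foo_priv *priv = netdev_priv(dev);	/* hypothetical */

	if (queue_id >= priv->num_tx_queues)
		return -ENXIO;

	/* Kick the queue's NAPI context; descriptors queued by user
	 * space are then consumed from the driver's poll routine.
	 */
	napi_schedule(&priv->vectors[queue_id].napi);
	return 0;
}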

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2084536ad4af..8f4292dc6670 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1385,6 +1385,12 @@ struct net_device_ops {
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
 						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
+	/* AF_XDP Tx function. NB! In the PoC we take ownership
+	 * of the XDP Tx rings, so you won't be able to XDP_REDIRECT
+	 * there...
+	 */
+	int			(*ndo_xsk_async_xmit)(struct net_device *dev,
+						      u32 queue_id);
 };
 
 /**
-- 
2.14.1

* [Intel-wired-lan] [RFC PATCH bpf-next 07/12] net: added netdevice operation for Tx
@ 2018-05-15 19:06   ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start pulling
userland Tx frames from a NAPI context.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2084536ad4af..8f4292dc6670 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1385,6 +1385,12 @@ struct net_device_ops {
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
 						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
+	/* AF_XDP Tx function. NB! In the PoC we take ownership
+	 * of the XDP Tx rings, so you won't be able to XDP_REDIRECT
+	 * there...
+	 */
+	int			(*ndo_xsk_async_xmit)(struct net_device *dev,
+						      u32 queue_id);
 };
 
 /**
-- 
2.14.1


* [RFC PATCH bpf-next 08/12] xsk: wire up Tx zero-copy functions
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here we add the functionality required to support zero-copy Tx, and
also expose various zero-copy related functions for the netdevs.
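
A hedged sketch of how a driver's zero-copy Tx path might use the new
helpers; the foo_* ring helpers are placeholders, and completion reporting
belongs in the driver's Tx clean routine:

static void foo_xmit_zc(struct foo_ring *xdp_ring, unsigned int budget)
{
	dma_addr_t dma;
	u16 offset;
	u32 len;

	while (budget-- &&
	       xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len, &offset)) {
		/* One HW Tx descriptor per consumed UMEM frame. */
		foo_post_tx_desc(xdp_ring, dma + offset, len);
	}

	foo_bump_tail(xdp_ring);

	/* Later, when the HW has completed nb_done frames (in the Tx
	 * clean path), they are reported back on the completion ring:
	 *
	 *	xsk_umem_complete_tx(xdp_ring->xsk_umem, nb_done);
	 */
}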

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h | 11 +++++++-
 net/xdp/xdp_umem.c     | 66 ++++++++++++++++++++++++++++++-----------------
 net/xdp/xdp_umem.h     |  9 +++++--
 net/xdp/xsk.c          | 69 ++++++++++++++++++++++++++++++++++++++++----------
 net/xdp/xsk_queue.h    | 32 ++++++++++++++++++++++-
 5 files changed, 146 insertions(+), 41 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 644684eb2caf..6d89fe84674e 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -18,6 +18,7 @@
 #include <linux/workqueue.h>
 #include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <net/sock.h>
 
@@ -49,6 +50,9 @@ struct xdp_umem {
 	atomic_t users;
 	struct work_struct work;
 	struct net_device *dev;
+	bool zc;
+	spinlock_t xsk_list_lock;
+	struct list_head xsk_list;
 	u16 queue_id;
 };
 
@@ -61,6 +65,8 @@ struct xdp_sock {
 	struct list_head flush_node;
 	u16 queue_id;
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
+	struct list_head list;
+	bool zc;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	u64 rx_dropped;
@@ -73,9 +79,12 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 
+/* Used from netdev driver */
 u32 *xsk_umem_peek_id(struct xdp_umem *umem);
 void xsk_umem_discard_id(struct xdp_umem *umem);
-
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+			 u32 *len, u16 *offset);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index f70cdaa2ef4d..b904786ac836 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -27,42 +27,49 @@
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
-			u16 queue_id)
+			u16 queue_id, struct list_head *list_entry)
 {
 	struct netdev_bpf bpf;
+	unsigned long flags;
 	int err;
 
 	if (umem->dev) {
 		if (dev != umem->dev || queue_id != umem->queue_id)
 			return -EBUSY;
-		return 0;
-	}
-
-	dev_hold(dev);
-	if (dev->netdev_ops->ndo_bpf) {
-		bpf.command = XDP_SETUP_XSK_UMEM;
-		bpf.xsk.umem = umem;
-		bpf.xsk.queue_id = queue_id;
-
-		rtnl_lock();
-		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
-		rtnl_unlock();
-
-		if (err) {
+	} else {
+		dev_hold(dev);
+
+		if (dev->netdev_ops->ndo_bpf) {
+			bpf.command = XDP_SETUP_XSK_UMEM;
+			bpf.xsk.umem = umem;
+			bpf.xsk.queue_id = queue_id;
+
+			rtnl_lock();
+			err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+			rtnl_unlock();
+
+			if (err) {
+				dev_put(dev);
+				goto fallback;
+			}
+
+			umem->dev = dev;
+			umem->queue_id = queue_id;
+			umem->zc = true;
+		} else {
 			dev_put(dev);
-			return 0;
 		}
-
-		umem->dev = dev;
-		umem->queue_id = queue_id;
-		return 0;
 	}
 
-	dev_put(dev);
+fallback:
+	spin_lock_irqsave(&umem->xsk_list_lock, flags);
+	list_add_rcu(list_entry, &umem->xsk_list);
+	spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
 	return 0;
 }
 
-void xdp_umem_clear_dev(struct xdp_umem *umem)
+static void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
 	struct netdev_bpf bpf;
 	int err;
@@ -172,11 +179,22 @@ void xdp_get_umem(struct xdp_umem *umem)
 	atomic_inc(&umem->users);
 }
 
-void xdp_put_umem(struct xdp_umem *umem)
+void xdp_put_umem(struct xdp_umem *umem, struct xdp_sock *xs)
 {
+	unsigned long flags;
+
 	if (!umem)
 		return;
 
+	if (xs->dev) {
+		spin_lock_irqsave(&umem->xsk_list_lock, flags);
+		list_del_rcu(&xs->list);
+		spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
+		if (umem->zc)
+			synchronize_net();
+	}
+
 	if (atomic_dec_and_test(&umem->users)) {
 		INIT_WORK(&umem->work, xdp_umem_release_deferred);
 		schedule_work(&umem->work);
@@ -297,6 +315,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
+	INIT_LIST_HEAD(&umem->xsk_list);
+	spin_lock_init(&umem->xsk_list_lock);
 
 	atomic_set(&umem->users, 1);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 3bb96d156b40..5687748a9be3 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -22,6 +22,11 @@ static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 	return umem->frames[idx].addr;
 }
 
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u32 idx)
+{
+	return umem->frames[idx].dma;
+}
+
 static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
 						    u32 idx)
 {
@@ -31,10 +36,10 @@ static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
-void xdp_put_umem(struct xdp_umem *umem);
+void xdp_put_umem(struct xdp_umem *umem, struct xdp_sock *xs);
 int xdp_umem_create(struct xdp_umem **umem);
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
-			u16 queue_id);
+			u16 queue_id, struct list_head *list_entry);
 
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a0cf9c042ed2..ac979026671f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -30,6 +30,7 @@
 #include <linux/uaccess.h>
 #include <linux/net.h>
 #include <linux/netdevice.h>
+#include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -141,6 +142,49 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+	xskq_produce_flush_id_n(umem->cq, nb_entries);
+}
+EXPORT_SYMBOL(xsk_umem_complete_tx);
+
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+			 u32 *len, u16 *offset)
+{
+	struct xdp_desc desc;
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		if (!xskq_peek_desc(xs->tx, &desc))
+			continue;
+
+		if (xskq_produce_id_lazy(umem->cq, desc.idx))
+			goto out;
+
+		*dma = xdp_umem_get_dma(umem, desc.idx);
+		*len = desc.len;
+		*offset = desc.offset;
+
+		xskq_discard_desc(xs->tx);
+		rcu_read_unlock();
+		return true;
+	}
+
+out:
+	rcu_read_unlock();
+	return false;
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx);
+
+static int xsk_zc_xmit(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev = xs->dev;
+
+	return dev->netdev_ops->ndo_xsk_async_xmit(dev, xs->queue_id);
+}
+
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
@@ -154,7 +198,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			    size_t total_len)
 {
-	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	u32 max_batch = TX_BATCH_SIZE;
 	struct xdp_sock *xs = xdp_sk(sk);
 	bool sent_frame = false;
@@ -164,8 +207,6 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 	if (unlikely(!xs->tx))
 		return -ENOBUFS;
-	if (need_wait)
-		return -EOPNOTSUPP;
 
 	mutex_lock(&xs->mutex);
 
@@ -184,12 +225,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		}
 
 		len = desc.len;
-		if (unlikely(len > xs->dev->mtu)) {
-			err = -EMSGSIZE;
-			goto out;
-		}
-
-		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		skb = sock_alloc_send_skb(sk, len, 1, &err);
 		if (unlikely(!skb)) {
 			err = -EAGAIN;
 			goto out;
@@ -232,6 +268,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
@@ -239,8 +276,10 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
+	if (need_wait)
+		return -EOPNOTSUPP;
 
-	return xsk_generic_xmit(sk, m, total_len);
+	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
 }
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
@@ -398,12 +437,14 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
-	xs->dev = dev;
-	xs->queue_id = sxdp->sxdp_queue_id;
-	err = xdp_umem_assign_dev(xs->umem, dev, xs->queue_id);
+	err = xdp_umem_assign_dev(xs->umem, dev, sxdp->sxdp_queue_id,
+				  &xs->list);
 	if (err)
 		goto out_unlock;
 
+	xs->dev = dev;
+	xs->zc = xs->umem->zc;
+	xs->queue_id = sxdp->sxdp_queue_id;
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
 
@@ -612,7 +653,7 @@ static void xsk_destruct(struct sock *sk)
 
 	xskq_destroy(xs->rx);
 	xskq_destroy(xs->tx);
-	xdp_put_umem(xs->umem);
+	xdp_put_umem(xs->umem, xs);
 
 	sk_refcnt_debug_dec(sk);
 }
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 599a8d43c69a..5533bf32a254 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -17,9 +17,11 @@
 
 #include <linux/types.h>
 #include <linux/if_xdp.h>
+#include <linux/cache.h>
 #include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
+#define LAZY_UPDATE_THRESHOLD 128
 
 struct xsk_queue {
 	struct xdp_umem_props umem_props;
@@ -53,9 +55,14 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 	return (entries > dcnt) ? dcnt : entries;
 }
 
+static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
+{
+	return q->nentries - (producer - q->cons_tail);
+}
+
 static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 {
-	u32 free_entries = q->nentries - (producer - q->cons_tail);
+	u32 free_entries = xskq_nb_free_lazy(q, producer);
 
 	if (free_entries >= dcnt)
 		return free_entries;
@@ -119,6 +126,9 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
+	if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
 	ring->desc[q->prod_tail++ & q->ring_mask] = id;
 
 	/* Order producer and data */
@@ -128,6 +138,26 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 	return 0;
 }
 
+static inline int xskq_produce_id_lazy(struct xsk_queue *q, u32 id)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
+	ring->desc[q->prod_head++ & q->ring_mask] = id;
+	return 0;
+}
+
+static inline void xskq_produce_flush_id_n(struct xsk_queue *q, u32 nb_entries)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail += nb_entries;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
 static inline int xskq_reserve_id(struct xsk_queue *q)
 {
 	if (xskq_nb_free(q, q->prod_head, 1) == 0)
-- 
2.14.1

* [Intel-wired-lan] [RFC PATCH bpf-next 08/12] xsk: wire up Tx zero-copy functions
@ 2018-05-15 19:06   ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here we add the functionality required to support zero-copy Tx, and
also expose various zero-copy related functions for the netdevs.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h | 11 +++++++-
 net/xdp/xdp_umem.c     | 66 ++++++++++++++++++++++++++++++-----------------
 net/xdp/xdp_umem.h     |  9 +++++--
 net/xdp/xsk.c          | 69 ++++++++++++++++++++++++++++++++++++++++----------
 net/xdp/xsk_queue.h    | 32 ++++++++++++++++++++++-
 5 files changed, 146 insertions(+), 41 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 644684eb2caf..6d89fe84674e 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -18,6 +18,7 @@
 #include <linux/workqueue.h>
 #include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <net/sock.h>
 
@@ -49,6 +50,9 @@ struct xdp_umem {
 	atomic_t users;
 	struct work_struct work;
 	struct net_device *dev;
+	bool zc;
+	spinlock_t xsk_list_lock;
+	struct list_head xsk_list;
 	u16 queue_id;
 };
 
@@ -61,6 +65,8 @@ struct xdp_sock {
 	struct list_head flush_node;
 	u16 queue_id;
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
+	struct list_head list;
+	bool zc;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	u64 rx_dropped;
@@ -73,9 +79,12 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 
+/* Used from netdev driver */
 u32 *xsk_umem_peek_id(struct xdp_umem *umem);
 void xsk_umem_discard_id(struct xdp_umem *umem);
-
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+			 u32 *len, u16 *offset);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index f70cdaa2ef4d..b904786ac836 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -27,42 +27,49 @@
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
-			u16 queue_id)
+			u16 queue_id, struct list_head *list_entry)
 {
 	struct netdev_bpf bpf;
+	unsigned long flags;
 	int err;
 
 	if (umem->dev) {
 		if (dev != umem->dev || queue_id != umem->queue_id)
 			return -EBUSY;
-		return 0;
-	}
-
-	dev_hold(dev);
-	if (dev->netdev_ops->ndo_bpf) {
-		bpf.command = XDP_SETUP_XSK_UMEM;
-		bpf.xsk.umem = umem;
-		bpf.xsk.queue_id = queue_id;
-
-		rtnl_lock();
-		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
-		rtnl_unlock();
-
-		if (err) {
+	} else {
+		dev_hold(dev);
+
+		if (dev->netdev_ops->ndo_bpf) {
+			bpf.command = XDP_SETUP_XSK_UMEM;
+			bpf.xsk.umem = umem;
+			bpf.xsk.queue_id = queue_id;
+
+			rtnl_lock();
+			err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+			rtnl_unlock();
+
+			if (err) {
+				dev_put(dev);
+				goto fallback;
+			}
+
+			umem->dev = dev;
+			umem->queue_id = queue_id;
+			umem->zc = true;
+		} else {
 			dev_put(dev);
-			return 0;
 		}
-
-		umem->dev = dev;
-		umem->queue_id = queue_id;
-		return 0;
 	}
 
-	dev_put(dev);
+fallback:
+	spin_lock_irqsave(&umem->xsk_list_lock, flags);
+	list_add_rcu(list_entry, &umem->xsk_list);
+	spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
 	return 0;
 }
 
-void xdp_umem_clear_dev(struct xdp_umem *umem)
+static void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
 	struct netdev_bpf bpf;
 	int err;
@@ -172,11 +179,22 @@ void xdp_get_umem(struct xdp_umem *umem)
 	atomic_inc(&umem->users);
 }
 
-void xdp_put_umem(struct xdp_umem *umem)
+void xdp_put_umem(struct xdp_umem *umem, struct xdp_sock *xs)
 {
+	unsigned long flags;
+
 	if (!umem)
 		return;
 
+	if (xs->dev) {
+		spin_lock_irqsave(&umem->xsk_list_lock, flags);
+		list_del_rcu(&xs->list);
+		spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
+		if (umem->zc)
+			synchronize_net();
+	}
+
 	if (atomic_dec_and_test(&umem->users)) {
 		INIT_WORK(&umem->work, xdp_umem_release_deferred);
 		schedule_work(&umem->work);
@@ -297,6 +315,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
+	INIT_LIST_HEAD(&umem->xsk_list);
+	spin_lock_init(&umem->xsk_list_lock);
 
 	atomic_set(&umem->users, 1);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 3bb96d156b40..5687748a9be3 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -22,6 +22,11 @@ static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 	return umem->frames[idx].addr;
 }
 
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u32 idx)
+{
+	return umem->frames[idx].dma;
+}
+
 static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
 						    u32 idx)
 {
@@ -31,10 +36,10 @@ static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
-void xdp_put_umem(struct xdp_umem *umem);
+void xdp_put_umem(struct xdp_umem *umem, struct xdp_sock *xs);
 int xdp_umem_create(struct xdp_umem **umem);
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
-			u16 queue_id);
+			u16 queue_id, struct list_head *list_entry);
 
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a0cf9c042ed2..ac979026671f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -30,6 +30,7 @@
 #include <linux/uaccess.h>
 #include <linux/net.h>
 #include <linux/netdevice.h>
+#include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -141,6 +142,49 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+	xskq_produce_flush_id_n(umem->cq, nb_entries);
+}
+EXPORT_SYMBOL(xsk_umem_complete_tx);
+
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+			 u32 *len, u16 *offset)
+{
+	struct xdp_desc desc;
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		if (!xskq_peek_desc(xs->tx, &desc))
+			continue;
+
+		if (xskq_produce_id_lazy(umem->cq, desc.idx))
+			goto out;
+
+		*dma = xdp_umem_get_dma(umem, desc.idx);
+		*len = desc.len;
+		*offset = desc.offset;
+
+		xskq_discard_desc(xs->tx);
+		rcu_read_unlock();
+		return true;
+	}
+
+out:
+	rcu_read_unlock();
+	return false;
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx);
+
+static int xsk_zc_xmit(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev = xs->dev;
+
+	return dev->netdev_ops->ndo_xsk_async_xmit(dev, xs->queue_id);
+}
+
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
@@ -154,7 +198,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			    size_t total_len)
 {
-	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	u32 max_batch = TX_BATCH_SIZE;
 	struct xdp_sock *xs = xdp_sk(sk);
 	bool sent_frame = false;
@@ -164,8 +207,6 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 	if (unlikely(!xs->tx))
 		return -ENOBUFS;
-	if (need_wait)
-		return -EOPNOTSUPP;
 
 	mutex_lock(&xs->mutex);
 
@@ -184,12 +225,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		}
 
 		len = desc.len;
-		if (unlikely(len > xs->dev->mtu)) {
-			err = -EMSGSIZE;
-			goto out;
-		}
-
-		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		skb = sock_alloc_send_skb(sk, len, 1, &err);
 		if (unlikely(!skb)) {
 			err = -EAGAIN;
 			goto out;
@@ -232,6 +268,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
@@ -239,8 +276,10 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
+	if (need_wait)
+		return -EOPNOTSUPP;
 
-	return xsk_generic_xmit(sk, m, total_len);
+	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
 }
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
@@ -398,12 +437,14 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
-	xs->dev = dev;
-	xs->queue_id = sxdp->sxdp_queue_id;
-	err = xdp_umem_assign_dev(xs->umem, dev, xs->queue_id);
+	err = xdp_umem_assign_dev(xs->umem, dev, sxdp->sxdp_queue_id,
+				  &xs->list);
 	if (err)
 		goto out_unlock;
 
+	xs->dev = dev;
+	xs->zc = xs->umem->zc;
+	xs->queue_id = sxdp->sxdp_queue_id;
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
 
@@ -612,7 +653,7 @@ static void xsk_destruct(struct sock *sk)
 
 	xskq_destroy(xs->rx);
 	xskq_destroy(xs->tx);
-	xdp_put_umem(xs->umem);
+	xdp_put_umem(xs->umem, xs);
 
 	sk_refcnt_debug_dec(sk);
 }
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 599a8d43c69a..5533bf32a254 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -17,9 +17,11 @@
 
 #include <linux/types.h>
 #include <linux/if_xdp.h>
+#include <linux/cache.h>
 #include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
+#define LAZY_UPDATE_THRESHOLD 128
 
 struct xsk_queue {
 	struct xdp_umem_props umem_props;
@@ -53,9 +55,14 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 	return (entries > dcnt) ? dcnt : entries;
 }
 
+static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
+{
+	return q->nentries - (producer - q->cons_tail);
+}
+
 static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 {
-	u32 free_entries = q->nentries - (producer - q->cons_tail);
+	u32 free_entries = xskq_nb_free_lazy(q, producer);
 
 	if (free_entries >= dcnt)
 		return free_entries;
@@ -119,6 +126,9 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
+	if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
 	ring->desc[q->prod_tail++ & q->ring_mask] = id;
 
 	/* Order producer and data */
@@ -128,6 +138,26 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 	return 0;
 }
 
+static inline int xskq_produce_id_lazy(struct xsk_queue *q, u32 id)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
+	ring->desc[q->prod_head++ & q->ring_mask] = id;
+	return 0;
+}
+
+static inline void xskq_produce_flush_id_n(struct xsk_queue *q, u32 nb_entries)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail += nb_entries;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
 static inline int xskq_reserve_id(struct xsk_queue *q)
 {
 	if (xskq_nb_free(q, q->prod_head, 1) == 0)
-- 
2.14.1


* [RFC PATCH bpf-next 09/12] samples/bpf: minor *_nb_free performance fix
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/xdpsock_user.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 7fe60f6f7d53..50b9cabba4e8 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -159,15 +159,15 @@ static const char pkt_data[] =
 
 static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
 {
-	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+	u32 free_entries = q->cached_cons - q->cached_prod;
 
 	if (free_entries >= nb)
 		return free_entries;
 
 	/* Refresh the local tail pointer */
-	q->cached_cons = q->ring->ptrs.consumer;
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
 
-	return q->size - (q->cached_prod - q->cached_cons);
+	return q->cached_cons - q->cached_prod;
 }
 
 static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
@@ -432,6 +432,7 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 
 	umem->fq.mask = FQ_NUM_DESCS - 1;
 	umem->fq.size = FQ_NUM_DESCS;
+	umem->fq.cached_cons = FQ_NUM_DESCS;
 
 	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
 			     CQ_NUM_DESCS * sizeof(u32),
@@ -514,6 +515,7 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 
 	xsk->tx.mask = NUM_DESCS - 1;
 	xsk->tx.size = NUM_DESCS;
+	xsk->tx.cached_cons = NUM_DESCS;
 
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = opt_ifindex;
-- 
2.14.1

* [Intel-wired-lan] [RFC PATCH bpf-next 09/12] samples/bpf: minor *_nb_free performance fix
@ 2018-05-15 19:06   ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/xdpsock_user.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 7fe60f6f7d53..50b9cabba4e8 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -159,15 +159,15 @@ static const char pkt_data[] =
 
 static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
 {
-	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+	u32 free_entries = q->cached_cons - q->cached_prod;
 
 	if (free_entries >= nb)
 		return free_entries;
 
 	/* Refresh the local tail pointer */
-	q->cached_cons = q->ring->ptrs.consumer;
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
 
-	return q->size - (q->cached_prod - q->cached_cons);
+	return q->cached_cons - q->cached_prod;
 }
 
 static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
@@ -432,6 +432,7 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 
 	umem->fq.mask = FQ_NUM_DESCS - 1;
 	umem->fq.size = FQ_NUM_DESCS;
+	umem->fq.cached_cons = FQ_NUM_DESCS;
 
 	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
 			     CQ_NUM_DESCS * sizeof(u32),
@@ -514,6 +515,7 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 
 	xsk->tx.mask = NUM_DESCS - 1;
 	xsk->tx.size = NUM_DESCS;
+	xsk->tx.cached_cons = NUM_DESCS;
 
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = opt_ifindex;
-- 
2.14.1


* [RFC PATCH bpf-next 10/12] i40e: added queue pair disable/enable functions
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Queue pair enable/disable plumbing.
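
A hedged sketch of the intended call pattern, mirroring how patch 11 in
this series uses these helpers when installing a UMEM on a queue of a
running interface (the function name below is a placeholder):

static int i40e_swap_queue_memory_model(struct i40e_vsi *vsi, u16 qid)
{
	bool if_running = netif_running(vsi->netdev) &&
			  i40e_enabled_xdp_vsi(vsi);
	int err;

	if (if_running) {
		err = i40e_queue_pair_disable(vsi, qid);
		if (err)
			return err;
	}

	/* ... attach or detach the per-queue UMEM here ... */

	if (if_running) {
		err = i40e_queue_pair_enable(vsi, qid);
		if (err)
			return err;
	}

	return 0;
}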

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 251 ++++++++++++++++++++++++++++
 1 file changed, 251 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c8659fbd7111..b4c23cf3979c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11799,6 +11799,257 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_control_napi(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_control_rings(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_control_rings(vsi, queue_pair,
+					    false /* disable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
-- 
2.14.1

* [Intel-wired-lan] [RFC PATCH bpf-next 10/12] i40e: added queue pair disable/enable functions
@ 2018-05-15 19:06   ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Queue pair enable/disable plumbing.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 251 ++++++++++++++++++++++++++++
 1 file changed, 251 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c8659fbd7111..b4c23cf3979c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11799,6 +11799,257 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_control_napi(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_control_rings(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_control_rings(vsi, queue_pair,
+					    false /* disable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
-- 
2.14.1


* [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 19:06   ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

A lot of things here. First we add support for the new
XDP_SETUP_XSK_UMEM command in ndo_bpf. This allows the AF_XDP socket
to pass a UMEM to the driver. The driver will then DMA map all the
frames in the UMEM. Next, the Rx code will allocate
frames from the UMEM fill queue, instead of the regular page
allocator.

Externally, for the rest of the XDP code, the driver-internal UMEM
allocator will appear as MEM_TYPE_ZERO_COPY.

Keep in mind that having frames coming from userland requires some
extra care when passing them to the regular kernel stack. In
these cases the ZC frame must be copied.

The commit also introduces completely new clean_rx_irq/allocator
functions for zero-copy, and a means (function pointers) to set the
allocator and clean_rx functions.

Finally, a lot of things are *not* implemented here. To mention some:

* No passing to the stack via XDP_PASS (clone/copy to skb).
* No XDP redirect to other than sockets (convert_to_xdp_frame does not
  clone the frame yet).

And yes, too much C&P and too big commit. :-)
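
As a rough, hedged sketch of the Rx allocation idea only (the actual
i40e_alloc_rx_buffers_zc() in this patch differs in detail; headroom
handling and the tail bump are omitted here):

static bool i40e_alloc_rx_buffers_zc_sketch(struct i40e_ring *rx_ring,
					    u16 count)
{
	struct xdp_umem *umem = rx_ring->xsk_umem;
	u16 ntu = rx_ring->next_to_use;
	union i40e_rx_desc *rx_desc;
	u32 *id;

	while (count--) {
		/* Pull the next frame id from the UMEM fill queue
		 * instead of going to the page allocator.
		 */
		id = xsk_umem_peek_id(umem);
		if (!id)
			return false;

		rx_desc = I40E_RX_DESC(rx_ring, ntu);
		/* Frames were DMA mapped up front in i40e_xsk_umem_dma_map() */
		rx_desc->read.pkt_addr =
			cpu_to_le64(umem->frames[*id].dma + XDP_PACKET_HEADROOM);

		xsk_umem_discard_id(umem);
		if (++ntu == rx_ring->count)
			ntu = 0;
	}

	rx_ring->next_to_use = ntu;
	return true;
}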

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |  20 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c | 202 +++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 400 ++++++++++++++++++++++++++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  30 ++-
 4 files changed, 619 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..e6ee6c9bf094 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,12 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	/* AF_XDP zero-copy */
+	struct xdp_umem **xsk_umems;
+	u16 num_xsk_umems_used;
+	u16 num_xsk_umems;
+
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 	return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+	int qid = ring->queue_index;
+
+	if (ring_is_xdp(ring))
+		qid -= ring->vsi->alloc_queue_pairs;
+
+	if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+		return NULL;
+
+	return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b4c23cf3979c..dc3d668a741e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <net/xdp_sock.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -3054,6 +3055,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
+	if (ring_is_xdp(ring))
+		ring->xsk_umem = i40e_xsk_umem(ring);
+
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
 		ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -3163,13 +3167,31 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	i40e_status err = 0;
+	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	ring->xsk_umem = i40e_xsk_umem(ring);
+	if (ring->xsk_umem) {
+		ring->clean_rx_irq = i40e_clean_rx_irq_zc;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
+		ring->rx_buf_len = ring->xsk_umem->props.frame_size -
+				   ring->xsk_umem->frame_headroom -
+				   XDP_PACKET_HEADROOM;
+		ring->zca.free = i40e_zca_free;
+		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_ZERO_COPY,
+						 &ring->zca);
+		if (ret)
+			return ret;
+	} else {
+		ring->clean_rx_irq = i40e_clean_rx_irq;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
+		ring->rx_buf_len = vsi->rx_buf_len;
+	}
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3225,7 +3247,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
 }
@@ -12050,6 +12072,179 @@ static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
 	return err;
 }
 
+static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_umems)
+		return 0;
+
+	vsi->num_xsk_umems_used = 0;
+	vsi->num_xsk_umems = vsi->alloc_queue_pairs;
+	vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
+				 GFP_KERNEL);
+	if (!vsi->xsk_umems) {
+		vsi->num_xsk_umems = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			     u16 qid)
+{
+	int err;
+
+	err = i40e_alloc_xsk_umems(vsi);
+	if (err)
+		return err;
+
+	vsi->xsk_umems[qid] = umem;
+	vsi->num_xsk_umems_used++;
+
+	return 0;
+}
+
+static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
+{
+	vsi->xsk_umems[qid] = NULL;
+	vsi->num_xsk_umems_used--;
+
+	if (vsi->num_xsk_umems == 0) {
+		kfree(vsi->xsk_umems);
+		vsi->xsk_umems = NULL;
+		vsi->num_xsk_umems = 0;
+	}
+}
+
+static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->props.nframes; i++) {
+		dma = dma_map_single_attrs(dev, umem->frames[i].addr,
+					   umem->props.frame_size,
+					   DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		umem->frames[i].dma = dma;
+	}
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		dma_unmap_single_attrs(dev, umem->frames[i].dma,
+				       umem->props.frame_size,
+				       DMA_BIDIRECTIONAL,
+				       I40E_RX_DMA_ATTR);
+		umem->frames[i].dma = 0;
+	}
+
+	return -1;
+}
+
+static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->props.nframes; i++) {
+		dma_unmap_single_attrs(dev, umem->frames[i].dma,
+				       umem->props.frame_size,
+				       DMA_BIDIRECTIONAL,
+				       I40E_RX_DMA_ATTR);
+
+		umem->frames[i].dma = 0;
+	}
+}
+
+static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
+				u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems && vsi->xsk_umems[qid])
+		return -EBUSY;
+
+	err = i40e_xsk_umem_dma_map(vsi, umem);
+	if (err)
+		return err;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_umem(vsi, umem, qid);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
+	    !vsi->xsk_umems[qid])
+		return -EINVAL;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
+	i40e_remove_xsk_umem(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			       u16 qid)
+{
+	if (umem)
+		return i40e_xsk_umem_enable(vsi, umem, qid);
+
+	return i40e_xsk_umem_disable(vsi, qid);
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
@@ -12071,6 +12266,9 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5efa68de935b..f89ac524652c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -5,6 +5,7 @@
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp.h>
+#include <net/xdp_sock.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -1373,31 +1374,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	}
 
 	/* Free all the Rx ring sk_buffs */
-	for (i = 0; i < rx_ring->count; i++) {
-		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+	if (!rx_ring->xsk_umem) {
+		for (i = 0; i < rx_ring->count; i++) {
+			struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		if (!rx_bi->page)
-			continue;
-
-		/* Invalidate cache lines that may have been written to by
-		 * device so that we avoid corrupting memory.
-		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     I40E_RX_DMA_ATTR);
-
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+			if (!rx_bi->page)
+				continue;
 
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+			/* Invalidate cache lines that may have been
+			 * written to by device so that we avoid
+			 * corrupting memory.
+			 */
+			dma_sync_single_range_for_cpu(rx_ring->dev,
+						      rx_bi->dma,
+						      rx_bi->page_offset,
+						      rx_ring->rx_buf_len,
+						      DMA_FROM_DEVICE);
+
+			/* free resources associated with mapping */
+			dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
+					     i40e_rx_pg_size(rx_ring),
+					     DMA_FROM_DEVICE,
+					     I40E_RX_DMA_ATTR);
+
+			__page_frag_cache_drain(rx_bi->page,
+						rx_bi->pagecnt_bias);
+
+			rx_bi->page = NULL;
+			rx_bi->page_offset = 0;
+		}
 	}
 
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
@@ -2214,8 +2219,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	prefetchw(xdp->data_hard_start); /* xdp_frame write */
-
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2284,7 +2287,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * Returns amount of work completed
  **/
-static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct sk_buff *skb = rx_ring->skb;
@@ -2426,6 +2429,349 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
+				       struct xdp_buff *xdp)
+{
+	int err, result = I40E_XDP_PASS;
+	struct i40e_ring *xdp_ring;
+	struct bpf_prog *xdp_prog;
+	u32 act;
+
+	rcu_read_lock();
+	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	switch (act) {
+	case XDP_PASS:
+		break;
+	case XDP_TX:
+		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		result = I40E_XDP_CONSUMED;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ERR_PTR(-result);
+}
+
+static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
+				struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	void *addr = bi->addr;
+	u32 *id;
+
+	if (addr) {
+		rx_ring->rx_stats.page_reuse_count++;
+		return true;
+	}
+
+	id = xsk_umem_peek_id(umem);
+	if (unlikely(!id)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	bi->dma = umem->frames[*id].dma + umem->frame_headroom +
+		  XDP_PACKET_HEADROOM;
+	bi->addr = umem->frames[*id].addr + umem->frame_headroom +
+		  XDP_PACKET_HEADROOM;
+	bi->id = *id;
+
+	xsk_umem_discard_id(umem);
+	return true;
+}
+
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
+{
+	u16 ntu = rx_ring->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct i40e_rx_buffer *bi;
+
+	rx_desc = I40E_RX_DESC(rx_ring, ntu);
+	bi = &rx_ring->rx_bi[ntu];
+
+	do {
+		if (!i40e_alloc_frame_zc(rx_ring, bi))
+			goto no_buffers;
+
+		/* sync the buffer for use by the device */
+		dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
+						 rx_ring->rx_buf_len,
+						 DMA_BIDIRECTIONAL);
+
+		/* Refresh the desc even if buffer_addrs didn't change
+		 * because each write-back erases this info.
+		 */
+		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
+
+		rx_desc++;
+		bi++;
+		ntu++;
+		if (unlikely(ntu == rx_ring->count)) {
+			rx_desc = I40E_RX_DESC(rx_ring, 0);
+			bi = rx_ring->rx_bi;
+			ntu = 0;
+		}
+
+		/* clear the status bits for the next_to_use descriptor */
+		rx_desc->wb.qword1.status_error_len = 0;
+
+		cleaned_count--;
+	} while (cleaned_count);
+
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	return false;
+
+no_buffers:
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	/* make sure to come back via polling to try again after
+	 * allocation failure
+	 */
+	return true;
+}
+
+static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
+						    const unsigned int size)
+{
+	struct i40e_rx_buffer *rx_buffer;
+
+	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma, 0,
+				      size,
+				      DMA_BIDIRECTIONAL);
+
+	return rx_buffer;
+}
+
+static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
+				    struct i40e_rx_buffer *old_buff)
+{
+	struct i40e_rx_buffer *new_buff;
+	u16 nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma  = old_buff->dma;
+	new_buff->addr = old_buff->addr;
+	new_buff->id   = old_buff->id;
+}
+
+/* Called from the XDP return API in NAPI context. */
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
+{
+	struct i40e_rx_buffer *new_buff;
+	struct i40e_ring *rx_ring;
+	u16 nta;
+
+	rx_ring = container_of(alloc, struct i40e_ring, zca);
+	nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	new_buff->dma  = rx_ring->xsk_umem->frames[handle].dma;
+	new_buff->addr = rx_ring->xsk_umem->frames[handle].addr;
+	new_buff->id   = (u32)handle;
+}
+
+static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
+					    struct i40e_rx_buffer *rx_buffer,
+					    struct xdp_buff *xdp)
+{
+	// XXX implement alloc skb and copy
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	return NULL;
+}
+
+static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
+					     union i40e_rx_desc *rx_desc,
+					     u64 qw)
+{
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
+	u8 id;
+
+	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+	/* place unused page back on the ring */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->addr = NULL;
+
+	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
+		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
+
+	if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
+		i40e_fd_handle_status(rx_ring, rx_desc, id);
+}
+
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
+{
+	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	bool failure = false, xdp_xmit = false;
+	struct sk_buff *skb;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	while (likely(total_rx_packets < (unsigned int)budget)) {
+		struct i40e_rx_buffer *rx_buffer;
+		union i40e_rx_desc *rx_desc;
+		unsigned int size;
+		u16 vlan_tag;
+		u8 rx_ptype;
+		u64 qword;
+		u32 ntc;
+
+		/* return some buffers to hardware, one at a time is too slow */
+		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
+			failure = failure ||
+				  i40e_alloc_rx_buffers_zc(rx_ring,
+							   cleaned_count);
+			cleaned_count = 0;
+		}
+
+		rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+
+		/* status_error_len will always be zero for unused descriptors
+		 * because it's cleared in cleanup, and overlaps with hdr_addr
+		 * which is always zero because packet split isn't used, if the
+		 * hardware wrote DD then the length will be non-zero
+		 */
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+
+		/* This memory barrier is needed to keep us from reading
+		 * any other fields out of the rx_desc until we have
+		 * verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		if (unlikely(i40e_rx_is_programming_status(qword))) {
+			i40e_clean_programming_status_zc(rx_ring, rx_desc,
+							 qword);
+			cleaned_count++;
+			continue;
+		}
+		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+		if (!size)
+			break;
+
+		rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
+
+		/* retrieve a buffer from the ring */
+		xdp.data = rx_buffer->addr;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+		xdp.data_end = xdp.data + size;
+		xdp.handle = rx_buffer->id;
+
+		skb = i40e_run_xdp_zc(rx_ring, &xdp);
+
+		if (IS_ERR(skb)) {
+			if (PTR_ERR(skb) == -I40E_XDP_TX)
+				xdp_xmit = true;
+			else
+				i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+			total_rx_bytes += size;
+			total_rx_packets++;
+		} else {
+			skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
+			if (!skb) {
+				rx_ring->rx_stats.alloc_buff_failed++;
+				break;
+			}
+		}
+
+		rx_buffer->addr = NULL;
+		cleaned_count++;
+
+		/* don't care about non-EOP frames in XDP mode */
+		ntc = rx_ring->next_to_clean + 1;
+		ntc = (ntc < rx_ring->count) ? ntc : 0;
+		rx_ring->next_to_clean = ntc;
+		prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+		if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
+			skb = NULL;
+			continue;
+		}
+
+		/* probably a little skewed due to removing CRC */
+		total_rx_bytes += skb->len;
+
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+		rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+			   I40E_RXD_QW1_PTYPE_SHIFT;
+
+		/* populate checksum, VLAN, and protocol */
+		i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+		vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+			   le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+
+		i40e_receive_skb(rx_ring, skb, vlan_tag);
+		skb = NULL;
+
+		/* update budget accounting */
+		total_rx_packets++;
+	}
+
+	if (xdp_xmit) {
+		struct i40e_ring *xdp_ring =
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+
+		i40e_xdp_ring_update_tail(xdp_ring);
+		xdp_do_flush_map();
+	}
+
+	u64_stats_update_begin(&rx_ring->syncp);
+	rx_ring->stats.packets += total_rx_packets;
+	rx_ring->stats.bytes += total_rx_bytes;
+	u64_stats_update_end(&rx_ring->syncp);
+	rx_ring->q_vector->rx.total_packets += total_rx_packets;
+	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+
+	/* guarantee a trip back through this routine if there was a failure */
+	return failure ? budget : (int)total_rx_packets;
+}
+
 static inline u32 i40e_buildreg_itr(const int type, u16 itr)
 {
 	u32 val;
@@ -2576,7 +2922,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fdd2c55f03a6..9d5d9862e9f1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -296,13 +296,22 @@ struct i40e_tx_buffer {
 
 struct i40e_rx_buffer {
 	dma_addr_t dma;
-	struct page *page;
+	union {
+		struct {
+			struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
+			__u32 page_offset;
 #else
-	__u16 page_offset;
+			__u16 page_offset;
 #endif
-	__u16 pagecnt_bias;
+			__u16 pagecnt_bias;
+		};
+		struct {
+			/* for umem */
+			void *addr;
+			u32 id;
+		};
+	};
 };
 
 struct i40e_queue_stats {
@@ -344,6 +353,8 @@ enum i40e_ring_state_t {
 #define I40E_RX_SPLIT_TCP_UDP 0x4
 #define I40E_RX_SPLIT_SCTP    0x8
 
+void i40e_zc_recycle(struct zero_copy_allocator *alloc, unsigned long handle);
+
 /* struct that defines a descriptor ring, associated with a VSI */
 struct i40e_ring {
 	struct i40e_ring *next;		/* pointer to next ring in q_vector */
@@ -414,6 +425,12 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+
+	int (*clean_rx_irq)(struct i40e_ring *, int);
+	bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
+	struct xdp_umem *xsk_umem;
+
+	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -474,6 +491,7 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
 #define i40e_rx_pg_size(_ring) (PAGE_SIZE << i40e_rx_pg_order(_ring))
 
 bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
 netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
@@ -489,6 +507,9 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -575,4 +596,5 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
 {
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
+
 #endif /* _I40E_TXRX_H_ */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Intel-wired-lan] [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx
@ 2018-05-15 19:06   ` =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
  0 siblings, 0 replies; 54+ messages in thread
From: =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

A lot of things here. First we add support for the new
XDP_SETUP_XSK_UMEM command in ndo_bpf. This allows the AF_XDP socket
to pass a UMEM to the driver. The driver will then DMA map all the
frames in the UMEM. Next, the Rx code will allocate frames from the
UMEM fill queue, instead of the regular page allocator.
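
For illustration only, that peek/discard style of consuming frame ids
from the fill queue (analogous to the xsk_umem_peek_id() and
xsk_umem_discard_id() calls used below, but with made-up names and a
trivial stand-alone ring) looks roughly like this:

      #include <stdio.h>

      #define FILLQ_SZ 8

      struct fillq {
              unsigned int ids[FILLQ_SZ];
              unsigned int head, tail;      /* head == tail means empty */
      };

      static unsigned int *fillq_peek(struct fillq *q)
      {
              return q->head == q->tail ? NULL : &q->ids[q->head % FILLQ_SZ];
      }

      static void fillq_discard(struct fillq *q)
      {
              q->head++;            /* id is now owned by the Rx ring */
      }

      static int alloc_frame(struct fillq *q, unsigned int *slot)
      {
              unsigned int *id = fillq_peek(q);

              if (!id)
                      return 0;     /* fill queue empty: allocation failed */

              *slot = *id;          /* bind the umem frame id to the Rx slot */
              fillq_discard(q);     /* only consume the id on success */
              return 1;
      }

      int main(void)
      {
              struct fillq q = { .ids = { 3, 7, 1 }, .tail = 3 };
              unsigned int slot;

              while (alloc_frame(&q, &slot))
                      printf("Rx slot gets umem frame %u\n", slot);
              return 0;
      }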

Externally, to the rest of the XDP code, the driver-internal UMEM
allocator will appear as MEM_TYPE_ZERO_COPY.

Keep in mind that frames coming from userland require some extra
care when they are passed to the regular kernel stack. In these
cases the ZC frame must be copied.

The commit also introduces completely new clean_rx_irq/allocator
functions for zero-copy, and a means (function pointers) to select
the allocator and clean_rx function per ring.

Finally, a lot of things are *not* implemented here. To mention some:

* No passing to the stack via XDP_PASS (clone/copy to skb).
* No XDP redirect to targets other than sockets (convert_to_xdp_frame
  does not clone the frame yet).

And yes, too much C&P and too big commit. :-)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |  20 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c | 202 +++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 400 ++++++++++++++++++++++++++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  30 ++-
 4 files changed, 619 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..e6ee6c9bf094 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,12 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	/* AF_XDP zero-copy */
+	struct xdp_umem **xsk_umems;
+	u16 num_xsk_umems_used;
+	u16 num_xsk_umems;
+
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 	return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+	int qid = ring->queue_index;
+
+	if (ring_is_xdp(ring))
+		qid -= ring->vsi->alloc_queue_pairs;
+
+	if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+		return NULL;
+
+	return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b4c23cf3979c..dc3d668a741e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <net/xdp_sock.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -3054,6 +3055,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
+	if (ring_is_xdp(ring))
+		ring->xsk_umem = i40e_xsk_umem(ring);
+
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
 		ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -3163,13 +3167,31 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	i40e_status err = 0;
+	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	ring->xsk_umem = i40e_xsk_umem(ring);
+	if (ring->xsk_umem) {
+		ring->clean_rx_irq = i40e_clean_rx_irq_zc;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
+		ring->rx_buf_len = ring->xsk_umem->props.frame_size -
+				   ring->xsk_umem->frame_headroom -
+				   XDP_PACKET_HEADROOM;
+		ring->zca.free = i40e_zca_free;
+		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_ZERO_COPY,
+						 &ring->zca);
+		if (ret)
+			return ret;
+	} else {
+		ring->clean_rx_irq = i40e_clean_rx_irq;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
+		ring->rx_buf_len = vsi->rx_buf_len;
+	}
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3225,7 +3247,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
 }
@@ -12050,6 +12072,179 @@ static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
 	return err;
 }
 
+static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_umems)
+		return 0;
+
+	vsi->num_xsk_umems_used = 0;
+	vsi->num_xsk_umems = vsi->alloc_queue_pairs;
+	vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
+				 GFP_KERNEL);
+	if (!vsi->xsk_umems) {
+		vsi->num_xsk_umems = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			     u16 qid)
+{
+	int err;
+
+	err = i40e_alloc_xsk_umems(vsi);
+	if (err)
+		return err;
+
+	vsi->xsk_umems[qid] = umem;
+	vsi->num_xsk_umems_used++;
+
+	return 0;
+}
+
+static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
+{
+	vsi->xsk_umems[qid] = NULL;
+	vsi->num_xsk_umems_used--;
+
+	if (vsi->num_xsk_umems == 0) {
+		kfree(vsi->xsk_umems);
+		vsi->xsk_umems = NULL;
+		vsi->num_xsk_umems = 0;
+	}
+}
+
+static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->props.nframes; i++) {
+		dma = dma_map_single_attrs(dev, umem->frames[i].addr,
+					   umem->props.frame_size,
+					   DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		umem->frames[i].dma = dma;
+	}
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		dma_unmap_single_attrs(dev, umem->frames[i].dma,
+				       umem->props.frame_size,
+				       DMA_BIDIRECTIONAL,
+				       I40E_RX_DMA_ATTR);
+		umem->frames[i].dma = 0;
+	}
+
+	return -1;
+}
+
+static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->props.nframes; i++) {
+		dma_unmap_single_attrs(dev, umem->frames[i].dma,
+				       umem->props.frame_size,
+				       DMA_BIDIRECTIONAL,
+				       I40E_RX_DMA_ATTR);
+
+		umem->frames[i].dma = 0;
+	}
+}
+
+static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
+				u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems && vsi->xsk_umems[qid])
+		return -EBUSY;
+
+	err = i40e_xsk_umem_dma_map(vsi, umem);
+	if (err)
+		return err;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_umem(vsi, umem, qid);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
+	    !vsi->xsk_umems[qid])
+		return -EINVAL;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
+	i40e_remove_xsk_umem(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			       u16 qid)
+{
+	if (umem)
+		return i40e_xsk_umem_enable(vsi, umem, qid);
+
+	return i40e_xsk_umem_disable(vsi, qid);
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
@@ -12071,6 +12266,9 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5efa68de935b..f89ac524652c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -5,6 +5,7 @@
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp.h>
+#include <net/xdp_sock.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -1373,31 +1374,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	}
 
 	/* Free all the Rx ring sk_buffs */
-	for (i = 0; i < rx_ring->count; i++) {
-		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+	if (!rx_ring->xsk_umem) {
+		for (i = 0; i < rx_ring->count; i++) {
+			struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		if (!rx_bi->page)
-			continue;
-
-		/* Invalidate cache lines that may have been written to by
-		 * device so that we avoid corrupting memory.
-		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     I40E_RX_DMA_ATTR);
-
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+			if (!rx_bi->page)
+				continue;
 
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+			/* Invalidate cache lines that may have been
+			 * written to by device so that we avoid
+			 * corrupting memory.
+			 */
+			dma_sync_single_range_for_cpu(rx_ring->dev,
+						      rx_bi->dma,
+						      rx_bi->page_offset,
+						      rx_ring->rx_buf_len,
+						      DMA_FROM_DEVICE);
+
+			/* free resources associated with mapping */
+			dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
+					     i40e_rx_pg_size(rx_ring),
+					     DMA_FROM_DEVICE,
+					     I40E_RX_DMA_ATTR);
+
+			__page_frag_cache_drain(rx_bi->page,
+						rx_bi->pagecnt_bias);
+
+			rx_bi->page = NULL;
+			rx_bi->page_offset = 0;
+		}
 	}
 
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
@@ -2214,8 +2219,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	prefetchw(xdp->data_hard_start); /* xdp_frame write */
-
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2284,7 +2287,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * Returns amount of work completed
  **/
-static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct sk_buff *skb = rx_ring->skb;
@@ -2426,6 +2429,349 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
+				       struct xdp_buff *xdp)
+{
+	int err, result = I40E_XDP_PASS;
+	struct i40e_ring *xdp_ring;
+	struct bpf_prog *xdp_prog;
+	u32 act;
+
+	rcu_read_lock();
+	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	switch (act) {
+	case XDP_PASS:
+		break;
+	case XDP_TX:
+		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		result = I40E_XDP_CONSUMED;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ERR_PTR(-result);
+}
+
+static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
+				struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	void *addr = bi->addr;
+	u32 *id;
+
+	if (addr) {
+		rx_ring->rx_stats.page_reuse_count++;
+		return true;
+	}
+
+	id = xsk_umem_peek_id(umem);
+	if (unlikely(!id)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	bi->dma = umem->frames[*id].dma + umem->frame_headroom +
+		  XDP_PACKET_HEADROOM;
+	bi->addr = umem->frames[*id].addr + umem->frame_headroom +
+		  XDP_PACKET_HEADROOM;
+	bi->id = *id;
+
+	xsk_umem_discard_id(umem);
+	return true;
+}
+
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
+{
+	u16 ntu = rx_ring->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct i40e_rx_buffer *bi;
+
+	rx_desc = I40E_RX_DESC(rx_ring, ntu);
+	bi = &rx_ring->rx_bi[ntu];
+
+	do {
+		if (!i40e_alloc_frame_zc(rx_ring, bi))
+			goto no_buffers;
+
+		/* sync the buffer for use by the device */
+		dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
+						 rx_ring->rx_buf_len,
+						 DMA_BIDIRECTIONAL);
+
+		/* Refresh the desc even if buffer_addrs didn't change
+		 * because each write-back erases this info.
+		 */
+		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
+
+		rx_desc++;
+		bi++;
+		ntu++;
+		if (unlikely(ntu == rx_ring->count)) {
+			rx_desc = I40E_RX_DESC(rx_ring, 0);
+			bi = rx_ring->rx_bi;
+			ntu = 0;
+		}
+
+		/* clear the status bits for the next_to_use descriptor */
+		rx_desc->wb.qword1.status_error_len = 0;
+
+		cleaned_count--;
+	} while (cleaned_count);
+
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	return false;
+
+no_buffers:
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	/* make sure to come back via polling to try again after
+	 * allocation failure
+	 */
+	return true;
+}
+
+static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
+						    const unsigned int size)
+{
+	struct i40e_rx_buffer *rx_buffer;
+
+	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma, 0,
+				      size,
+				      DMA_BIDIRECTIONAL);
+
+	return rx_buffer;
+}
+
+static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
+				    struct i40e_rx_buffer *old_buff)
+{
+	struct i40e_rx_buffer *new_buff;
+	u16 nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma  = old_buff->dma;
+	new_buff->addr = old_buff->addr;
+	new_buff->id   = old_buff->id;
+}
+
+/* Called from the XDP return API in NAPI context. */
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
+{
+	struct i40e_rx_buffer *new_buff;
+	struct i40e_ring *rx_ring;
+	u16 nta;
+
+	rx_ring = container_of(alloc, struct i40e_ring, zca);
+	nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	new_buff->dma  = rx_ring->xsk_umem->frames[handle].dma;
+	new_buff->addr = rx_ring->xsk_umem->frames[handle].addr;
+	new_buff->id   = (u32)handle;
+}
+
+static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
+					    struct i40e_rx_buffer *rx_buffer,
+					    struct xdp_buff *xdp)
+{
+	// XXX implement alloc skb and copy
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	return NULL;
+}
+
+static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
+					     union i40e_rx_desc *rx_desc,
+					     u64 qw)
+{
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
+	u8 id;
+
+	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+	/* place unused page back on the ring */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->addr = NULL;
+
+	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
+		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
+
+	if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
+		i40e_fd_handle_status(rx_ring, rx_desc, id);
+}
+
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
+{
+	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	bool failure = false, xdp_xmit = false;
+	struct sk_buff *skb;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	while (likely(total_rx_packets < (unsigned int)budget)) {
+		struct i40e_rx_buffer *rx_buffer;
+		union i40e_rx_desc *rx_desc;
+		unsigned int size;
+		u16 vlan_tag;
+		u8 rx_ptype;
+		u64 qword;
+		u32 ntc;
+
+		/* return some buffers to hardware, one at a time is too slow */
+		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
+			failure = failure ||
+				  i40e_alloc_rx_buffers_zc(rx_ring,
+							   cleaned_count);
+			cleaned_count = 0;
+		}
+
+		rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+
+		/* status_error_len will always be zero for unused descriptors
+		 * because it's cleared in cleanup, and overlaps with hdr_addr
+		 * which is always zero because packet split isn't used, if the
+		 * hardware wrote DD then the length will be non-zero
+		 */
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+
+		/* This memory barrier is needed to keep us from reading
+		 * any other fields out of the rx_desc until we have
+		 * verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		if (unlikely(i40e_rx_is_programming_status(qword))) {
+			i40e_clean_programming_status_zc(rx_ring, rx_desc,
+							 qword);
+			cleaned_count++;
+			continue;
+		}
+		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+		if (!size)
+			break;
+
+		rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
+
+		/* retrieve a buffer from the ring */
+		xdp.data = rx_buffer->addr;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+		xdp.data_end = xdp.data + size;
+		xdp.handle = rx_buffer->id;
+
+		skb = i40e_run_xdp_zc(rx_ring, &xdp);
+
+		if (IS_ERR(skb)) {
+			if (PTR_ERR(skb) == -I40E_XDP_TX)
+				xdp_xmit = true;
+			else
+				i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+			total_rx_bytes += size;
+			total_rx_packets++;
+		} else {
+			skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
+			if (!skb) {
+				rx_ring->rx_stats.alloc_buff_failed++;
+				break;
+			}
+		}
+
+		rx_buffer->addr = NULL;
+		cleaned_count++;
+
+		/* don't care about non-EOP frames in XDP mode */
+		ntc = rx_ring->next_to_clean + 1;
+		ntc = (ntc < rx_ring->count) ? ntc : 0;
+		rx_ring->next_to_clean = ntc;
+		prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+		if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
+			skb = NULL;
+			continue;
+		}
+
+		/* probably a little skewed due to removing CRC */
+		total_rx_bytes += skb->len;
+
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+		rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+			   I40E_RXD_QW1_PTYPE_SHIFT;
+
+		/* populate checksum, VLAN, and protocol */
+		i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+		vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+			   le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+
+		i40e_receive_skb(rx_ring, skb, vlan_tag);
+		skb = NULL;
+
+		/* update budget accounting */
+		total_rx_packets++;
+	}
+
+	if (xdp_xmit) {
+		struct i40e_ring *xdp_ring =
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+
+		i40e_xdp_ring_update_tail(xdp_ring);
+		xdp_do_flush_map();
+	}
+
+	u64_stats_update_begin(&rx_ring->syncp);
+	rx_ring->stats.packets += total_rx_packets;
+	rx_ring->stats.bytes += total_rx_bytes;
+	u64_stats_update_end(&rx_ring->syncp);
+	rx_ring->q_vector->rx.total_packets += total_rx_packets;
+	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+
+	/* guarantee a trip back through this routine if there was a failure */
+	return failure ? budget : (int)total_rx_packets;
+}
+
 static inline u32 i40e_buildreg_itr(const int type, u16 itr)
 {
 	u32 val;
@@ -2576,7 +2922,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fdd2c55f03a6..9d5d9862e9f1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -296,13 +296,22 @@ struct i40e_tx_buffer {
 
 struct i40e_rx_buffer {
 	dma_addr_t dma;
-	struct page *page;
+	union {
+		struct {
+			struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
+			__u32 page_offset;
 #else
-	__u16 page_offset;
+			__u16 page_offset;
 #endif
-	__u16 pagecnt_bias;
+			__u16 pagecnt_bias;
+		};
+		struct {
+			/* for umem */
+			void *addr;
+			u32 id;
+		};
+	};
 };
 
 struct i40e_queue_stats {
@@ -344,6 +353,8 @@ enum i40e_ring_state_t {
 #define I40E_RX_SPLIT_TCP_UDP 0x4
 #define I40E_RX_SPLIT_SCTP    0x8
 
+void i40e_zc_recycle(struct zero_copy_allocator *alloc, unsigned long handle);
+
 /* struct that defines a descriptor ring, associated with a VSI */
 struct i40e_ring {
 	struct i40e_ring *next;		/* pointer to next ring in q_vector */
@@ -414,6 +425,12 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+
+	int (*clean_rx_irq)(struct i40e_ring *, int);
+	bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
+	struct xdp_umem *xsk_umem;
+
+	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -474,6 +491,7 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
 #define i40e_rx_pg_size(_ring) (PAGE_SIZE << i40e_rx_pg_order(_ring))
 
 bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
 netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
@@ -489,6 +507,9 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -575,4 +596,5 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
 {
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
+
 #endif /* _I40E_TXRX_H_ */
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-15 19:06 ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
@ 2018-05-15 19:06   ` =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the zero-copy ndo is implemented. As a shortcut, the existing
XDP Tx rings are used for zero-copy. This means that an XDP program
cannot redirect to an AF_XDP enabled XDP Tx ring.
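
The Tx completion accounting in i40e_clean_tx_irq_zc() below is just
modular arithmetic on the head writeback index; a stand-alone model of
that step (made-up names, no driver types) is:

      #include <stdio.h>

      /* Number of descriptors the HW has completed since the last clean,
       * taking a possible wrap of the head index into account.
       */
      static unsigned int completed_frames(unsigned int head,
                                           unsigned int next_to_clean,
                                           unsigned int ring_count)
      {
              if (head < next_to_clean)     /* head wrapped past the end */
                      head += ring_count;
              return head - next_to_clean;
      }

      int main(void)
      {
              /* 512-entry ring, cleaner at 500, HW head wrapped to 20:
               * 512 - 500 + 20 = 32 frames are ready for completion.
               */
              printf("%u\n", completed_frames(20, 500, 512));
              return 0;
      }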

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 235 +++++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   6 +
 3 files changed, 212 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index dc3d668a741e..91f8e892179a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3055,8 +3055,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
-	if (ring_is_xdp(ring))
+	ring->clean_tx_irq = i40e_clean_tx_irq;
+	if (ring_is_xdp(ring)) {
 		ring->xsk_umem = i40e_xsk_umem(ring);
+		if (ring->xsk_umem)
+			ring->clean_tx_irq = i40e_clean_tx_irq_zc;
+	}
 
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
@@ -12309,6 +12313,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
 	.ndo_xdp_flush		= i40e_xdp_flush,
+	.ndo_xsk_async_xmit	= i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f89ac524652c..17c067556aba 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -649,9 +649,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		return;
 
-	/* Free all the Tx ring sk_buffs */
-	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+	/* Cleanup only needed for non XSK TX ZC rings */
+	if (!tx_ring->xsk_umem) {
+		/* Free all the Tx ring sk_buffs */
+		for (i = 0; i < tx_ring->count; i++)
+			i40e_unmap_and_free_tx_resource(tx_ring,
+							&tx_ring->tx_bi[i]);
+	}
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
@@ -768,8 +772,139 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
 #define WB_STRIDE 4
 
+static void i40e_update_stats_and_arm_wb(struct i40e_ring *tx_ring,
+					 struct i40e_vsi *vsi,
+					 unsigned int total_packets,
+					 unsigned int total_bytes,
+					 int budget)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && (j > 0) &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
+/* Returns true if the work is finished */
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+	bool work_done = true, xdp_xmit = false;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+	u16 offset;
+	u32 len;
+
+	while (budget-- > 0) {
+		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			work_done = false;
+			break;
+		}
+
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len,
+					 &offset))
+			break;
+
+		xdp_xmit = true;
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_BIDIRECTIONAL);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+		tx_bi->gso_segs = 1;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+							| I40E_TX_DESC_CMD_EOP,
+							  0, len, 0);
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+	}
+
+	/* Request an interrupt for the last frame and bump tail ptr. */
+	if (xdp_xmit) {
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+		i40e_xdp_ring_update_tail(xdp_ring);
+	}
+
+	return !!budget && work_done;
+}
+
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget)
+{
+	unsigned int total_bytes = 0, total_packets = 0;
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 head_idx = i40e_get_head(tx_ring);
+	unsigned int budget = vsi->work_limit;
+	bool work_done = true, xmit_done;
+	u32 completed_frames;
+	u32 frames_ready;
+
+	if (head_idx < tx_ring->next_to_clean)
+		head_idx += tx_ring->count;
+	frames_ready = head_idx - tx_ring->next_to_clean;
+
+	if (frames_ready == 0) {
+		goto out_xmit;
+	} else if (frames_ready > budget) {
+		completed_frames = budget;
+		work_done = false;
+	} else {
+		completed_frames = frames_ready;
+	}
+
+	/* XXX Need to be calculated. */
+	/*total_bytes += tx_buf->bytecount;*/
+	total_packets += completed_frames;
+
+	tx_ring->next_to_clean += completed_frames;
+	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
+		tx_ring->next_to_clean -= tx_ring->count;
+
+	xsk_umem_complete_tx(umem, completed_frames);
+
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
+
+out_xmit:
+	xmit_done = i40e_xmit_zc(tx_ring, budget);
+
+	return work_done && xmit_done;
+}
+
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -778,8 +913,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget)
 {
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
@@ -874,27 +1009,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
 
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring, false);
-
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
@@ -2266,15 +2383,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
-{
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.
-	 */
-	wmb();
-	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
-}
-
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2904,10 +3012,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_tx_irq(vsi, ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
+
 		arm_wb |= ring->arm_wb;
 		ring->arm_wb = false;
 	}
@@ -3810,6 +3919,30 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	return -1;
 }
 
+/**
+ * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
 /**
  * i40e_xmit_xdp_ring - transmits an XDP buffer to an XDP Tx ring
  * @xdp: data to transmit
@@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return -ENXIO;
+
 	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
 	if (err != I40E_XDP_TX)
 		return -ENOSPC;
@@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return;
 
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return;
+
 	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
 }
+
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -ENETDOWN;
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return -ENXIO;
+
+	if (queue_id >= vsi->num_queue_pairs)
+		return -ENXIO;
+
+	if (!vsi->xdp_rings[queue_id]->xsk_umem)
+		return -ENXIO;
+
+	ring = vsi->xdp_rings[queue_id];
+
+	if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
+		i40e_force_wb(vsi, ring->q_vector);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 9d5d9862e9f1..ea1cac00cad4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -428,6 +428,7 @@ struct i40e_ring {
 
 	int (*clean_rx_irq)(struct i40e_ring *, int);
 	bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
+	bool (*clean_tx_irq)(struct i40e_vsi *, struct i40e_ring *, int);
 	struct xdp_umem *xsk_umem;
 
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
@@ -510,6 +511,11 @@ void i40e_xdp_flush(struct net_device *dev);
 int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
 int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
 void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget);
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget);
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Intel-wired-lan] [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
@ 2018-05-15 19:06   ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the zero-copy ndo is implemented. As a shortcut, the existing
XDP Tx rings are used for zero-copy. This means that an XDP program
cannot redirect to an AF_XDP-enabled XDP Tx ring.
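
For context, a rough sketch of how user space drives this Tx path (the
descriptor-posting helper below is purely illustrative and not part of
this patch): the application writes a frame into the UMEM, posts a
descriptor on the AF_XDP Tx ring, and then kicks the kernel. The kick
lands in the new ndo_xsk_async_xmit hook (i40e_xsk_async_xmit() here),
which makes sure NAPI runs so that i40e_xmit_zc() can consume the
descriptor via xsk_umem_consume_tx():

      /* Illustrative sketch only -- the helper name is made up. */
      post_xsk_tx_descriptor(xsk_tx_ring, frame_idx, frame_len);

      /* The kick: an empty send on the AF_XDP socket reaches
       * ndo_xsk_async_xmit in the driver, which forces a NAPI run
       * if one is not already scheduled.
       */
      sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);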

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 235 +++++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   6 +
 3 files changed, 212 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index dc3d668a741e..91f8e892179a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3055,8 +3055,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
-	if (ring_is_xdp(ring))
+	ring->clean_tx_irq = i40e_clean_tx_irq;
+	if (ring_is_xdp(ring)) {
 		ring->xsk_umem = i40e_xsk_umem(ring);
+		if (ring->xsk_umem)
+			ring->clean_tx_irq = i40e_clean_tx_irq_zc;
+	}
 
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
@@ -12309,6 +12313,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
 	.ndo_xdp_flush		= i40e_xdp_flush,
+	.ndo_xsk_async_xmit	= i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f89ac524652c..17c067556aba 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -649,9 +649,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		return;
 
-	/* Free all the Tx ring sk_buffs */
-	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+	/* Cleanup only needed for non XSK TX ZC rings */
+	if (!tx_ring->xsk_umem) {
+		/* Free all the Tx ring sk_buffs */
+		for (i = 0; i < tx_ring->count; i++)
+			i40e_unmap_and_free_tx_resource(tx_ring,
+							&tx_ring->tx_bi[i]);
+	}
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
@@ -768,8 +772,139 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
 #define WB_STRIDE 4
 
+static void i40e_update_stats_and_arm_wb(struct i40e_ring *tx_ring,
+					 struct i40e_vsi *vsi,
+					 unsigned int total_packets,
+					 unsigned int total_bytes,
+					 int budget)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && (j > 0) &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
+/* Returns true if the work is finished */
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+	bool work_done = true, xdp_xmit = false;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+	u16 offset;
+	u32 len;
+
+	while (budget-- > 0) {
+		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			work_done = false;
+			break;
+		}
+
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len,
+					 &offset))
+			break;
+
+		xdp_xmit = true;
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_BIDIRECTIONAL);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+		tx_bi->gso_segs = 1;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+							| I40E_TX_DESC_CMD_EOP,
+							  0, len, 0);
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+	}
+
+	/* Request an interrupt for the last frame and bump tail ptr. */
+	if (xdp_xmit) {
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+		i40e_xdp_ring_update_tail(xdp_ring);
+	}
+
+	return !!budget && work_done;
+}
+
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget)
+{
+	unsigned int total_bytes = 0, total_packets = 0;
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 head_idx = i40e_get_head(tx_ring);
+	unsigned int budget = vsi->work_limit;
+	bool work_done = true, xmit_done;
+	u32 completed_frames;
+	u32 frames_ready;
+
+	if (head_idx < tx_ring->next_to_clean)
+		head_idx += tx_ring->count;
+	frames_ready = head_idx - tx_ring->next_to_clean;
+
+	if (frames_ready == 0) {
+		goto out_xmit;
+	} else if (frames_ready > budget) {
+		completed_frames = budget;
+		work_done = false;
+	} else {
+		completed_frames = frames_ready;
+	}
+
+	/* XXX Need to be calculated. */
+	/*total_bytes += tx_buf->bytecount;*/
+	total_packets += completed_frames;
+
+	tx_ring->next_to_clean += completed_frames;
+	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
+		tx_ring->next_to_clean -= tx_ring->count;
+
+	xsk_umem_complete_tx(umem, completed_frames);
+
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
+
+out_xmit:
+	xmit_done = i40e_xmit_zc(tx_ring, budget);
+
+	return work_done && xmit_done;
+}
+
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -778,8 +913,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget)
 {
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
@@ -874,27 +1009,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
 
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring, false);
-
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
@@ -2266,15 +2383,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
-{
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.
-	 */
-	wmb();
-	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
-}
-
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2904,10 +3012,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_tx_irq(vsi, ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
+
 		arm_wb |= ring->arm_wb;
 		ring->arm_wb = false;
 	}
@@ -3810,6 +3919,30 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	return -1;
 }
 
+/**
+ * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
 /**
  * i40e_xmit_xdp_ring - transmits an XDP buffer to an XDP Tx ring
  * @xdp: data to transmit
@@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return -ENXIO;
+
 	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
 	if (err != I40E_XDP_TX)
 		return -ENOSPC;
@@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return;
 
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return;
+
 	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
 }
+
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -ENETDOWN;
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return -ENXIO;
+
+	if (queue_id >= vsi->num_queue_pairs)
+		return -ENXIO;
+
+	if (!vsi->xdp_rings[queue_id]->xsk_umem)
+		return -ENXIO;
+
+	ring = vsi->xdp_rings[queue_id];
+
+	if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
+		i40e_force_wb(vsi, ring->q_vector);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 9d5d9862e9f1..ea1cac00cad4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -428,6 +428,7 @@ struct i40e_ring {
 
 	int (*clean_rx_irq)(struct i40e_ring *, int);
 	bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
+	bool (*clean_tx_irq)(struct i40e_vsi *, struct i40e_ring *, int);
 	struct xdp_umem *xsk_umem;
 
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
@@ -510,6 +511,11 @@ void i40e_xdp_flush(struct net_device *dev);
 int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
 int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
 void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget);
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget);
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx
  2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-15 20:25     ` Alexander Duyck
  -1 siblings, 0 replies; 54+ messages in thread
From: Alexander Duyck @ 2018-05-15 20:25 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, Karlsson, Magnus, Duyck, Alexander H,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, qi.z.zhang, intel-wired-lan

On Tue, May 15, 2018 at 12:06 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> A lot of things here. First we add support for the new
> XDP_SETUP_XSK_UMEM command in ndo_bpf. This allows the AF_XDP socket
> to pass a UMEM to the driver. The driver will then DMA map all the
> frames in the UMEM. Next, the Rx code will allocate
> frames from the UMEM fill queue, instead of the regular page
> allocator.
>
> Externally, for the rest of the XDP code, the driver-internal UMEM
> allocator will appear as a MEM_TYPE_ZERO_COPY.
>
> Keep in mind that having frames coming from userland requires some
> extra care when passing them to the regular kernel stack. In
> these cases the ZC frame must be copied.
>
> The commit also introduces completely new clean_rx_irq/allocator
> functions for zero-copy, and a means (function pointers) to set the
> allocators and clean_rx functions.
>
> Finally, a lot of this is *not* implemented here. To mention some:
>
> * No passing to the stack via XDP_PASS (clone/copy to skb).
> * No XDP redirect to targets other than sockets (convert_to_xdp_frame does not
>   clone the frame yet).
>
> And yes, too much C&P and too big commit. :-)
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

A few minor comments below.

> ---
>  drivers/net/ethernet/intel/i40e/i40e.h      |  20 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 202 +++++++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 400 ++++++++++++++++++++++++++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  30 ++-
>  4 files changed, 619 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 7a80652e2500..e6ee6c9bf094 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -786,6 +786,12 @@ struct i40e_vsi {
>
>         /* VSI specific handlers */
>         irqreturn_t (*irq_handler)(int irq, void *data);
> +
> +       /* AF_XDP zero-copy */
> +       struct xdp_umem **xsk_umems;
> +       u16 num_xsk_umems_used;
> +       u16 num_xsk_umems;
> +
>  } ____cacheline_internodealigned_in_smp;
>
>  struct i40e_netdev_priv {
> @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
>         return !!vsi->xdp_prog;
>  }
>
> +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> +{
> +       bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> +       int qid = ring->queue_index;
> +
> +       if (ring_is_xdp(ring))
> +               qid -= ring->vsi->alloc_queue_pairs;
> +
> +       if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> +               return NULL;
> +
> +       return ring->vsi->xsk_umems[qid];
> +}
> +
>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
>  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index b4c23cf3979c..dc3d668a741e 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include <linux/of_net.h>
>  #include <linux/pci.h>
>  #include <linux/bpf.h>
> +#include <net/xdp_sock.h>
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -3054,6 +3055,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> +       if (ring_is_xdp(ring))
> +               ring->xsk_umem = i40e_xsk_umem(ring);
> +
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
>                 ring->atr_sample_rate = vsi->back->atr_sample_rate;
> @@ -3163,13 +3167,31 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         struct i40e_hw *hw = &vsi->back->hw;
>         struct i40e_hmc_obj_rxq rx_ctx;
>         i40e_status err = 0;
> +       int ret;
>
>         bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
>
>         /* clear the context structure first */
>         memset(&rx_ctx, 0, sizeof(rx_ctx));
>
> -       ring->rx_buf_len = vsi->rx_buf_len;
> +       ring->xsk_umem = i40e_xsk_umem(ring);
> +       if (ring->xsk_umem) {
> +               ring->clean_rx_irq = i40e_clean_rx_irq_zc;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
> +               ring->rx_buf_len = ring->xsk_umem->props.frame_size -
> +                                  ring->xsk_umem->frame_headroom -
> +                                  XDP_PACKET_HEADROOM;
> +               ring->zca.free = i40e_zca_free;
> +               ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> +                                                MEM_TYPE_ZERO_COPY,
> +                                                &ring->zca);
> +               if (ret)
> +                       return ret;
> +       } else {
> +               ring->clean_rx_irq = i40e_clean_rx_irq;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
> +               ring->rx_buf_len = vsi->rx_buf_len;
> +       }
>
>         rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
>                                     BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> @@ -3225,7 +3247,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>         writel(0, ring->tail);
>
> -       i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
> +       ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>
>         return 0;
>  }
> @@ -12050,6 +12072,179 @@ static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
>         return err;
>  }
>
> +static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
> +{
> +       if (vsi->xsk_umems)
> +               return 0;
> +
> +       vsi->num_xsk_umems_used = 0;
> +       vsi->num_xsk_umems = vsi->alloc_queue_pairs;
> +       vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
> +                                GFP_KERNEL);
> +       if (!vsi->xsk_umems) {
> +               vsi->num_xsk_umems = 0;
> +               return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                            u16 qid)
> +{
> +       int err;
> +
> +       err = i40e_alloc_xsk_umems(vsi);
> +       if (err)
> +               return err;
> +
> +       vsi->xsk_umems[qid] = umem;
> +       vsi->num_xsk_umems_used++;
> +
> +       return 0;
> +}
> +
> +static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
> +{
> +       vsi->xsk_umems[qid] = NULL;
> +       vsi->num_xsk_umems_used--;
> +
> +       if (vsi->num_xsk_umems == 0) {
> +               kfree(vsi->xsk_umems);
> +               vsi->xsk_umems = NULL;
> +               vsi->num_xsk_umems = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i, j;
> +       dma_addr_t dma;
> +
> +       dev = &pf->pdev->dev;
> +
> +       for (i = 0; i < umem->props.nframes; i++) {
> +               dma = dma_map_single_attrs(dev, umem->frames[i].addr,
> +                                          umem->props.frame_size,
> +                                          DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +               if (dma_mapping_error(dev, dma))
> +                       goto out_unmap;
> +
> +               umem->frames[i].dma = dma;
> +       }
> +
> +       return 0;
> +
> +out_unmap:
> +       for (j = 0; j < i; j++) {
> +               dma_unmap_single_attrs(dev, umem->frames[j].dma,
> +                                      umem->props.frame_size,
> +                                      DMA_BIDIRECTIONAL,
> +                                      I40E_RX_DMA_ATTR);
> +               umem->frames[j].dma = 0;
> +       }
> +
> +       return -1;
> +}
> +
> +static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i;
> +
> +       dev = &pf->pdev->dev;
> +
> +       for (i = 0; i < umem->props.nframes; i++) {
> +               dma_unmap_single_attrs(dev, umem->frames[i].dma,
> +                                      umem->props.frame_size,
> +                                      DMA_BIDIRECTIONAL,
> +                                      I40E_RX_DMA_ATTR);
> +
> +               umem->frames[i].dma = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                               u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (vsi->type != I40E_VSI_MAIN)
> +               return -EINVAL;
> +
> +       if (qid >= vsi->num_queue_pairs)
> +               return -EINVAL;
> +
> +       if (vsi->xsk_umems && vsi->xsk_umems[qid])
> +               return -EBUSY;
> +
> +       err = i40e_xsk_umem_dma_map(vsi, umem);
> +       if (err)
> +               return err;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       err = i40e_add_xsk_umem(vsi, umem, qid);
> +       if (err)
> +               return err;
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
> +           !vsi->xsk_umems[qid])
> +               return -EINVAL;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
> +       i40e_remove_xsk_umem(vsi, qid);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                              u16 qid)
> +{
> +       if (umem)
> +               return i40e_xsk_umem_enable(vsi, umem, qid);
> +
> +       return i40e_xsk_umem_disable(vsi, qid);
> +}
> +
>  /**
>   * i40e_xdp - implements ndo_bpf for i40e
>   * @dev: netdevice
> @@ -12071,6 +12266,9 @@ static int i40e_xdp(struct net_device *dev,
>                 xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>                 xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
>                 return 0;
> +       case XDP_SETUP_XSK_UMEM:
> +               return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
> +                                          xdp->xsk.queue_id);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 5efa68de935b..f89ac524652c 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -5,6 +5,7 @@
>  #include <net/busy_poll.h>
>  #include <linux/bpf_trace.h>
>  #include <net/xdp.h>
> +#include <net/xdp_sock.h>
>  #include "i40e.h"
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
> @@ -1373,31 +1374,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>         }
>
>         /* Free all the Rx ring sk_buffs */
> -       for (i = 0; i < rx_ring->count; i++) {
> -               struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> +       if (!rx_ring->xsk_umem) {
> +               for (i = 0; i < rx_ring->count; i++) {

I'm not a fan of all this extra indenting. This could be much more
easily handled with just a goto and a label.
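
A minimal sketch of that shape (illustrative only, not code taken from
the patch) would branch over the existing loop for zero-copy rings and
leave the original indentation alone:

	if (rx_ring->xsk_umem)
		goto skip_free;

	/* Free all the Rx ring sk_buffs */
	for (i = 0; i < rx_ring->count; i++) {
		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];

		/* ... existing unmap/free body, unchanged ... */
	}

skip_free:
	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
	memset(rx_ring->rx_bi, 0, bi_size);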

> +                       struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
>
> -               if (!rx_bi->page)
> -                       continue;
> -
> -               /* Invalidate cache lines that may have been written to by
> -                * device so that we avoid corrupting memory.
> -                */
> -               dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                             rx_bi->dma,
> -                                             rx_bi->page_offset,
> -                                             rx_ring->rx_buf_len,
> -                                             DMA_FROM_DEVICE);
> -
> -               /* free resources associated with mapping */
> -               dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> -                                    i40e_rx_pg_size(rx_ring),
> -                                    DMA_FROM_DEVICE,
> -                                    I40E_RX_DMA_ATTR);
> -
> -               __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
> +                       if (!rx_bi->page)
> +                               continue;
>
> -               rx_bi->page = NULL;
> -               rx_bi->page_offset = 0;
> +                       /* Invalidate cache lines that may have been
> +                        * written to by device so that we avoid
> +                        * corrupting memory.
> +                        */
> +                       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                                     rx_bi->dma,
> +                                                     rx_bi->page_offset,
> +                                                     rx_ring->rx_buf_len,
> +                                                     DMA_FROM_DEVICE);
> +
> +                       /* free resources associated with mapping */
> +                       dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> +                                            i40e_rx_pg_size(rx_ring),
> +                                            DMA_FROM_DEVICE,
> +                                            I40E_RX_DMA_ATTR);
> +
> +                       __page_frag_cache_drain(rx_bi->page,
> +                                               rx_bi->pagecnt_bias);
> +
> +                       rx_bi->page = NULL;
> +                       rx_bi->page_offset = 0;
> +               }
>         }
>
>         bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
> @@ -2214,8 +2219,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>         if (!xdp_prog)
>                 goto xdp_out;
>
> -       prefetchw(xdp->data_hard_start); /* xdp_frame write */
> -
>         act = bpf_prog_run_xdp(xdp_prog, xdp);
>         switch (act) {
>         case XDP_PASS:
> @@ -2284,7 +2287,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>   *
>   * Returns amount of work completed
>   **/
> -static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>  {
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>         struct sk_buff *skb = rx_ring->skb;
> @@ -2426,6 +2429,349 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>         return failure ? budget : (int)total_rx_packets;
>  }
>

How much of the code below is actually reused anywhere else? I would
almost be inclined to say that maybe the zero-copy path should be
moved to a new file since so much of this is being duplicated from the
original tx/rx code path. I can easily see this becoming confusing as
to which is which when a bug gets found and needs to be fixed.
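
Hypothetically, such a split could keep only the zero-copy data path in
a separate translation unit, say an i40e_xsk.c with a small header (the
file names are illustrative, nothing this patch introduces), while
i40e_txrx.c keeps the copy-based paths and the helpers both sides share
(i40e_xdp_ring_update_tail() and friends):

	/* i40e_xsk.h -- AF_XDP zero-copy data path (illustrative) */
	bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
	int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
	void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);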

> +static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
> +                                      struct xdp_buff *xdp)
> +{
> +       int err, result = I40E_XDP_PASS;
> +       struct i40e_ring *xdp_ring;
> +       struct bpf_prog *xdp_prog;
> +       u32 act;
> +
> +       rcu_read_lock();
> +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> +
> +       act = bpf_prog_run_xdp(xdp_prog, xdp);
> +       switch (act) {
> +       case XDP_PASS:
> +               break;
> +       case XDP_TX:
> +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +               result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
> +               break;
> +       case XDP_REDIRECT:
> +               err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> +               result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> +               break;
> +       default:
> +               bpf_warn_invalid_xdp_action(act);
> +       case XDP_ABORTED:
> +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
> +               /* fallthrough -- handle aborts by dropping packet */
> +       case XDP_DROP:
> +               result = I40E_XDP_CONSUMED;
> +               break;
> +       }
> +
> +       rcu_read_unlock();
> +       return ERR_PTR(-result);
> +}
> +
> +static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
> +                               struct i40e_rx_buffer *bi)
> +{
> +       struct xdp_umem *umem = rx_ring->xsk_umem;
> +       void *addr = bi->addr;
> +       u32 *id;
> +
> +       if (addr) {
> +               rx_ring->rx_stats.page_reuse_count++;
> +               return true;
> +       }
> +
> +       id = xsk_umem_peek_id(umem);
> +       if (unlikely(!id)) {
> +               rx_ring->rx_stats.alloc_page_failed++;
> +               return false;
> +       }
> +
> +       bi->dma = umem->frames[*id].dma + umem->frame_headroom +
> +                 XDP_PACKET_HEADROOM;
> +       bi->addr = umem->frames[*id].addr + umem->frame_headroom +
> +                 XDP_PACKET_HEADROOM;
> +       bi->id = *id;
> +
> +       xsk_umem_discard_id(umem);
> +       return true;
> +}
> +
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
> +{
> +       u16 ntu = rx_ring->next_to_use;
> +       union i40e_rx_desc *rx_desc;
> +       struct i40e_rx_buffer *bi;
> +
> +       rx_desc = I40E_RX_DESC(rx_ring, ntu);
> +       bi = &rx_ring->rx_bi[ntu];
> +
> +       do {
> +               if (!i40e_alloc_frame_zc(rx_ring, bi))
> +                       goto no_buffers;
> +
> +               /* sync the buffer for use by the device */
> +               dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
> +                                                rx_ring->rx_buf_len,
> +                                                DMA_BIDIRECTIONAL);
> +
> +               /* Refresh the desc even if buffer_addrs didn't change
> +                * because each write-back erases this info.
> +                */
> +               rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
> +
> +               rx_desc++;
> +               bi++;
> +               ntu++;
> +               if (unlikely(ntu == rx_ring->count)) {
> +                       rx_desc = I40E_RX_DESC(rx_ring, 0);
> +                       bi = rx_ring->rx_bi;
> +                       ntu = 0;
> +               }
> +
> +               /* clear the status bits for the next_to_use descriptor */
> +               rx_desc->wb.qword1.status_error_len = 0;
> +
> +               cleaned_count--;
> +       } while (cleaned_count);
> +
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       return false;
> +
> +no_buffers:
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       /* make sure to come back via polling to try again after
> +        * allocation failure
> +        */
> +       return true;
> +}
> +
> +static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                                   const unsigned int size)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +
> +       rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> +
> +       /* we are reusing so sync this buffer for CPU use */
> +       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                     rx_buffer->dma, 0,
> +                                     size,
> +                                     DMA_BIDIRECTIONAL);
> +
> +       return rx_buffer;
> +}
> +
> +static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                   struct i40e_rx_buffer *old_buff)
> +{
> +       struct i40e_rx_buffer *new_buff;
> +       u16 nta = rx_ring->next_to_alloc;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       /* transfer page from old buffer to new buffer */
> +       new_buff->dma  = old_buff->dma;
> +       new_buff->addr = old_buff->addr;
> +       new_buff->id   = old_buff->id;
> +}
> +
> +/* Called from the XDP return API in NAPI context. */
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
> +{
> +       struct i40e_rx_buffer *new_buff;
> +       struct i40e_ring *rx_ring;
> +       u16 nta;
> +
> +       rx_ring = container_of(alloc, struct i40e_ring, zca);
> +       nta = rx_ring->next_to_alloc;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       new_buff->dma  = rx_ring->xsk_umem->frames[handle].dma;
> +       new_buff->addr = rx_ring->xsk_umem->frames[handle].addr;
> +       new_buff->id   = (u32)handle;
> +}
> +
> +static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
> +                                           struct i40e_rx_buffer *rx_buffer,
> +                                           struct xdp_buff *xdp)
> +{
> +       // XXX implement alloc skb and copy
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       return NULL;
> +}
> +
> +static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
> +                                            union i40e_rx_desc *rx_desc,
> +                                            u64 qw)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +       u32 ntc = rx_ring->next_to_clean;
> +       u8 id;
> +
> +       /* fetch, update, and store next to clean */
> +       rx_buffer = &rx_ring->rx_bi[ntc++];
> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
> +       rx_ring->next_to_clean = ntc;
> +
> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +       /* place unused page back on the ring */
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       rx_ring->rx_stats.page_reuse_count++;
> +
> +       /* clear contents of buffer_info */
> +       rx_buffer->addr = NULL;
> +
> +       id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
> +                 I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
> +
> +       if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> +               i40e_fd_handle_status(rx_ring, rx_desc, id);
> +}
> +
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> +{
> +       unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> +       bool failure = false, xdp_xmit = false;
> +       struct sk_buff *skb;
> +       struct xdp_buff xdp;
> +
> +       xdp.rxq = &rx_ring->xdp_rxq;
> +
> +       while (likely(total_rx_packets < (unsigned int)budget)) {
> +               struct i40e_rx_buffer *rx_buffer;
> +               union i40e_rx_desc *rx_desc;
> +               unsigned int size;
> +               u16 vlan_tag;
> +               u8 rx_ptype;
> +               u64 qword;
> +               u32 ntc;
> +
> +               /* return some buffers to hardware, one at a time is too slow */
> +               if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> +                       failure = failure ||
> +                                 i40e_alloc_rx_buffers_zc(rx_ring,
> +                                                          cleaned_count);
> +                       cleaned_count = 0;
> +               }
> +
> +               rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
> +
> +               /* status_error_len will always be zero for unused descriptors
> +                * because it's cleared in cleanup, and overlaps with hdr_addr
> +                * which is always zero because packet split isn't used, if the
> +                * hardware wrote DD then the length will be non-zero
> +                */
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +
> +               /* This memory barrier is needed to keep us from reading
> +                * any other fields out of the rx_desc until we have
> +                * verified the descriptor has been written back.
> +                */
> +               dma_rmb();
> +
> +               if (unlikely(i40e_rx_is_programming_status(qword))) {
> +                       i40e_clean_programming_status_zc(rx_ring, rx_desc,
> +                                                        qword);
> +                       cleaned_count++;
> +                       continue;
> +               }
> +               size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> +                      I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
> +               if (!size)
> +                       break;
> +
> +               rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
> +
> +               /* retrieve a buffer from the ring */
> +               xdp.data = rx_buffer->addr;
> +               xdp_set_data_meta_invalid(&xdp);
> +               xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
> +               xdp.data_end = xdp.data + size;
> +               xdp.handle = rx_buffer->id;
> +
> +               skb = i40e_run_xdp_zc(rx_ring, &xdp);
> +
> +               if (IS_ERR(skb)) {
> +                       if (PTR_ERR(skb) == -I40E_XDP_TX)
> +                               xdp_xmit = true;
> +                       else
> +                               i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +                       total_rx_bytes += size;
> +                       total_rx_packets++;
> +               } else {
> +                       skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
> +                       if (!skb) {
> +                               rx_ring->rx_stats.alloc_buff_failed++;
> +                               break;
> +                       }
> +               }
> +
> +               rx_buffer->addr = NULL;
> +               cleaned_count++;
> +
> +               /* don't care about non-EOP frames in XDP mode */
> +               ntc = rx_ring->next_to_clean + 1;
> +               ntc = (ntc < rx_ring->count) ? ntc : 0;
> +               rx_ring->next_to_clean = ntc;
> +               prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +               if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
> +                       skb = NULL;
> +                       continue;
> +               }
> +
> +               /* probably a little skewed due to removing CRC */
> +               total_rx_bytes += skb->len;
> +
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +               rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
> +                          I40E_RXD_QW1_PTYPE_SHIFT;
> +
> +               /* populate checksum, VLAN, and protocol */
> +               i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
> +
> +               vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
> +                          le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
> +
> +               i40e_receive_skb(rx_ring, skb, vlan_tag);
> +               skb = NULL;
> +
> +               /* update budget accounting */
> +               total_rx_packets++;
> +       }
> +
> +       if (xdp_xmit) {
> +               struct i40e_ring *xdp_ring =
> +                       rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +               xdp_do_flush_map();
> +       }
> +
> +       u64_stats_update_begin(&rx_ring->syncp);
> +       rx_ring->stats.packets += total_rx_packets;
> +       rx_ring->stats.bytes += total_rx_bytes;
> +       u64_stats_update_end(&rx_ring->syncp);
> +       rx_ring->q_vector->rx.total_packets += total_rx_packets;
> +       rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
> +
> +       /* guarantee a trip back through this routine if there was a failure */
> +       return failure ? budget : (int)total_rx_packets;
> +}
> +
>  static inline u32 i40e_buildreg_itr(const int type, u16 itr)
>  {
>         u32 val;
> @@ -2576,7 +2922,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>         budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
>
>         i40e_for_each_ring(ring, q_vector->rx) {
> -               int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
> +               int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
>
>                 work_done += cleaned;
>                 /* if we clean as many as budgeted, we must not be done */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index fdd2c55f03a6..9d5d9862e9f1 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -296,13 +296,22 @@ struct i40e_tx_buffer {
>
>  struct i40e_rx_buffer {
>         dma_addr_t dma;
> -       struct page *page;
> +       union {
> +               struct {
> +                       struct page *page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> -       __u32 page_offset;
> +                       __u32 page_offset;
>  #else
> -       __u16 page_offset;
> +                       __u16 page_offset;
>  #endif
> -       __u16 pagecnt_bias;
> +                       __u16 pagecnt_bias;
> +               };
> +               struct {
> +                       /* for umem */
> +                       void *addr;
> +                       u32 id;
> +               };
> +       };
>  };
>
>  struct i40e_queue_stats {
> @@ -344,6 +353,8 @@ enum i40e_ring_state_t {
>  #define I40E_RX_SPLIT_TCP_UDP 0x4
>  #define I40E_RX_SPLIT_SCTP    0x8
>
> +void i40e_zc_recycle(struct zero_copy_allocator *alloc, unsigned long handle);
> +
>  /* struct that defines a descriptor ring, associated with a VSI */
>  struct i40e_ring {
>         struct i40e_ring *next;         /* pointer to next ring in q_vector */
> @@ -414,6 +425,12 @@ struct i40e_ring {
>
>         struct i40e_channel *ch;
>         struct xdp_rxq_info xdp_rxq;
> +
> +       int (*clean_rx_irq)(struct i40e_ring *, int);
> +       bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
> +       struct xdp_umem *xsk_umem;
> +
> +       struct zero_copy_allocator zca; /* ZC allocator anchor */
>  } ____cacheline_internodealigned_in_smp;
>
>  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> @@ -474,6 +491,7 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
>  #define i40e_rx_pg_size(_ring) (PAGE_SIZE << i40e_rx_pg_order(_ring))
>
>  bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
>  netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>  void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
>  void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
> @@ -489,6 +507,9 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
>  bool __i40e_chk_linearize(struct sk_buff *skb);
>  int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
>  void i40e_xdp_flush(struct net_device *dev);
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -575,4 +596,5 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
>  {
>         return netdev_get_tx_queue(ring->netdev, ring->queue_index);
>  }
> +
>  #endif /* _I40E_TXRX_H_ */
> --
> 2.14.1
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Intel-wired-lan] [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx
@ 2018-05-15 20:25     ` Alexander Duyck
  0 siblings, 0 replies; 54+ messages in thread
From: Alexander Duyck @ 2018-05-15 20:25 UTC (permalink / raw)
  To: intel-wired-lan

On Tue, May 15, 2018 at 12:06 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> A lot of things here. First we add support for the new
> XDP_SETUP_XSK_UMEM command in ndo_bpf. This allows the AF_XDP socket
> to pass a UMEM to the driver. The driver will then DMA map all the
> frames in the UMEM. Next, the Rx code will allocate
> frames from the UMEM fill queue, instead of the regular page
> allocator.
>
> Externally, for the rest of the XDP code, the driver-internal UMEM
> allocator will appear as a MEM_TYPE_ZERO_COPY.
>
> Keep in mind that having frames coming from userland requires some
> extra care when passing them to the regular kernel stack. In
> these cases the ZC frame must be copied.
>
> The commit also introduces completely new clean_rx_irq/allocator
> functions for zero-copy, and a means (function pointers) to set the
> allocators and clean_rx functions.
>
> Finally, a lot of this is *not* implemented here. To mention some:
>
> * No passing to the stack via XDP_PASS (clone/copy to skb).
> * No XDP redirect to targets other than sockets (convert_to_xdp_frame does not
>   clone the frame yet).
>
> And yes, too much C&P and too big commit. :-)
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

A few minor comments below.

> ---
>  drivers/net/ethernet/intel/i40e/i40e.h      |  20 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 202 +++++++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 400 ++++++++++++++++++++++++++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  30 ++-
>  4 files changed, 619 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 7a80652e2500..e6ee6c9bf094 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -786,6 +786,12 @@ struct i40e_vsi {
>
>         /* VSI specific handlers */
>         irqreturn_t (*irq_handler)(int irq, void *data);
> +
> +       /* AF_XDP zero-copy */
> +       struct xdp_umem **xsk_umems;
> +       u16 num_xsk_umems_used;
> +       u16 num_xsk_umems;
> +
>  } ____cacheline_internodealigned_in_smp;
>
>  struct i40e_netdev_priv {
> @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
>         return !!vsi->xdp_prog;
>  }
>
> +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> +{
> +       bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> +       int qid = ring->queue_index;
> +
> +       if (ring_is_xdp(ring))
> +               qid -= ring->vsi->alloc_queue_pairs;
> +
> +       if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> +               return NULL;
> +
> +       return ring->vsi->xsk_umems[qid];
> +}
> +
>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
>  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index b4c23cf3979c..dc3d668a741e 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include <linux/of_net.h>
>  #include <linux/pci.h>
>  #include <linux/bpf.h>
> +#include <net/xdp_sock.h>
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -3054,6 +3055,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>         i40e_status err = 0;
>         u32 qtx_ctl = 0;
>
> +       if (ring_is_xdp(ring))
> +               ring->xsk_umem = i40e_xsk_umem(ring);
> +
>         /* some ATR related tx ring init */
>         if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
>                 ring->atr_sample_rate = vsi->back->atr_sample_rate;
> @@ -3163,13 +3167,31 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         struct i40e_hw *hw = &vsi->back->hw;
>         struct i40e_hmc_obj_rxq rx_ctx;
>         i40e_status err = 0;
> +       int ret;
>
>         bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
>
>         /* clear the context structure first */
>         memset(&rx_ctx, 0, sizeof(rx_ctx));
>
> -       ring->rx_buf_len = vsi->rx_buf_len;
> +       ring->xsk_umem = i40e_xsk_umem(ring);
> +       if (ring->xsk_umem) {
> +               ring->clean_rx_irq = i40e_clean_rx_irq_zc;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
> +               ring->rx_buf_len = ring->xsk_umem->props.frame_size -
> +                                  ring->xsk_umem->frame_headroom -
> +                                  XDP_PACKET_HEADROOM;
> +               ring->zca.free = i40e_zca_free;
> +               ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> +                                                MEM_TYPE_ZERO_COPY,
> +                                                &ring->zca);
> +               if (ret)
> +                       return ret;
> +       } else {
> +               ring->clean_rx_irq = i40e_clean_rx_irq;
> +               ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
> +               ring->rx_buf_len = vsi->rx_buf_len;
> +       }
>
>         rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
>                                     BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> @@ -3225,7 +3247,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>         ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
>         writel(0, ring->tail);
>
> -       i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
> +       ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
>
>         return 0;
>  }
> @@ -12050,6 +12072,179 @@ static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
>         return err;
>  }
>
> +static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
> +{
> +       if (vsi->xsk_umems)
> +               return 0;
> +
> +       vsi->num_xsk_umems_used = 0;
> +       vsi->num_xsk_umems = vsi->alloc_queue_pairs;
> +       vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
> +                                GFP_KERNEL);
> +       if (!vsi->xsk_umems) {
> +               vsi->num_xsk_umems = 0;
> +               return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                            u16 qid)
> +{
> +       int err;
> +
> +       err = i40e_alloc_xsk_umems(vsi);
> +       if (err)
> +               return err;
> +
> +       vsi->xsk_umems[qid] = umem;
> +       vsi->num_xsk_umems_used++;
> +
> +       return 0;
> +}
> +
> +static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
> +{
> +       vsi->xsk_umems[qid] = NULL;
> +       vsi->num_xsk_umems_used--;
> +
> +       if (vsi->num_xsk_umems == 0) {
> +               kfree(vsi->xsk_umems);
> +               vsi->xsk_umems = NULL;
> +               vsi->num_xsk_umems = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i, j;
> +       dma_addr_t dma;
> +
> +       dev = &pf->pdev->dev;
> +
> +       for (i = 0; i < umem->props.nframes; i++) {
> +               dma = dma_map_single_attrs(dev, umem->frames[i].addr,
> +                                          umem->props.frame_size,
> +                                          DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
> +               if (dma_mapping_error(dev, dma))
> +                       goto out_unmap;
> +
> +               umem->frames[i].dma = dma;
> +       }
> +
> +       return 0;
> +
> +out_unmap:
> +       for (j = 0; j < i; j++) {
> +               dma_unmap_single_attrs(dev, umem->frames[j].dma,
> +                                      umem->props.frame_size,
> +                                      DMA_BIDIRECTIONAL,
> +                                      I40E_RX_DMA_ATTR);
> +               umem->frames[j].dma = 0;
> +       }
> +
> +       return -1;
> +}
> +
> +static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
> +{
> +       struct i40e_pf *pf = vsi->back;
> +       struct device *dev;
> +       unsigned int i;
> +
> +       dev = &pf->pdev->dev;
> +
> +       for (i = 0; i < umem->props.nframes; i++) {
> +               dma_unmap_single_attrs(dev, umem->frames[i].dma,
> +                                      umem->props.frame_size,
> +                                      DMA_BIDIRECTIONAL,
> +                                      I40E_RX_DMA_ATTR);
> +
> +               umem->frames[i].dma = 0;
> +       }
> +}
> +
> +static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                               u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (vsi->type != I40E_VSI_MAIN)
> +               return -EINVAL;
> +
> +       if (qid >= vsi->num_queue_pairs)
> +               return -EINVAL;
> +
> +       if (vsi->xsk_umems && vsi->xsk_umems[qid])
> +               return -EBUSY;
> +
> +       err = i40e_xsk_umem_dma_map(vsi, umem);
> +       if (err)
> +               return err;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       err = i40e_add_xsk_umem(vsi, umem, qid);
> +       if (err)
> +               return err;
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
> +{
> +       bool if_running;
> +       int err;
> +
> +       if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
> +           !vsi->xsk_umems[qid])
> +               return -EINVAL;
> +
> +       if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_disable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
> +       i40e_remove_xsk_umem(vsi, qid);
> +
> +       if (if_running) {
> +               err = i40e_queue_pair_enable(vsi, qid);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> +                              u16 qid)
> +{
> +       if (umem)
> +               return i40e_xsk_umem_enable(vsi, umem, qid);
> +
> +       return i40e_xsk_umem_disable(vsi, qid);
> +}
> +
>  /**
>   * i40e_xdp - implements ndo_bpf for i40e
>   * @dev: netdevice
> @@ -12071,6 +12266,9 @@ static int i40e_xdp(struct net_device *dev,
>                 xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
>                 xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
>                 return 0;
> +       case XDP_SETUP_XSK_UMEM:
> +               return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
> +                                          xdp->xsk.queue_id);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 5efa68de935b..f89ac524652c 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -5,6 +5,7 @@
>  #include <net/busy_poll.h>
>  #include <linux/bpf_trace.h>
>  #include <net/xdp.h>
> +#include <net/xdp_sock.h>
>  #include "i40e.h"
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
> @@ -1373,31 +1374,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
>         }
>
>         /* Free all the Rx ring sk_buffs */
> -       for (i = 0; i < rx_ring->count; i++) {
> -               struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> +       if (!rx_ring->xsk_umem) {
> +               for (i = 0; i < rx_ring->count; i++) {

I'm not a fan of all this extra indenting. This could be much more
easily handled with just a goto and a label.
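
Something like this (sketch only, the label name is made up and the
existing loop body is kept as-is) keeps both the diff and the
indentation small:

        /* Free all the Rx ring sk_buffs (not used in zero-copy mode) */
        if (rx_ring->xsk_umem)
                goto skip_free;

        for (i = 0; i < rx_ring->count; i++) {
                struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];

                if (!rx_bi->page)
                        continue;

                /* ... existing sync/unmap/__page_frag_cache_drain ... */

                rx_bi->page = NULL;
                rx_bi->page_offset = 0;
        }

skip_free:
        bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;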

> +                       struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
>
> -               if (!rx_bi->page)
> -                       continue;
> -
> -               /* Invalidate cache lines that may have been written to by
> -                * device so that we avoid corrupting memory.
> -                */
> -               dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                             rx_bi->dma,
> -                                             rx_bi->page_offset,
> -                                             rx_ring->rx_buf_len,
> -                                             DMA_FROM_DEVICE);
> -
> -               /* free resources associated with mapping */
> -               dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> -                                    i40e_rx_pg_size(rx_ring),
> -                                    DMA_FROM_DEVICE,
> -                                    I40E_RX_DMA_ATTR);
> -
> -               __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
> +                       if (!rx_bi->page)
> +                               continue;
>
> -               rx_bi->page = NULL;
> -               rx_bi->page_offset = 0;
> +                       /* Invalidate cache lines that may have been
> +                        * written to by device so that we avoid
> +                        * corrupting memory.
> +                        */
> +                       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                                     rx_bi->dma,
> +                                                     rx_bi->page_offset,
> +                                                     rx_ring->rx_buf_len,
> +                                                     DMA_FROM_DEVICE);
> +
> +                       /* free resources associated with mapping */
> +                       dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> +                                            i40e_rx_pg_size(rx_ring),
> +                                            DMA_FROM_DEVICE,
> +                                            I40E_RX_DMA_ATTR);
> +
> +                       __page_frag_cache_drain(rx_bi->page,
> +                                               rx_bi->pagecnt_bias);
> +
> +                       rx_bi->page = NULL;
> +                       rx_bi->page_offset = 0;
> +               }
>         }
>
>         bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
> @@ -2214,8 +2219,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>         if (!xdp_prog)
>                 goto xdp_out;
>
> -       prefetchw(xdp->data_hard_start); /* xdp_frame write */
> -
>         act = bpf_prog_run_xdp(xdp_prog, xdp);
>         switch (act) {
>         case XDP_PASS:
> @@ -2284,7 +2287,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>   *
>   * Returns amount of work completed
>   **/
> -static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>  {
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>         struct sk_buff *skb = rx_ring->skb;
> @@ -2426,6 +2429,349 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>         return failure ? budget : (int)total_rx_packets;
>  }
>

How much of the code below is actually reused anywhere else? I would
almost be inclined to say that maybe the zero-copy path should be
moved to a new file since so much of this is being duplicated from the
original tx/rx code path. I can easily see this becoming confusing as
to which is which when a bug gets found and needs to be fixed.
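
FWIW, the clean_rx_irq/alloc_rx_buffers function pointers added to
struct i40e_ring later in this patch would make such a split fairly
mechanical; roughly like this (sketch only, the exact placement and the
zca callback field name are assumptions on my part):

        /* e.g. in a separate i40e_xsk.c: select the zero-copy path once,
         * when the ring is configured with a umem.
         */
        if (ring->xsk_umem) {
                ring->clean_rx_irq = i40e_clean_rx_irq_zc;
                ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
                ring->zca.free = i40e_zca_free; /* XDP return-API hook */
        } else {
                ring->clean_rx_irq = i40e_clean_rx_irq;
                ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
        }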

> +static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
> +                                      struct xdp_buff *xdp)
> +{
> +       int err, result = I40E_XDP_PASS;
> +       struct i40e_ring *xdp_ring;
> +       struct bpf_prog *xdp_prog;
> +       u32 act;
> +
> +       rcu_read_lock();
> +       xdp_prog = READ_ONCE(rx_ring->xdp_prog);
> +
> +       act = bpf_prog_run_xdp(xdp_prog, xdp);
> +       switch (act) {
> +       case XDP_PASS:
> +               break;
> +       case XDP_TX:
> +               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +               result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
> +               break;
> +       case XDP_REDIRECT:
> +               err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> +               result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> +               break;
> +       default:
> +               bpf_warn_invalid_xdp_action(act);
> +       case XDP_ABORTED:
> +               trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
> +               /* fallthrough -- handle aborts by dropping packet */
> +       case XDP_DROP:
> +               result = I40E_XDP_CONSUMED;
> +               break;
> +       }
> +
> +       rcu_read_unlock();
> +       return ERR_PTR(-result);
> +}
> +
> +static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
> +                               struct i40e_rx_buffer *bi)
> +{
> +       struct xdp_umem *umem = rx_ring->xsk_umem;
> +       void *addr = bi->addr;
> +       u32 *id;
> +
> +       if (addr) {
> +               rx_ring->rx_stats.page_reuse_count++;
> +               return true;
> +       }
> +
> +       id = xsk_umem_peek_id(umem);
> +       if (unlikely(!id)) {
> +               rx_ring->rx_stats.alloc_page_failed++;
> +               return false;
> +       }
> +
> +       bi->dma = umem->frames[*id].dma + umem->frame_headroom +
> +                 XDP_PACKET_HEADROOM;
> +       bi->addr = umem->frames[*id].addr + umem->frame_headroom +
> +                 XDP_PACKET_HEADROOM;
> +       bi->id = *id;
> +
> +       xsk_umem_discard_id(umem);
> +       return true;
> +}
> +
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
> +{
> +       u16 ntu = rx_ring->next_to_use;
> +       union i40e_rx_desc *rx_desc;
> +       struct i40e_rx_buffer *bi;
> +
> +       rx_desc = I40E_RX_DESC(rx_ring, ntu);
> +       bi = &rx_ring->rx_bi[ntu];
> +
> +       do {
> +               if (!i40e_alloc_frame_zc(rx_ring, bi))
> +                       goto no_buffers;
> +
> +               /* sync the buffer for use by the device */
> +               dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
> +                                                rx_ring->rx_buf_len,
> +                                                DMA_BIDIRECTIONAL);
> +
> +               /* Refresh the desc even if buffer_addrs didn't change
> +                * because each write-back erases this info.
> +                */
> +               rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
> +
> +               rx_desc++;
> +               bi++;
> +               ntu++;
> +               if (unlikely(ntu == rx_ring->count)) {
> +                       rx_desc = I40E_RX_DESC(rx_ring, 0);
> +                       bi = rx_ring->rx_bi;
> +                       ntu = 0;
> +               }
> +
> +               /* clear the status bits for the next_to_use descriptor */
> +               rx_desc->wb.qword1.status_error_len = 0;
> +
> +               cleaned_count--;
> +       } while (cleaned_count);
> +
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       return false;
> +
> +no_buffers:
> +       if (rx_ring->next_to_use != ntu)
> +               i40e_release_rx_desc(rx_ring, ntu);
> +
> +       /* make sure to come back via polling to try again after
> +        * allocation failure
> +        */
> +       return true;
> +}
> +
> +static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                                   const unsigned int size)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +
> +       rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> +
> +       /* we are reusing so sync this buffer for CPU use */
> +       dma_sync_single_range_for_cpu(rx_ring->dev,
> +                                     rx_buffer->dma, 0,
> +                                     size,
> +                                     DMA_BIDIRECTIONAL);
> +
> +       return rx_buffer;
> +}
> +
> +static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
> +                                   struct i40e_rx_buffer *old_buff)
> +{
> +       struct i40e_rx_buffer *new_buff;
> +       u16 nta = rx_ring->next_to_alloc;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       /* transfer page from old buffer to new buffer */
> +       new_buff->dma  = old_buff->dma;
> +       new_buff->addr = old_buff->addr;
> +       new_buff->id   = old_buff->id;
> +}
> +
> +/* Called from the XDP return API in NAPI context. */
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
> +{
> +       struct i40e_rx_buffer *new_buff;
> +       struct i40e_ring *rx_ring;
> +       u16 nta;
> +
> +       rx_ring = container_of(alloc, struct i40e_ring, zca);
> +       nta = rx_ring->next_to_alloc;
> +
> +       new_buff = &rx_ring->rx_bi[nta];
> +
> +       /* update, and store next to alloc */
> +       nta++;
> +       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> +       new_buff->dma  = rx_ring->xsk_umem->frames[handle].dma;
> +       new_buff->addr = rx_ring->xsk_umem->frames[handle].addr;
> +       new_buff->id   = (u32)handle;
> +}
> +
> +static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
> +                                           struct i40e_rx_buffer *rx_buffer,
> +                                           struct xdp_buff *xdp)
> +{
> +       // XXX implement alloc skb and copy
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       return NULL;
> +}
> +
> +static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
> +                                            union i40e_rx_desc *rx_desc,
> +                                            u64 qw)
> +{
> +       struct i40e_rx_buffer *rx_buffer;
> +       u32 ntc = rx_ring->next_to_clean;
> +       u8 id;
> +
> +       /* fetch, update, and store next to clean */
> +       rx_buffer = &rx_ring->rx_bi[ntc++];
> +       ntc = (ntc < rx_ring->count) ? ntc : 0;
> +       rx_ring->next_to_clean = ntc;
> +
> +       prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +       /* place unused page back on the ring */
> +       i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +       rx_ring->rx_stats.page_reuse_count++;
> +
> +       /* clear contents of buffer_info */
> +       rx_buffer->addr = NULL;
> +
> +       id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
> +                 I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
> +
> +       if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> +               i40e_fd_handle_status(rx_ring, rx_desc, id);
> +}
> +
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> +{
> +       unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> +       bool failure = false, xdp_xmit = false;
> +       struct sk_buff *skb;
> +       struct xdp_buff xdp;
> +
> +       xdp.rxq = &rx_ring->xdp_rxq;
> +
> +       while (likely(total_rx_packets < (unsigned int)budget)) {
> +               struct i40e_rx_buffer *rx_buffer;
> +               union i40e_rx_desc *rx_desc;
> +               unsigned int size;
> +               u16 vlan_tag;
> +               u8 rx_ptype;
> +               u64 qword;
> +               u32 ntc;
> +
> +               /* return some buffers to hardware, one at a time is too slow */
> +               if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
> +                       failure = failure ||
> +                                 i40e_alloc_rx_buffers_zc(rx_ring,
> +                                                          cleaned_count);
> +                       cleaned_count = 0;
> +               }
> +
> +               rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
> +
> +               /* status_error_len will always be zero for unused descriptors
> +                * because it's cleared in cleanup, and overlaps with hdr_addr
> +                * which is always zero because packet split isn't used, if the
> +                * hardware wrote DD then the length will be non-zero
> +                */
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +
> +               /* This memory barrier is needed to keep us from reading
> +                * any other fields out of the rx_desc until we have
> +                * verified the descriptor has been written back.
> +                */
> +               dma_rmb();
> +
> +               if (unlikely(i40e_rx_is_programming_status(qword))) {
> +                       i40e_clean_programming_status_zc(rx_ring, rx_desc,
> +                                                        qword);
> +                       cleaned_count++;
> +                       continue;
> +               }
> +               size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> +                      I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
> +               if (!size)
> +                       break;
> +
> +               rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
> +
> +               /* retrieve a buffer from the ring */
> +               xdp.data = rx_buffer->addr;
> +               xdp_set_data_meta_invalid(&xdp);
> +               xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
> +               xdp.data_end = xdp.data + size;
> +               xdp.handle = rx_buffer->id;
> +
> +               skb = i40e_run_xdp_zc(rx_ring, &xdp);
> +
> +               if (IS_ERR(skb)) {
> +                       if (PTR_ERR(skb) == -I40E_XDP_TX)
> +                               xdp_xmit = true;
> +                       else
> +                               i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
> +                       total_rx_bytes += size;
> +                       total_rx_packets++;
> +               } else {
> +                       skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
> +                       if (!skb) {
> +                               rx_ring->rx_stats.alloc_buff_failed++;
> +                               break;
> +                       }
> +               }
> +
> +               rx_buffer->addr = NULL;
> +               cleaned_count++;
> +
> +               /* don't care about non-EOP frames in XDP mode */
> +               ntc = rx_ring->next_to_clean + 1;
> +               ntc = (ntc < rx_ring->count) ? ntc : 0;
> +               rx_ring->next_to_clean = ntc;
> +               prefetch(I40E_RX_DESC(rx_ring, ntc));
> +
> +               if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
> +                       skb = NULL;
> +                       continue;
> +               }
> +
> +               /* probably a little skewed due to removing CRC */
> +               total_rx_bytes += skb->len;
> +
> +               qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
> +               rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
> +                          I40E_RXD_QW1_PTYPE_SHIFT;
> +
> +               /* populate checksum, VLAN, and protocol */
> +               i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
> +
> +               vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
> +                          le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
> +
> +               i40e_receive_skb(rx_ring, skb, vlan_tag);
> +               skb = NULL;
> +
> +               /* update budget accounting */
> +               total_rx_packets++;
> +       }
> +
> +       if (xdp_xmit) {
> +               struct i40e_ring *xdp_ring =
> +                       rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +
> +               i40e_xdp_ring_update_tail(xdp_ring);
> +               xdp_do_flush_map();
> +       }
> +
> +       u64_stats_update_begin(&rx_ring->syncp);
> +       rx_ring->stats.packets += total_rx_packets;
> +       rx_ring->stats.bytes += total_rx_bytes;
> +       u64_stats_update_end(&rx_ring->syncp);
> +       rx_ring->q_vector->rx.total_packets += total_rx_packets;
> +       rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
> +
> +       /* guarantee a trip back through this routine if there was a failure */
> +       return failure ? budget : (int)total_rx_packets;
> +}
> +
>  static inline u32 i40e_buildreg_itr(const int type, u16 itr)
>  {
>         u32 val;
> @@ -2576,7 +2922,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
>         budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
>
>         i40e_for_each_ring(ring, q_vector->rx) {
> -               int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
> +               int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
>
>                 work_done += cleaned;
>                 /* if we clean as many as budgeted, we must not be done */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index fdd2c55f03a6..9d5d9862e9f1 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -296,13 +296,22 @@ struct i40e_tx_buffer {
>
>  struct i40e_rx_buffer {
>         dma_addr_t dma;
> -       struct page *page;
> +       union {
> +               struct {
> +                       struct page *page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> -       __u32 page_offset;
> +                       __u32 page_offset;
>  #else
> -       __u16 page_offset;
> +                       __u16 page_offset;
>  #endif
> -       __u16 pagecnt_bias;
> +                       __u16 pagecnt_bias;
> +               };
> +               struct {
> +                       /* for umem */
> +                       void *addr;
> +                       u32 id;
> +               };
> +       };
>  };
>
>  struct i40e_queue_stats {
> @@ -344,6 +353,8 @@ enum i40e_ring_state_t {
>  #define I40E_RX_SPLIT_TCP_UDP 0x4
>  #define I40E_RX_SPLIT_SCTP    0x8
>
> +void i40e_zc_recycle(struct zero_copy_allocator *alloc, unsigned long handle);
> +
>  /* struct that defines a descriptor ring, associated with a VSI */
>  struct i40e_ring {
>         struct i40e_ring *next;         /* pointer to next ring in q_vector */
> @@ -414,6 +425,12 @@ struct i40e_ring {
>
>         struct i40e_channel *ch;
>         struct xdp_rxq_info xdp_rxq;
> +
> +       int (*clean_rx_irq)(struct i40e_ring *, int);
> +       bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
> +       struct xdp_umem *xsk_umem;
> +
> +       struct zero_copy_allocator zca; /* ZC allocator anchor */
>  } ____cacheline_internodealigned_in_smp;
>
>  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> @@ -474,6 +491,7 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
>  #define i40e_rx_pg_size(_ring) (PAGE_SIZE << i40e_rx_pg_order(_ring))
>
>  bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
> +bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
>  netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>  void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
>  void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
> @@ -489,6 +507,9 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
>  bool __i40e_chk_linearize(struct sk_buff *skb);
>  int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
>  void i40e_xdp_flush(struct net_device *dev);
> +int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
> +int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
> +void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
>
>  /**
>   * i40e_get_head - Retrieve head from head writeback
> @@ -575,4 +596,5 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
>  {
>         return netdev_get_tx_queue(ring->netdev, ring->queue_index);
>  }
> +
>  #endif /* _I40E_TXRX_H_ */
> --
> 2.14.1
>

* Re: [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-16 10:47   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-16 10:47 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, willemdebruijn.kernel,
	daniel, mst, netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, intel-wired-lan,
	brouer

On Tue, 15 May 2018 21:06:03 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are without
> retpoline so that we can compare against previous numbers. 
> 
> AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
> set are also reported for ease of reference.
> 
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.9*       9.6*       21.5
> txpush       2.6*       -          21.6
> l2fwd        1.9*       2.5*       15.0

These performance numbers are actually amazing.

When reaching these amazing/crazy speeds, where we are approaching the
speed of light (which travels 30 cm in 1 nanosec), we have to view these
numbers differently, because we are actually working on a nanosec scale.

21.5 Mpps is 46.5 nanosec per packet.

If we want to optimize for +1 Mpps, then (1/22.5*10^3=44.44ns) you
actually only have to optimize the code by 2 nanosec, and with this
2.0 GHz CPU that should in theory only be 4 cycles, but we likely retire
more than one instruction per cycle (I see around 2.5 ins per cycle), so
we are looking at (2*2*2.5) needing to find 10 instructions for +1Mpps.
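
(For reference, the conversion I keep doing here is trivial to script;
an illustrative user space helper, nothing kernel specific:)

        #include <stdio.h>

        int main(void)
        {
                double mpps = 21.5;     /* measured packet rate */
                double ghz  = 2.0;      /* CPU clock */
                double ipc  = 2.5;      /* observed instructions per cycle */
                double ns   = 1000.0 / mpps;    /* ns per packet */

                printf("%.2f ns/pkt, %.1f cycles, %.0f insns\n",
                       ns, ns * ghz, ns * ghz * ipc);
                return 0;
        }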

Comparing XDP_DROP at 32.3Mpps vs ZC-rxdrop at 21.5Mpps, this is
actually only a "slowdown" of 15.55 ns for having the frame travel
through xdp_do_redirect, do the map lookup etc, get queued into
userspace, and be returned back to the kernel.  That is rather
amazingly fast.

  (1/21.5*10^3)-(1/32.3*10^3) = 15.55 ns

Another performance number which is amazing is your l2fwd number of
15Mpps, because it is faster than xdp_redirect_map on i40e NICs on my
system, which runs at 12.2 Mpps (2.8Mpps slower).  Again looking at the
nanosec scale instead, this corresponds to 15.3 ns.
  I expect this improvement comes from avoiding page_frag_free, and
avoiding the TX dma_map call (as you premap the pages' DMA for TX).
Reverse calculating based on perf percentages, I find that these should
only cost 7.18 ns.  Maybe the rest is because you are running TX and
TX-dma completion on another CPU.

I notice you are also using the XDP return-API, which still does a
rhashtable_lookup per frame.  I plan to optimize this to do bulking, to
get away from per frame lookup.  Thus, this should get even faster.


> * From AF_XDP V3 patch set and cover letter.
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.1        3.3       3.3
> l2fwd        1.4        1.8       3.1
> 
> So why do we not get higher values for RX similar to the 34 Mpps we
> had in AF_PACKET V4? We made an experiment running the rxdrop
> benchmark without using the xdp_do_redirect/flush infrastructure nor
> using an XDP program (all traffic on a queue goes to one
> socket). Instead the driver acts directly on the AF_XDP socket. With
> this we got 36.9 Mpps, a significant improvement without any change to
> the uapi. So not forcing users to have an XDP program if they do not
> need it, might be a good idea. This measurement is actually higher
> than what we got with AF_PACKET V4.

So, what are you telling me with your number of 36.9 Mpps for
direct-socket-rxdrop...

Compared to XDP_DROP at 32.3Mpps, are you saying that it only costs
3.86 nanosec to call the XDP bpf_prog which returns XDP_DROP?  That is
very impressive actually: (1/32.3*10^3)-(1/36.9*10^3).

Compared to ZC-AF_XDP rxdrop at 21.5Mpps, are you saying the cost of
the XDP redirect infrastructure, map lookups etc (incl. the return-API
per frame) is 19.41 nanosec (1/21.5*10^3)-(1/36.9*10^3)?  That is
approx 40 clock-cycles or 100 (speculative) instructions, which is not
too bad, and we are still optimizing this stuff.


> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32.3M  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0

Overall I'm *very* impressed by the performance of ZC AF_XDP.
Just remember that measuring improvement in +N Mpps is actually
misleading when operating at these (light) speeds.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-16 14:28     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-16 14:28 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, willemdebruijn.kernel,
	daniel, mst, netdev, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan, brouer

On Tue, 15 May 2018 21:06:15 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, the zero-copy ndo is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This means that an XDP program
> cannot redirect to an AF_XDP enabled XDP Tx ring.

I've changed i40e1 to only have one queue via:
 $ ethtool -L i40e1 combined 1

And then, I'm sending on queue 1, which is/should not be avail... and then crash/BUG:

$ sudo taskset -c 2 ./xdpsock --tx --interface=i40e1 --queue=1

[ 3799.936877] Number of in use tx queues changed invalidating tc mappings. Priority traffic
 classification disabled!
[ 3799.972970] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 3799.980790] PGD 80000007b0992067 P4D 80000007b0992067 PUD 7b62d4067 PMD 0 
[ 3799.987654] Oops: 0002 [#1] PREEMPT SMP PTI
[ 3799.991831] Modules linked in: nf_nat_masquerade_ipv4 tun nfnetlink bridge stp llc nf_nat
 nf_conntrack rpcrdma ib_ipoib rdma_ucm ib_ucm ib_umad rdma_cm ib_cm iw_cm sunrpc mlx5_ib ib
_uverbs ib_core coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf p
cspkr i2c_i801 shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel i40e ml
x5_core hid_generic ixgbe igb devlink mdio ptp sd_mod i2c_algo_bit i2c_core pps_core [last u
nloaded: x_tables]
[ 3800.033472] CPU: 2 PID: 2006 Comm: xdpsock Not tainted 4.17.0-rc3-af_xdp03_ZC_rfc+ #155
[ 3800.041465] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[ 3800.048943] RIP: 0010:i40e_xmit_frame_ring+0xd4/0x1490 [i40e]
[ 3800.054683] RSP: 0018:ffffc9000407bcd0 EFLAGS: 00010293
[ 3800.059900] RAX: 0000000000000000 RBX: ffff88084f0fd200 RCX: 0000000000000000
[ 3800.067022] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8807b6e710c0
[ 3800.074148] RBP: ffff8807c6397800 R08: 00000000000000c0 R09: 0000000000000001
[ 3800.081270] R10: 0000000000000800 R11: 0000000000000010 R12: 0000000000000001
[ 3800.088396] R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000003c
[ 3800.095520] FS:  00007f1d1e00bb80(0000) GS:ffff88087fc80000(0000) knlGS:0000000000000000
[ 3800.103597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3800.109335] CR2: 0000000000000008 CR3: 000000087d542001 CR4: 00000000003606e0
[ 3800.116458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3800.123583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3800.130706] Call Trace:
[ 3800.133157]  ? validate_xmit_skb.isra.116+0x1c/0x270
[ 3800.138118]  dev_direct_xmit+0xec/0x1d0
[ 3800.141949]  xsk_sendmsg+0x1f4/0x380
[ 3800.145521]  sock_sendmsg+0x30/0x40
[ 3800.149005]  __sys_sendto+0x10e/0x140
[ 3800.152662]  ? __do_page_fault+0x283/0x500
[ 3800.156751]  __x64_sys_sendto+0x24/0x30
[ 3800.160585]  do_syscall_64+0x42/0xf0
[ 3800.164156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3800.169204] RIP: 0033:0x7f1d1d9db430
[ 3800.172774] RSP: 002b:00007fffb7278610 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
[ 3800.180333] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d1d9db430
[ 3800.187456] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[ 3800.194582] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 3800.201705] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000
[ 3800.208830] R13: 0000000000000000 R14: 0000000000755510 R15: 00007f1d0d3fc000
[ 3800.215953] Code: d0 0f 86 db 05 00 00 01 c8 0f b7 ca 29 c8 83 e8 01 39 c6 0f 8f ea 06 00 00 48 8b 45 28 48 8d 14 92 41 b9 01 00 00 00 4c 8d 2c d0 <49> 89 5d 08 8b 83 80 00 00 00 66 45 89 4d 14 41 89 45 10 0f b7 
[ 3800.234798] RIP: i40e_xmit_frame_ring+0xd4/0x1490 [i40e] RSP: ffffc9000407bcd0
[ 3800.242005] CR2: 0000000000000008
[ 3800.245320] ---[ end trace f169e36f468e0c59 ]---
[ 3801.726719] Kernel panic - not syncing: Fatal exception in interrupt
[ 3801.733097] Kernel Offset: disabled
[ 3801.785836] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
[ 3801.793403] ------------[ cut here ]------------

(gdb) list *(i40e_xmit_frame_ring)+0xd4
0x2ccd4 is in i40e_xmit_frame_ring (drivers/net/ethernet/intel/i40e/i40e_txrx.c:4048).
warning: Source file is more recent than executable.
4043			return NETDEV_TX_BUSY;
4044		}
4045	
4046		/* record the location of the first descriptor for this packet */
4047		first = &tx_ring->tx_bi[tx_ring->next_to_use];
4048		first->skb = skb;
4049		first->bytecount = skb->len;
4050		first->gso_segs = 1;
4051	
4052		/* prepare the xmit flags */


(gdb) list *(xsk_sendmsg)+0x1f4
0xffffffff81800c34 is in xsk_sendmsg (net/xdp/xsk.c:251).
warning: Source file is more recent than executable.
246			skb_shinfo(skb)->destructor_arg = (void *)(long)id;
247			skb->destructor = xsk_destruct_skb;
248	
249			err = dev_direct_xmit(skb, xs->queue_id);
250			/* Ignore NET_XMIT_CN as packet might have been sent */
251			if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
252				err = -EAGAIN;
253				/* SKB consumed by dev_direct_xmit() */
254				goto out;
255			}
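
Looks like nothing re-validates the queue id against the real number of
queues once the device has been reconfigured with ethtool -L. Something
along these lines in the xsk transmit path (sketch only; the exact
placement and the field names from struct xdp_sock are my assumption)
should at least turn this into an error instead of an oops:

        /* Reject a TX queue that no longer exists after reconfiguration */
        if (xs->queue_id >= xs->dev->real_num_tx_queues)
                return -ENXIO;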

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-16 14:28     ` [Intel-wired-lan] " Jesper Dangaard Brouer
@ 2018-05-16 14:38       ` Magnus Karlsson
  -1 siblings, 0 replies; 54+ messages in thread
From: Magnus Karlsson @ 2018-05-16 14:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, intel-wired-lan

On Wed, May 16, 2018 at 4:28 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 15 May 2018 21:06:15 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, the zero-copy ndo is implemented. As a shortcut, the existing
>> XDP Tx rings are used for zero-copy. This means that an XDP program
>> cannot redirect to an AF_XDP enabled XDP Tx ring.
>
> I've changed i40e1 to only have one queue via:
>  $ ethtool -L i40e1 combined 1
>
> And then, I'm sending on queue 1, which is/should not be avail... and then crash/BUG:
>
> $ sudo taskset -c 2 ./xdpsock --tx --interface=i40e1 --queue=1
>
> [ 3799.936877] Number of in use tx queues changed invalidating tc mappings. Priority traffic
>  classification disabled!
> [ 3799.972970] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [ 3799.980790] PGD 80000007b0992067 P4D 80000007b0992067 PUD 7b62d4067 PMD 0
> [ 3799.987654] Oops: 0002 [#1] PREEMPT SMP PTI
> [ 3799.991831] Modules linked in: nf_nat_masquerade_ipv4 tun nfnetlink bridge stp llc nf_nat
>  nf_conntrack rpcrdma ib_ipoib rdma_ucm ib_ucm ib_umad rdma_cm ib_cm iw_cm sunrpc mlx5_ib ib
> _uverbs ib_core coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf p
> cspkr i2c_i801 shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel i40e ml
> x5_core hid_generic ixgbe igb devlink mdio ptp sd_mod i2c_algo_bit i2c_core pps_core [last u
> nloaded: x_tables]
> [ 3800.033472] CPU: 2 PID: 2006 Comm: xdpsock Not tainted 4.17.0-rc3-af_xdp03_ZC_rfc+ #155
> [ 3800.041465] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
> [ 3800.048943] RIP: 0010:i40e_xmit_frame_ring+0xd4/0x1490 [i40e]
> [ 3800.054683] RSP: 0018:ffffc9000407bcd0 EFLAGS: 00010293
> [ 3800.059900] RAX: 0000000000000000 RBX: ffff88084f0fd200 RCX: 0000000000000000
> [ 3800.067022] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8807b6e710c0
> [ 3800.074148] RBP: ffff8807c6397800 R08: 00000000000000c0 R09: 0000000000000001
> [ 3800.081270] R10: 0000000000000800 R11: 0000000000000010 R12: 0000000000000001
> [ 3800.088396] R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000003c
> [ 3800.095520] FS:  00007f1d1e00bb80(0000) GS:ffff88087fc80000(0000) knlGS:0000000000000000
> [ 3800.103597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3800.109335] CR2: 0000000000000008 CR3: 000000087d542001 CR4: 00000000003606e0
> [ 3800.116458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3800.123583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3800.130706] Call Trace:
> [ 3800.133157]  ? validate_xmit_skb.isra.116+0x1c/0x270
> [ 3800.138118]  dev_direct_xmit+0xec/0x1d0
> [ 3800.141949]  xsk_sendmsg+0x1f4/0x380
> [ 3800.145521]  sock_sendmsg+0x30/0x40
> [ 3800.149005]  __sys_sendto+0x10e/0x140
> [ 3800.152662]  ? __do_page_fault+0x283/0x500
> [ 3800.156751]  __x64_sys_sendto+0x24/0x30
> [ 3800.160585]  do_syscall_64+0x42/0xf0
> [ 3800.164156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 3800.169204] RIP: 0033:0x7f1d1d9db430
> [ 3800.172774] RSP: 002b:00007fffb7278610 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
> [ 3800.180333] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d1d9db430
> [ 3800.187456] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
> [ 3800.194582] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 3800.201705] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000
> [ 3800.208830] R13: 0000000000000000 R14: 0000000000755510 R15: 00007f1d0d3fc000
> [ 3800.215953] Code: d0 0f 86 db 05 00 00 01 c8 0f b7 ca 29 c8 83 e8 01 39 c6 0f 8f ea 06 00 00 48 8b 45 28 48 8d 14 92 41 b9 01 00 00 00 4c 8d 2c d0 <49> 89 5d 08 8b 83 80 00 00 00 66 45 89 4d 14 41 89 45 10 0f b7
> [ 3800.234798] RIP: i40e_xmit_frame_ring+0xd4/0x1490 [i40e] RSP: ffffc9000407bcd0
> [ 3800.242005] CR2: 0000000000000008
> [ 3800.245320] ---[ end trace f169e36f468e0c59 ]---
> [ 3801.726719] Kernel panic - not syncing: Fatal exception in interrupt
> [ 3801.733097] Kernel Offset: disabled
> [ 3801.785836] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> [ 3801.793403] ------------[ cut here ]------------
>
> (gdb) list *(i40e_xmit_frame_ring)+0xd4
> 0x2ccd4 is in i40e_xmit_frame_ring (drivers/net/ethernet/intel/i40e/i40e_txrx.c:4048).
> warning: Source file is more recent than executable.
> 4043                    return NETDEV_TX_BUSY;
> 4044            }
> 4045
> 4046            /* record the location of the first descriptor for this packet */
> 4047            first = &tx_ring->tx_bi[tx_ring->next_to_use];
> 4048            first->skb = skb;
> 4049            first->bytecount = skb->len;
> 4050            first->gso_segs = 1;
> 4051
> 4052            /* prepare the xmit flags */
>
>
> (gdb) list *(xsk_sendmsg)+0x1f4
> 0xffffffff81800c34 is in xsk_sendmsg (net/xdp/xsk.c:251).
> warning: Source file is more recent than executable.
> 246                     skb_shinfo(skb)->destructor_arg = (void *)(long)id;
> 247                     skb->destructor = xsk_destruct_skb;
> 248
> 249                     err = dev_direct_xmit(skb, xs->queue_id);
> 250                     /* Ignore NET_XMIT_CN as packet might have been sent */
> 251                     if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
> 252                             err = -EAGAIN;
> 253                             /* SKB consumed by dev_direct_xmit() */
> 254                             goto out;
> 255                     }

Thanks Jesper for reporting. I will take a look at it.

/Magnus

* Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-16 14:38       ` [Intel-wired-lan] " Magnus Karlsson
@ 2018-05-16 15:38         ` Magnus Karlsson
  -1 siblings, 0 replies; 54+ messages in thread
From: Magnus Karlsson @ 2018-05-16 15:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, intel-wired-lan

On Wed, May 16, 2018 at 4:38 PM, Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
> On Wed, May 16, 2018 at 4:28 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> On Tue, 15 May 2018 21:06:15 +0200
>> Björn Töpel <bjorn.topel@gmail.com> wrote:
>>
>>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>>
>>> Here, the zero-copy ndo is implemented. As a shortcut, the existing
>>> XDP Tx rings are used for zero-copy. This means that an XDP program
>>> cannot redirect to an AF_XDP enabled XDP Tx ring.
>>
>> I've changed i40e1 to only have one queue via:
>>  $ ethtool -L i40e1 combined 1
>>
>> And then, I'm sending on queue 1, which is/should not be avail... and then crash/BUG:
>>
>> $ sudo taskset -c 2 ./xdpsock --tx --interface=i40e1 --queue=1
>>
>> [ 3799.936877] Number of in use tx queues changed invalidating tc mappings. Priority traffic
>>  classification disabled!
>> [ 3799.972970] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
>> [ 3799.980790] PGD 80000007b0992067 P4D 80000007b0992067 PUD 7b62d4067 PMD 0
>> [ 3799.987654] Oops: 0002 [#1] PREEMPT SMP PTI
>> [ 3799.991831] Modules linked in: nf_nat_masquerade_ipv4 tun nfnetlink bridge stp llc nf_nat
>>  nf_conntrack rpcrdma ib_ipoib rdma_ucm ib_ucm ib_umad rdma_cm ib_cm iw_cm sunrpc mlx5_ib ib
>> _uverbs ib_core coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf p
>> cspkr i2c_i801 shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel i40e ml
>> x5_core hid_generic ixgbe igb devlink mdio ptp sd_mod i2c_algo_bit i2c_core pps_core [last u
>> nloaded: x_tables]
>> [ 3800.033472] CPU: 2 PID: 2006 Comm: xdpsock Not tainted 4.17.0-rc3-af_xdp03_ZC_rfc+ #155
>> [ 3800.041465] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
>> [ 3800.048943] RIP: 0010:i40e_xmit_frame_ring+0xd4/0x1490 [i40e]
>> [ 3800.054683] RSP: 0018:ffffc9000407bcd0 EFLAGS: 00010293
>> [ 3800.059900] RAX: 0000000000000000 RBX: ffff88084f0fd200 RCX: 0000000000000000
>> [ 3800.067022] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8807b6e710c0
>> [ 3800.074148] RBP: ffff8807c6397800 R08: 00000000000000c0 R09: 0000000000000001
>> [ 3800.081270] R10: 0000000000000800 R11: 0000000000000010 R12: 0000000000000001
>> [ 3800.088396] R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000003c
>> [ 3800.095520] FS:  00007f1d1e00bb80(0000) GS:ffff88087fc80000(0000) knlGS:0000000000000000
>> [ 3800.103597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 3800.109335] CR2: 0000000000000008 CR3: 000000087d542001 CR4: 00000000003606e0
>> [ 3800.116458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 3800.123583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 3800.130706] Call Trace:
>> [ 3800.133157]  ? validate_xmit_skb.isra.116+0x1c/0x270
>> [ 3800.138118]  dev_direct_xmit+0xec/0x1d0
>> [ 3800.141949]  xsk_sendmsg+0x1f4/0x380
>> [ 3800.145521]  sock_sendmsg+0x30/0x40
>> [ 3800.149005]  __sys_sendto+0x10e/0x140
>> [ 3800.152662]  ? __do_page_fault+0x283/0x500
>> [ 3800.156751]  __x64_sys_sendto+0x24/0x30
>> [ 3800.160585]  do_syscall_64+0x42/0xf0
>> [ 3800.164156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 3800.169204] RIP: 0033:0x7f1d1d9db430
>> [ 3800.172774] RSP: 002b:00007fffb7278610 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
>> [ 3800.180333] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d1d9db430
>> [ 3800.187456] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
>> [ 3800.194582] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [ 3800.201705] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000
>> [ 3800.208830] R13: 0000000000000000 R14: 0000000000755510 R15: 00007f1d0d3fc000
>> [ 3800.215953] Code: d0 0f 86 db 05 00 00 01 c8 0f b7 ca 29 c8 83 e8 01 39 c6 0f 8f ea 06 00 00 48 8b 45 28 48 8d 14 92 41 b9 01 00 00 00 4c 8d 2c d0 <49> 89 5d 08 8b 83 80 00 00 00 66 45 89 4d 14 41 89 45 10 0f b7
>> [ 3800.234798] RIP: i40e_xmit_frame_ring+0xd4/0x1490 [i40e] RSP: ffffc9000407bcd0
>> [ 3800.242005] CR2: 0000000000000008
>> [ 3800.245320] ---[ end trace f169e36f468e0c59 ]---
>> [ 3801.726719] Kernel panic - not syncing: Fatal exception in interrupt
>> [ 3801.733097] Kernel Offset: disabled
>> [ 3801.785836] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>> [ 3801.793403] ------------[ cut here ]------------
>>
>> (gdb) list *(i40e_xmit_frame_ring)+0xd4
>> 0x2ccd4 is in i40e_xmit_frame_ring (drivers/net/ethernet/intel/i40e/i40e_txrx.c:4048).
>> warning: Source file is more recent than executable.
>> 4043                    return NETDEV_TX_BUSY;
>> 4044            }
>> 4045
>> 4046            /* record the location of the first descriptor for this packet */
>> 4047            first = &tx_ring->tx_bi[tx_ring->next_to_use];
>> 4048            first->skb = skb;
>> 4049            first->bytecount = skb->len;
>> 4050            first->gso_segs = 1;
>> 4051
>> 4052            /* prepare the xmit flags */
>>
>>
>> (gdb) list *(xsk_sendmsg)+0x1f4
>> 0xffffffff81800c34 is in xsk_sendmsg (net/xdp/xsk.c:251).
>> warning: Source file is more recent than executable.
>> 246                     skb_shinfo(skb)->destructor_arg = (void *)(long)id;
>> 247                     skb->destructor = xsk_destruct_skb;
>> 248
>> 249                     err = dev_direct_xmit(skb, xs->queue_id);
>> 250                     /* Ignore NET_XMIT_CN as packet might have been sent */
>> 251                     if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
>> 252                             err = -EAGAIN;
>> 253                             /* SKB consumed by dev_direct_xmit() */
>> 254                             goto out;
>> 255                     }
>

Found it. Checked num_rx_queues in the xsk_bind code instead of
real_num_rx_queues. The code below will solve the problem. Will post a
proper patch for it tomorrow. Thanks Jesper for reporting this.
Appreciated.

/Magnus

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ac97902..4b62a1e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -391,7 +391,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
                goto out_unlock;
        }

-       if (sxdp->sxdp_queue_id >= dev->num_rx_queues) {
+       if ((xs->rx && sxdp->sxdp_queue_id >= dev->real_num_rx_queues) ||
+           (xs->tx && sxdp->sxdp_queue_id >= dev->real_num_tx_queues)) {
                err = -EINVAL;
                goto out_unlock;
        }
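
To illustrate why the old check let this Tx bind through, here is a
stand-alone C sketch (illustration only: "struct fake_dev" and the value 64
are made-up stand-ins for an i40e that allocated 64 queue pairs at probe time
and was then reduced with "ethtool -L i40e1 combined 1"; the real fields live
in struct net_device):

#include <stdbool.h>
#include <stdio.h>

/* num_rx_queues is what the driver allocated at probe time,
 * real_num_tx_queues is what is currently configured for Tx. */
struct fake_dev {
        unsigned int num_rx_queues;
        unsigned int real_num_tx_queues;
};

static bool old_check_rejects(const struct fake_dev *dev, unsigned int qid)
{
        /* the old xsk_bind check, used even for a Tx-only socket */
        return qid >= dev->num_rx_queues;
}

static bool new_check_rejects_tx(const struct fake_dev *dev, unsigned int qid)
{
        /* the fixed check for a socket with a Tx ring */
        return qid >= dev->real_num_tx_queues;
}

int main(void)
{
        /* example state after "ethtool -L i40e1 combined 1" */
        struct fake_dev dev = { .num_rx_queues = 64, .real_num_tx_queues = 1 };

        /* prints 0: the old check accepts qid 1, whose Tx ring no longer exists */
        printf("old check rejects qid 1: %d\n", old_check_rejects(&dev, 1));
        /* prints 1: with the fix, bind() fails with -EINVAL instead */
        printf("new check rejects qid 1: %d\n", new_check_rejects_tx(&dev, 1));
        return 0;
}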

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
  2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-16 17:04   ` Alexei Starovoitov
  -1 siblings, 0 replies; 54+ messages in thread
From: Alexei Starovoitov @ 2018-05-16 17:04 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev,
	Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

On Tue, May 15, 2018 at 09:06:03PM +0200, Björn Töpel wrote:
> 
> Alexei had two concerns in conjunction with adding ZC support to
> AF_XDP: show that the user interface holds and can deliver good
> performance for ZC and that the driver interfaces for ZC are good. We
> think that this patch set shows that we have addressed the first
> issue: performance is good and there is no change to the uapi. But
> please take a look at the code and see if you like the ZC interfaces
> that was the second concern.

Looks like we're not on the same page with definition of 'uapi'.
Here you're saying that patches demonstrate performance without
a change to uapi, whereas patch 1 does remove rebind support
which _is_ a change to uapi.
That was exactly my concern with the previous set.

The other restrictions that are introduced in this patch set
are actually ok:
- like in patch 12: 'no redirect to an AF_XDP enabled XDP Tx ring'
  this is fine, since this restriction can be lifted later without
  breaking uapi
- patch 11: 'No passing to the stack via XDP_PASS'
  also fine, since can be addressed later.

> To do for this RFC to become a patch set:
> 
> * Implement dynamic creation and deletion of queues in the i40e driver

can be deferred, no?

> * Properly splitting up the i40e changes

Imo patch 11 and 12 are reasonable in terms of size
and reviewable as-is. I don't think they have to be split.
Would be nice though.
 
> * Have the Intel NIC team review the i40e changes from at least an
>   architecture point of view

As Alexander pointed out in patch 11, if you split it into
separate file the changes to i40e core become pretty small and
I think Intel folks (Jeff, Alexander, ...) will be ok if we merge
this set via bpf-next tree asap and clean up, refactor, share
more code with i40e core later.

> * Implement a more fair scheduling policy for multiple XSKs that share
>   an umem for TX. This can be combined with a batching API for
>   xsk_umem_consume_tx.

can be deferred too?

I think the first 10 patches in this set are a hard dependency for the i40e
patches, so the whole thing has to be reviewed and landed together.
Maybe the first 5 patches can be applied already.

Anyway, at this point I still think that removing AF_XDP and bpf xskmap
from uapi is necessary before the merge window, unless this patch set
(including the i40e changes) can land right now.
Also, I'd like to see another NIC vendor demonstrating an RFC for ZC as well.
The allocator api looks good and I don't anticipate issues, but still
I think it's necessary to make sure that we're not adding an i40e-only feature.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
  2018-05-16 17:04   ` [Intel-wired-lan] " Alexei Starovoitov
@ 2018-05-16 17:49     ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-16 17:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Magnus Karlsson, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Michael S. Tsirkin, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z, intel-wired-lan

2018-05-16 19:04 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
> On Tue, May 15, 2018 at 09:06:03PM +0200, Björn Töpel wrote:
>>
>> Alexei had two concerns in conjunction with adding ZC support to
>> AF_XDP: show that the user interface holds and can deliver good
>> performance for ZC and that the driver interfaces for ZC are good. We
>> think that this patch set shows that we have addressed the first
>> issue: performance is good and there is no change to the uapi. But
>> please take a look at the code and see if you like the ZC interfaces
>> that was the second concern.
>
> Looks like we're not on the same page with definition of 'uapi'.
> Here you're saying that patches demonstrate performance without
> a change to uapi, whereas patch 1 does remove rebind support
> which _is_ a change to uapi.
> That was exactly my concern with the previous set.
>

Good point. We did realize it was a uapi break, and intended to add
the "disable rebind" as a follow-up in this merge window (honestly!
;-)), but still -- this proves your point that the ZC patches should
be done back-to-back with the non-ZC ones.

> The other restrictions that are introduced in this patch set
> are actually ok:
> - like in patch 12: 'no redirect to an AF_XDP enabled XDP Tx ring'
>   this is fine, since this restriction can be lifted later without
>   breaking uapi
> - patch 11: 'No passing to the stack via XDP_PASS'
>   also fine, since can be addressed later.
>
>> To do for this RFC to become a patch set:
>>
>> * Implement dynamic creation and deletion of queues in the i40e driver
>
> can be deferred, no?
>

Well it *could*, but that combined with the whole "bolted on Rx path"
isn't something I'd like to have upstream. It needs more work, and is
too messy and fragile IMO.

>> * Properly splitting up the i40e changes
>
> Imo patch 11 and 12 are reasonable in terms of size
> and reviewable as-is. I don't think they have to be split.
> Would be nice though.
>
>> * Have the Intel NIC team review the i40e changes from at least an
>>   architecture point of view
>
> As Alexander pointed out in patch 11, if you split it into
> separate file the changes to i40e core become pretty small and
> I think Intel folks (Jeff, Alexander, ...) will be ok if we merge
> this set via bpf-next tree asap and clean up, refactor, share
> more code with i40e core later.
>

Hmm...

>> * Implement a more fair scheduling policy for multiple XSKs that share
>>   an umem for TX. This can be combined with a batching API for
>>   xsk_umem_consume_tx.
>
> can be deferred too?
>

Yes.

> I think the first 10 patches in this set is a hard dependency on i40e
> patches, so the whole thing have to reviewed and landed together.
> May be the first 5 patches can be applied already.
>
> Anyway at this point I still think that removing AF_XDP and bpf xskmap
> from uapi is necessary before the merge window, unless this patch set
> (including i40e changes can land right now).
> Also I'd like to see another NIC vendor demonstrating RFC for ZC as well.
> The allocator api looks good and I don't anticipate issues, but still
> I think it's necessary to make sure that we're not adding i40e-only feature.
>

Again, fair point. We think the copy-path is generic enough (with the
follow-ups you and Daniel suggested and the rebind state removed) --
but hey, we're that one vendor. ;-) More seriously -- having at least
two ZC implementations at the introduction of AF_XDP would make us
happier as well.


Thanks,
Björn

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Intel-wired-lan] [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
  2018-05-16 17:04   ` [Intel-wired-lan] " Alexei Starovoitov
@ 2018-05-16 18:14     ` Jeff Kirsher
  -1 siblings, 0 replies; 54+ messages in thread
From: Jeff Kirsher @ 2018-05-16 18:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Björn Töpel
  Cc: willemdebruijn.kernel, daniel, ast, netdev, qi.z.zhang, mst,
	michael.lundkvist, intel-wired-lan, brouer,
	Björn Töpel, magnus.karlsson, magnus.karlsson

On Wed, 2018-05-16 at 10:04 -0700, Alexei Starovoitov wrote:
> On Tue, May 15, 2018 at 09:06:03PM +0200, Björn Töpel wrote:
> > 
> > Alexei had two concerns in conjunction with adding ZC support to
> > AF_XDP: show that the user interface holds and can deliver good
> > performance for ZC and that the driver interfaces for ZC are good.
> > We
> > think that this patch set shows that we have addressed the first
> > issue: performance is good and there is no change to the uapi. But
> > please take a look at the code and see if you like the ZC
> > interfaces
> > that was the second concern.
> 
> Looks like we're not on the same page with definition of 'uapi'.
> Here you're saying that patches demonstrate performance without
> a change to uapi, whereas patch 1 does remove rebind support
> which _is_ a change to uapi.
> That was exactly my concern with the previous set.
> 
> The other restrictions that are introduced in this patch set
> are actually ok:
> - like in patch 12: 'no redirect to an AF_XDP enabled XDP Tx ring'
>   this is fine, since this restriction can be lifted later without
>   breaking uapi
> - patch 11: 'No passing to the stack via XDP_PASS'
>   also fine, since can be addressed later.
> 
> > To do for this RFC to become a patch set:
> > 
> > * Implement dynamic creation and deletion of queues in the i40e
> > driver
> 
> can be deferred, no?
> 
> > * Properly splitting up the i40e changes
> 
> Imo patch 11 and 12 are reasonable in terms of size
> and reviewable as-is. I don't think they have to be split.
> Would be nice though.
>  
> > * Have the Intel NIC team review the i40e changes from at least an
> >   architecture point of view
> 
> As Alexander pointed out in patch 11, if you split it into
> separate file the changes to i40e core become pretty small and
> I think Intel folks (Jeff, Alexander, ...) will be ok if we merge
> this set via bpf-next tree asap and clean up, refactor, share
> more code with i40e core later.

I am fine with the i40e changes in this series going through the BPF tree,
since the majority of the series is BPF changes.  We just need to address
Alex's comments on patches 11 & 12.

I only have 1-2 patches currently in my queue against i40e and they are
not affected by the changes in this series, so Dave should not have any
merge issues when pulling.

> > * Implement a more fair scheduling policy for multiple XSKs that
> > share
> >   an umem for TX. This can be combined with a batching API for
> >   xsk_umem_consume_tx.
> 
> can be deferred too?
> 
> I think the first 10 patches in this set is a hard dependency on i40e
> patches, so the whole thing have to reviewed and landed together.
> May be the first 5 patches can be applied already.
> 
> Anyway at this point I still think that removing AF_XDP and bpf
> xskmap
> from uapi is necessary before the merge window, unless this patch set
> (including i40e changes can land right now).
> Also I'd like to see another NIC vendor demonstrating RFC for ZC as
> well.
> The allocator api looks good and I don't anticipate issues, but still
> I think it's necessary to make sure that we're not adding i40e-only
> feature.
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-16 15:38         ` [Intel-wired-lan] " Magnus Karlsson
@ 2018-05-16 18:53           ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-16 18:53 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, intel-wired-lan, brouer

On Wed, 16 May 2018 17:38:12 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Wed, May 16, 2018 at 4:38 PM, Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> > On Wed, May 16, 2018 at 4:28 PM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> >> On Tue, 15 May 2018 21:06:15 +0200
> >> Björn Töpel <bjorn.topel@gmail.com> wrote:
> >>  
> >>> From: Magnus Karlsson <magnus.karlsson@intel.com>
> >>>
> >>> Here, the zero-copy ndo is implemented. As a shortcut, the existing
> >>> XDP Tx rings are used for zero-copy. This means that an XDP program
> >>> cannot redirect to an AF_XDP enabled XDP Tx ring.  
> >>
> >> I've changed i40e1 to only have one queue via:
> >>  $ ethtool -L i40e1 combined 1
> >>
> >> And then, I'm sending on queue 1, which is/should not be avail... and then crash/BUG:
> >>
> >> $ sudo taskset -c 2 ./xdpsock --tx --interface=i40e1 --queue=1
> >>
> >> [ 3799.936877] Number of in use tx queues changed invalidating tc mappings. Priority traffic
> >>  classification disabled!
> >> [ 3799.972970] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> >> [ 3799.980790] PGD 80000007b0992067 P4D 80000007b0992067 PUD 7b62d4067 PMD 0
> >> [ 3799.987654] Oops: 0002 [#1] PREEMPT SMP PTI
> >> [ 3799.991831] Modules linked in: nf_nat_masquerade_ipv4 tun nfnetlink bridge stp llc nf_nat
> >>  nf_conntrack rpcrdma ib_ipoib rdma_ucm ib_ucm ib_umad rdma_cm ib_cm iw_cm sunrpc mlx5_ib ib
> >> _uverbs ib_core coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf p
> >> cspkr i2c_i801 shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel i40e ml
> >> x5_core hid_generic ixgbe igb devlink mdio ptp sd_mod i2c_algo_bit i2c_core pps_core [last u
> >> nloaded: x_tables]
> >> [ 3800.033472] CPU: 2 PID: 2006 Comm: xdpsock Not tainted 4.17.0-rc3-af_xdp03_ZC_rfc+ #155
> >> [ 3800.041465] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
> >> [ 3800.048943] RIP: 0010:i40e_xmit_frame_ring+0xd4/0x1490 [i40e]
> >> [ 3800.054683] RSP: 0018:ffffc9000407bcd0 EFLAGS: 00010293
> >> [ 3800.059900] RAX: 0000000000000000 RBX: ffff88084f0fd200 RCX: 0000000000000000
> >> [ 3800.067022] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8807b6e710c0
> >> [ 3800.074148] RBP: ffff8807c6397800 R08: 00000000000000c0 R09: 0000000000000001
> >> [ 3800.081270] R10: 0000000000000800 R11: 0000000000000010 R12: 0000000000000001
> >> [ 3800.088396] R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000003c
> >> [ 3800.095520] FS:  00007f1d1e00bb80(0000) GS:ffff88087fc80000(0000) knlGS:0000000000000000
> >> [ 3800.103597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 3800.109335] CR2: 0000000000000008 CR3: 000000087d542001 CR4: 00000000003606e0
> >> [ 3800.116458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [ 3800.123583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> [ 3800.130706] Call Trace:
> >> [ 3800.133157]  ? validate_xmit_skb.isra.116+0x1c/0x270
> >> [ 3800.138118]  dev_direct_xmit+0xec/0x1d0
> >> [ 3800.141949]  xsk_sendmsg+0x1f4/0x380
> >> [ 3800.145521]  sock_sendmsg+0x30/0x40
> >> [ 3800.149005]  __sys_sendto+0x10e/0x140
> >> [ 3800.152662]  ? __do_page_fault+0x283/0x500
> >> [ 3800.156751]  __x64_sys_sendto+0x24/0x30
> >> [ 3800.160585]  do_syscall_64+0x42/0xf0
> >> [ 3800.164156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [ 3800.169204] RIP: 0033:0x7f1d1d9db430
> >> [ 3800.172774] RSP: 002b:00007fffb7278610 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
> >> [ 3800.180333] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d1d9db430
> >> [ 3800.187456] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
> >> [ 3800.194582] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> >> [ 3800.201705] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000
> >> [ 3800.208830] R13: 0000000000000000 R14: 0000000000755510 R15: 00007f1d0d3fc000
> >> [ 3800.215953] Code: d0 0f 86 db 05 00 00 01 c8 0f b7 ca 29 c8 83 e8 01 39 c6 0f 8f ea 06 00 00 48 8b 45 28 48 8d 14 92 41 b9 01 00 00 00 4c 8d 2c d0 <49> 89 5d 08 8b 83 80 00 00 00 66 45 89 4d 14 41 89 45 10 0f b7
> >> [ 3800.234798] RIP: i40e_xmit_frame_ring+0xd4/0x1490 [i40e] RSP: ffffc9000407bcd0
> >> [ 3800.242005] CR2: 0000000000000008
> >> [ 3800.245320] ---[ end trace f169e36f468e0c59 ]---
> >> [ 3801.726719] Kernel panic - not syncing: Fatal exception in interrupt
> >> [ 3801.733097] Kernel Offset: disabled
> >> [ 3801.785836] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> >> [ 3801.793403] ------------[ cut here ]------------
> >>
> >> (gdb) list *(i40e_xmit_frame_ring)+0xd4
> >> 0x2ccd4 is in i40e_xmit_frame_ring (drivers/net/ethernet/intel/i40e/i40e_txrx.c:4048).
> >> warning: Source file is more recent than executable.
> >> 4043                    return NETDEV_TX_BUSY;
> >> 4044            }
> >> 4045
> >> 4046            /* record the location of the first descriptor for this packet */
> >> 4047            first = &tx_ring->tx_bi[tx_ring->next_to_use];
> >> 4048            first->skb = skb;
> >> 4049            first->bytecount = skb->len;
> >> 4050            first->gso_segs = 1;
> >> 4051
> >> 4052            /* prepare the xmit flags */
> >>
> >>
> >> (gdb) list *(xsk_sendmsg)+0x1f4
> >> 0xffffffff81800c34 is in xsk_sendmsg (net/xdp/xsk.c:251).
> >> warning: Source file is more recent than executable.
> >> 246                     skb_shinfo(skb)->destructor_arg = (void *)(long)id;
> >> 247                     skb->destructor = xsk_destruct_skb;
> >> 248
> >> 249                     err = dev_direct_xmit(skb, xs->queue_id);
> >> 250                     /* Ignore NET_XMIT_CN as packet might have been sent */
> >> 251                     if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
> >> 252                             err = -EAGAIN;
> >> 253                             /* SKB consumed by dev_direct_xmit() */
> >> 254                             goto out;
> >> 255                     }  
> >  
> 
> Found it. Checked num_rx_queues in the xsk_bind code instead of
> real_num_rx_queues. The code below will solve the problem. Will post a
> proper patch for it tomorrow. Thanks Jesper for reporting this.
> Appreciated.
> 
> /Magnus
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index ac97902..4b62a1e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -391,7 +391,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>                 goto out_unlock;
>         }
> 
> -       if (sxdp->sxdp_queue_id >= dev->num_rx_queues) {
> +       if ((xs->rx && sxdp->sxdp_queue_id >= dev->real_num_rx_queues) ||
> +           (xs->tx && sxdp->sxdp_queue_id >= dev->real_num_tx_queues)) {
>                 err = -EINVAL;
>                 goto out_unlock;
>         }

Tried this patch... it fixed/caught the problem :-)

$ sudo ./xdpsock --interface=i40e1 --queue=42 --txonly
samples/bpf/xdpsock_user.c:xsk_configure:528: Assertion failed: bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0: errno: 22/"Invalid argument"
Segmentation fault

You can add:
 Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>

Notice: this bug is not related to this zero-copy patch, but to your
previous patchset, which is in bpf-next.  Thus, you need to send a fix
patch to bpf-next...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY
  2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-17  5:57     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-17  5:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, willemdebruijn.kernel,
	daniel, mst, netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang, intel-wired-lan,
	brouer

On Tue, 15 May 2018 21:06:08 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> @@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>  	int metasize;
>  	int headroom;
>  
> +	// XXX implement clone, copy, use "native" MEM_TYPE
> +	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> +		return NULL;
> +

There are going to be significant tradeoffs between the AF_XDP zero-copy and
copy variants.  The copy variant still has very attractive
RX performance, and other benefits like not exposing unrelated packets
to userspace (but limiting these to the XDP filter).

Thus, as a user I would like to choose between the AF_XDP zero-copy and
copy variants. Even if my NIC supports zero-copy, I may be interested in
only enabling the copy variant. This patchset doesn't let me choose.

How do we expose this to userspace?
(Maybe as simple as an sockaddr_xdp->sxdp_flags flag?)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY
  2018-05-17  5:57     ` [Intel-wired-lan] " Jesper Dangaard Brouer
@ 2018-05-17  7:08       ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-17  7:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, intel-wired-lan

2018-05-17 7:57 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Tue, 15 May 2018 21:06:08 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> @@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>>       int metasize;
>>       int headroom;
>>
>> +     // XXX implement clone, copy, use "native" MEM_TYPE
>> +     if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
>> +             return NULL;
>> +
>
> There is going to be significant tradeoffs between AF_XDP zero-copy and
> copy-variant.  The copy-variant, still have very attractive
> RX-performance, and other benefits like no exposing unrelated packets
> to userspace (but limit these to the XDP filter).
>
> Thus, as a user I would like to choose between AF_XDP zero-copy and
> copy-variant. Even if my NIC support zero-copy, I can be interested in
> only enabling the copy-variant. This patchset doesn't let me choose.
>
> How do we expose this to userspace?
> (Maybe as simple as an sockaddr_xdp->sxdp_flags flag?)
>

We planned to add these flags later, but I think you're right that
it's better to do that right away.

If we try to follow the behavior of the XDP netlink interface: pick
the "best mode" when there are no flags. A user would also like to
"force" a mode -- meaning that you select, say, copy, and get an
error if that's not supported. Four new flags?

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 77b88c4efe98..ce1f710847b7 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -22,7 +22,11 @@
 #include <linux/types.h>

 /* Options for the sxdp_flags field */
-#define XDP_SHARED_UMEM 1
+#define XDP_SHARED_UMEM        (1U << 0)
+#define XDP_COPY_TX_UMEM    (1U << 1)
+#define XDP_ZEROCOPY_TX_UMEM    (1U << 2)
+#define XDP_COPY_RX_UMEM    (1U << 3)
+#define XDP_ZEROCOPY_RX_UMEM    (1U << 4)

 struct sockaddr_xdp {
     __u16 sxdp_family;

A better way?
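
For illustration, usage with flags like these could look as follows (a
sketch only: the XDP_COPY_*/XDP_ZEROCOPY_* names exist only in the diff
above, and the helper is hypothetical; sxdp_family, sxdp_ifindex,
sxdp_queue_id and sxdp_flags are the existing sockaddr_xdp fields):

#include <sys/socket.h>
#include <net/if.h>
#include <linux/if_xdp.h>

/* Force the copy variant for both Rx and Tx on a given queue.  With
 * "force" semantics, bind() would fail if the driver cannot honour the
 * requested mode instead of silently falling back to another one. */
static int xsk_bind_copy_only(int fd, const char *ifname, __u32 queue_id)
{
        struct sockaddr_xdp sxdp = {
                .sxdp_family   = AF_XDP,
                .sxdp_ifindex  = if_nametoindex(ifname),
                .sxdp_queue_id = queue_id,
                .sxdp_flags    = XDP_COPY_RX_UMEM | XDP_COPY_TX_UMEM,
        };

        return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}

With no flags set, the kernel would keep picking the best mode the driver
supports, as described above.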




> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY
  2018-05-17  7:08       ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-17  7:09         ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-17  7:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, intel-wired-lan

2018-05-17 9:08 GMT+02:00 Björn Töpel <bjorn.topel@gmail.com>:
> 2018-05-17 7:57 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>> On Tue, 15 May 2018 21:06:08 +0200
>> Björn Töpel <bjorn.topel@gmail.com> wrote:
>>
>>> @@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>>>       int metasize;
>>>       int headroom;
>>>
>>> +     // XXX implement clone, copy, use "native" MEM_TYPE
>>> +     if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
>>> +             return NULL;
>>> +
>>
>> There are going to be significant tradeoffs between the AF_XDP zero-copy
>> and copy variants.  The copy variant still has very attractive
>> RX performance, and other benefits, like not exposing unrelated packets
>> to userspace (exposure is limited to what the XDP filter selects).
>>
>> Thus, as a user I would like to choose between the AF_XDP zero-copy and
>> copy variants. Even if my NIC supports zero-copy, I may be interested in
>> enabling only the copy variant. This patchset doesn't let me choose.
>>
>> How do we expose this to userspace?
>> (Maybe as simple as a sockaddr_xdp->sxdp_flags flag?)
>>
>
> We planned to add these flags later, but I think you're right that
> it's better to do that right away.
>
> If we try to follow the behavior of the XDP netlink interface: pick
> the "best mode" when no flags are given, and let a user "force" a
> mode -- meaning that you select, say, copy, and get an error if
> that's not supported. Four new flags?
>
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 77b88c4efe98..ce1f710847b7 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -22,7 +22,11 @@
>  #include <linux/types.h>
>
>  /* Options for the sxdp_flags field */
> -#define XDP_SHARED_UMEM 1
> +#define XDP_SHARED_UMEM        (1U << 0)
> +#define XDP_COPY_TX_UMEM    (1U << 1)
> +#define XDP_ZEROCOPY_TX_UMEM    (1U << 2)
> +#define XDP_COPY_RX_UMEM    (1U << 3)
> +#define XDP_ZEROCOPY_RX_UMEM    (1U << 4)
>
>  struct sockaddr_xdp {
>      __u16 sxdp_family;
>
> A better way?
>

...but without the _UMEM suffix obviously.

>
>
>
>> --
>> Best regards,
>>   Jesper Dangaard Brouer
>>   MSc.CS, Principal Kernel Engineer at Red Hat
>>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
@ 2018-05-17 21:31     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-17 21:31 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, willemdebruijn.kernel,
	daniel, mst, netdev, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan, brouer


On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel <bjorn.topel@gmail.com> wrote:

> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, the zero-copy ndo is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This means that an XDP program
> cannot redirect to an AF_XDP enabled XDP Tx ring.

This "shortcut" is not acceptable, and completely broken.  The
XDP_REDIRECT queue_index is based on smp_processor_id(), and can easily
clash with the configured XSK queue_index.  Provided a bit more code
context below...

On Tue, 15 May 2018 21:06:15 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
{
	struct i40e_netdev_priv *np = netdev_priv(dev);
	unsigned int queue_index = smp_processor_id();
	struct i40e_vsi *vsi = np->vsi;
	int err;

	if (test_bit(__I40E_VSI_DOWN, vsi->state))
		return -ENETDOWN;

> @@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
>  	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>  		return -ENXIO;
>  
> +	if (vsi->xdp_rings[queue_index]->xsk_umem)
> +		return -ENXIO;
> +

Using the same errno makes this impossible to debug (via the tracepoints).

>  	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
>  	if (err != I40E_XDP_TX)
>  		return -ENOSPC;
> @@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
>  	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>  		return;
>  
> +	if (vsi->xdp_rings[queue_index]->xsk_umem)
> +		return;
> +
>  	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>  }

	

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
  2018-05-17 21:31     ` [Intel-wired-lan] " Jesper Dangaard Brouer
@ 2018-05-18  4:23       ` Björn Töpel
  -1 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-05-18  4:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z, intel-wired-lan

2018-05-17 23:31 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, the zero-copy ndo is implemented. As a shortcut, the existing
>> XDP Tx rings are used for zero-copy. This means that an XDP program
>> cannot redirect to an AF_XDP enabled XDP Tx ring.
>
> This "shortcut" is not acceptable, and completely broken.  The
> XDP_REDIRECT queue_index is based on smp_processor_id(), and can easily
> clash with the configured XSK queue_index.  Provided a bit more code
> context below...
>

Yes, and this is the reason we need to go for a solution with
dedicated Tx rings. Again, we chose not to, and simply drop
XDP_REDIRECT where the AF_XDP queue id clashes with the processor id,
since that queue id is hijacked by AF_XDP's egress side.
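
A rough sketch of the dedicated-ring direction (illustrative only -- the
xsk_tx_rings[] array below is a hypothetical field, not something this
series adds):

	/* XDP_REDIRECT keeps its per-CPU ring, while the XSK egress path
	 * gets its own array indexed by the bound queue id, so the two
	 * users can no longer collide.
	 */
	static struct i40e_ring *i40e_xdp_tx_ring(struct i40e_vsi *vsi)
	{
		return vsi->xdp_rings[smp_processor_id()];
	}

	static struct i40e_ring *i40e_xsk_tx_ring(struct i40e_vsi *vsi, u16 qid)
	{
		return vsi->xsk_tx_rings[qid];	/* hypothetical dedicated rings */
	}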

> On Tue, 15 May 2018 21:06:15 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
> {
>         struct i40e_netdev_priv *np = netdev_priv(dev);
>         unsigned int queue_index = smp_processor_id();
>         struct i40e_vsi *vsi = np->vsi;
>         int err;
>
>         if (test_bit(__I40E_VSI_DOWN, vsi->state))
>                 return -ENETDOWN;
>
>> @@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
>>       if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>>               return -ENXIO;
>>
>> +     if (vsi->xdp_rings[queue_index]->xsk_umem)
>> +             return -ENXIO;
>> +
>
> Using the same errno makes this impossible to debug (via the tracepoints).
>

The rationale was that the situation was similar to an incorrectly
configured receiving interface (from an XDP_REDIRECT perspective).
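
For the errno part, a minimal sketch of one option -- assuming -EBUSY is
an acceptable choice here, which is my pick and not something this series
uses -- so the redirect tracepoints can tell the two failure cases apart:

	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
		return -ENXIO;	/* no XDP Tx ring for this queue */

	if (vsi->xdp_rings[queue_index]->xsk_umem)
		return -EBUSY;	/* ring claimed by AF_XDP zero-copy egress */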

We'll rework this! Thanks for looking into this, Jesper!


Björn

>>       err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
>>       if (err != I40E_XDP_TX)
>>               return -ENOSPC;
>> @@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
>>       if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>>               return;
>>
>> +     if (vsi->xdp_rings[queue_index]->xsk_umem)
>> +             return;
>> +
>>       i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>>  }
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2018-05-18  4:23 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-15 19:06 [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support Björn Töpel
2018-05-15 19:06 ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 01/12] xsk: remove rebind support Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 02/12] xsk: moved struct xdp_umem definition Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 03/12] xsk: introduce xdp_umem_frame Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 04/12] net: xdp: added bpf_netdev_command XDP_SETUP_XSK_UMEM Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-17  5:57   ` Jesper Dangaard Brouer
2018-05-17  5:57     ` [Intel-wired-lan] " Jesper Dangaard Brouer
2018-05-17  7:08     ` Björn Töpel
2018-05-17  7:08       ` [Intel-wired-lan] " Björn Töpel
2018-05-17  7:09       ` Björn Töpel
2018-05-17  7:09         ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 06/12] xsk: add zero-copy support for Rx Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 07/12] net: added netdevice operation for Tx Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 08/12] xsk: wire upp Tx zero-copy functions Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 09/12] samples/bpf: minor *_nb_free performance fix Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 10/12] i40e: added queue pair disable/enable functions Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 19:06 ` [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-15 20:25   ` Alexander Duyck
2018-05-15 20:25     ` [Intel-wired-lan] " Alexander Duyck
2018-05-15 19:06 ` [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy Björn Töpel
2018-05-15 19:06   ` [Intel-wired-lan] " Björn Töpel
2018-05-16 14:28   ` Jesper Dangaard Brouer
2018-05-16 14:28     ` [Intel-wired-lan] " Jesper Dangaard Brouer
2018-05-16 14:38     ` Magnus Karlsson
2018-05-16 14:38       ` [Intel-wired-lan] " Magnus Karlsson
2018-05-16 15:38       ` Magnus Karlsson
2018-05-16 15:38         ` [Intel-wired-lan] " Magnus Karlsson
2018-05-16 18:53         ` Jesper Dangaard Brouer
2018-05-16 18:53           ` [Intel-wired-lan] " Jesper Dangaard Brouer
2018-05-17 21:31   ` Jesper Dangaard Brouer
2018-05-17 21:31     ` [Intel-wired-lan] " Jesper Dangaard Brouer
2018-05-18  4:23     ` Björn Töpel
2018-05-18  4:23       ` [Intel-wired-lan] " Björn Töpel
2018-05-16 10:47 ` [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support Jesper Dangaard Brouer
2018-05-16 10:47   ` [Intel-wired-lan] " Jesper Dangaard Brouer
2018-05-16 17:04 ` Alexei Starovoitov
2018-05-16 17:04   ` [Intel-wired-lan] " Alexei Starovoitov
2018-05-16 17:49   ` Björn Töpel
2018-05-16 17:49     ` [Intel-wired-lan] " Björn Töpel
2018-05-16 18:14   ` Jeff Kirsher
2018-05-16 18:14     ` Jeff Kirsher
