* [RFC PATCH 00/24] Introducing AF_XDP support
@ 2018-01-31 13:53 Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton Björn Töpel
                   ` (28 more replies)
  0 siblings, 29 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This RFC introduces a new address family called AF_XDP that is
optimized for high-performance packet processing and zero-copy
semantics. Throughput improvements of up to 20x compared to AF_PACKET
V2 and V3 have been measured with the included micro benchmarks. We
would be grateful for your feedback. Note that this is the follow-up
RFC to the AF_PACKET V4 RFC from November last year. The feedback from
that submission and from the presentation at NetdevConf in Seoul was
to create a new address family instead of building on top of
AF_PACKET. AF_XDP is this new address family.
 
The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
level is that TX and RX descriptors are separated from packet
buffers. An RX or TX descriptor points to a data buffer in a packet
buffer area. RX and TX can share the same packet buffer so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, then
the descriptor that points to that packet buffer can be changed to
point to another buffer and reused right away. This again avoids
copying data.
 
The RX and TX descriptor rings are registered with the setsockopts
XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
area is allocated by user space and registered with the kernel using
the new XDP_MEM_REG setsockopt. All three of these areas are shared
between user space and kernel space. The socket is then bound with a
bind() call to a device and a specific queue id on that device, and it
is not until bind is completed that traffic starts to flow.
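
To make the setup flow concrete, here is a rough user space sketch
using the structs from the if_xdp.h header added later in this series
(error handling is omitted; the sizes and the anonymous mmap for the
packet buffer area are just illustrative, see samples/bpf/xdpsock_user.c
for the real thing):

  #include <net/if.h>
  #include <sys/mman.h>
  #include <sys/socket.h>
  #include <linux/if_xdp.h>	/* sockaddr_xdp, xdp_mr_req, xdp_ring_req */

  #ifndef AF_XDP
  #define AF_XDP  44		/* values added by this series */
  #define SOL_XDP 283
  #endif

  static int xsk_setup(const char *ifname, __u32 queue_id)
  {
      int fd = socket(AF_XDP, SOCK_RAW, 0);

      /* Register a user space packet buffer area (XDP_MEM_REG). */
      size_t size = 1024 * 2048;	/* 1024 frames of 2 KB each */
      void *bufs = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      struct xdp_mr_req mr = { .addr = (__u64)(unsigned long)bufs,
                               .len = size, .frame_size = 2048 };
      setsockopt(fd, SOL_XDP, XDP_MEM_REG, &mr, sizeof(mr));

      /* Create the RX and TX descriptor rings backed by that area. */
      struct xdp_ring_req req = { .mr_fd = fd, .desc_nr = 256 };
      setsockopt(fd, SOL_XDP, XDP_RX_RING, &req, sizeof(req));
      setsockopt(fd, SOL_XDP, XDP_TX_RING, &req, sizeof(req));

      /* Bind to a device/queue pair; traffic flows only after this. */
      struct sockaddr_xdp addr = { .sxdp_family = AF_XDP,
                                   .sxdp_ifindex = if_nametoindex(ifname),
                                   .sxdp_queue_id = queue_id };
      bind(fd, (struct sockaddr *)&addr, sizeof(addr));
      return fd;
  }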
 
An XDP program can be loaded to direct part of the traffic on that
device and queue id to user space through a new XDP redirect action,
bpf_xdpsk_redirect, that sends a packet up to the socket in user
space. All the other XDP actions work just as
before. Note that the current RFC requires the user to load an XDP
program to get any traffic to user space (for example, to send all
traffic to user space, use the one-liner program "return
bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
this requirement and sends all traffic from a queue to user space if
an AF_XDP socket is bound to it.
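
For reference, that "all traffic to user space" XDP program could look
roughly as follows; this is essentially what samples/bpf/xdpsock_kern.c
in this series does, but the section and program names here are just
illustrative:

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("xdp_sock")
  int xdp_sock_prog(struct xdp_md *ctx)
  {
      /* Send every packet on the bound queue up to the AF_XDP socket. */
      return bpf_xdpsk_redirect();
  }

  char _license[] SEC("license") = "GPL";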
 
AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
is no specific mode called XDP_DRV_ZC). If the driver does not have
support for XDP, or if XDP_SKB is explicitly chosen when loading the
XDP program, XDP_SKB mode is employed. This mode uses SKBs together
with the generic XDP support and copies the data out to user space; it
is a fallback mode that works for any network device. On the other
hand, if the
driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
ndo_xdp_flush), these NDOs, without any modifications, will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
driver support with the zero-copy user space allocator that provides
even better performance. In this mode, the networking HW (or SW driver
if it is a virtual driver like veth) DMAs/puts packets straight into
the packet buffer that is shared between user space and kernel
space. The RX and TX descriptor queues of the networking HW are NOT
shared to user space. Only the kernel can read and write these and it
is the kernel driver's responsibility to translate these HW specific
descriptors to the HW agnostic ones in the virtual descriptor rings
that user space sees. This way, a malicious user space program cannot
mess with the networking HW. This mode though requires some extensions
to XDP.
 
To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
buffer pool concept so that the same XDP driver code can be used for
buffers allocated using the page allocator (XDP_DRV), the user-space
zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
allocator/cache/recycling mechanism. The ndo_bpf call has also been
extended with two commands for registering and unregistering an XSK
socket and is in the RX case mainly used to communicate some
information about the user-space buffer pool to the driver.
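
As a rough illustration of what this looks like from a driver's point
of view, a sketch of an ndo_bpf handler is shown below. The mydrv_*
helpers are made up for illustration; the actual interface is in the
netdevice and i40e patches of this series:

  /* Hypothetical driver ndo_bpf handler. XDP_REGISTER_XSK hands the
   * driver the information it needs about the socket's user space
   * buffer pool (and TX callbacks) for a given queue id.
   */
  static int mydrv_xdp(struct net_device *dev, struct netdev_bpf *bpf)
  {
      switch (bpf->command) {
      case XDP_SETUP_PROG:
          return mydrv_xdp_setup(dev, bpf->prog);
      case XDP_REGISTER_XSK:
          return mydrv_xsk_register(dev, bpf);
      case XDP_UNREGISTER_XSK:
          return mydrv_xsk_unregister(dev, bpf);
      default:
          return -EINVAL;
      }
  }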
 
For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
but we ran into problems with this (discussed further in the
challenges section) and had to introduce a new NDO called
ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
and an explicit queue id that packets should be sent out on. In
contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
sent from the xdp socket (associated with the dev and queue
combination that was provided with the NDO call) using a callback
(get_tx_packet), and when they have been transmitted it uses another
callback (tx_completion) to signal completion of packets. These
callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
and thus does not clash with the XDP_REDIRECT use of
ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
(without ZC) is currently not supported for TX. Please have a look at
the challenges section for further discussions.
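
Sketched in pseudo-C, we imagine a driver using the new NDO and
callbacks roughly as follows. All the mydrv_* names, the return types
and the exact callback signatures below are made up for illustration;
the real code is in the i40e TX patches:

  /* Called by the xsk code when there are packets to send on queue_id. */
  static int mydrv_xdp_xmit_xsk(struct net_device *dev, u32 queue_id)
  {
      struct mydrv_ring *ring = mydrv_get_tx_ring(dev, queue_id);
      void *data;
      u32 len;

      /* Pull packets from the bound AF_XDP socket and post them on
       * the HW TX ring until either side runs out.
       */
      while (mydrv_tx_descs_free(ring) &&
             ring->get_tx_packet(ring->xsk, &data, &len))
          mydrv_post_tx_desc(ring, data, len);

      mydrv_kick_hw(ring);
      return 0;
  }

  /* Later, in the TX clean-up path (NAPI/interrupt context). */
  static void mydrv_clean_tx_xsk(struct mydrv_ring *ring, u32 completed)
  {
      /* Tell the socket that 'completed' packets are done so their
       * descriptors can be handed back to user space.
       */
      ring->tx_completion(ring->xsk, completed);
  }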
 
The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
so the user needs to steer the traffic to the zero-copy enabled queue
pair. Which queue to use is up to the user.
 
For an untrusted application, HW packet steering to a specific queue
pair (the one associated with the application) is a requirement, as
the application would otherwise be able to see other user space
processes' packets. If the HW cannot support the required packet
steering, XDP_DRV or XDP_SKB mode has to be used, as these modes do
not expose the NIC's packet buffer to user space; instead, packets are
copied from the NIC's packet buffer in the kernel into user space.
 
There is an xdpsock benchmarking/test application included. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
which we will enable AF_XDP on. Here, we use ethtool for this:
 
      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16
 
Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
 
      samples/bpf/xdpsock -i p3p2 -q 16 -l -N
 
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.
 
We have run some benchmarks on a dual-socket system with two Broadwell
E5 2660 CPUs @ 2.0 GHz with hyperthreading turned off. Each socket has
14 cores, which gives a total of 28, but only two cores are used in
these experiments: one for TX/RX and one for the user space
application. The memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is
8192 MB, and with 8 of those DIMMs in the system we have 64 GB of
total memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
Intel I40E 40Gbit/s using the i40e driver.
 
Below are the results in Mpps of the I40E NIC benchmark runs for 64
byte packets, generated by commercial packet generator HW that is
generating packets at full 40 Gbit/s line rate.
 
XDP baseline numbers without this RFC:
xdp_rxq_info --action XDP_DROP 31.3 Mpps
xdp_rxq_info --action XDP_TX   16.7 Mpps
 
XDP performance with this RFC i.e. with the buffer allocator:
XDP_DROP 21.0 Mpps
XDP_TX   11.9 Mpps
 
AF_PACKET V4 performance from previous RFC on 4.14-rc7:
Benchmark   V2     V3     V4     V4+ZC
rxdrop      0.67   0.73   0.74   33.7
txpush      0.98   0.98   0.91   19.6
l2fwd       0.66   0.71   0.67   15.5
 
AF_XDP performance:
Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
rxdrop      3.3        11.6         16.9
txpush      2.2         NA*         21.8
l2fwd       1.7         NA*         10.4
 
* NA since there is no XDP_DRV mode (without ZC) for TX in this RFC,
  see challenges below.
 
If we start by comparing XDP_SKB performance with copy mode in
AF_PACKET V4, we can see that AF_XDP delivers 3-5 times the
throughput, which is positive. We are also happy with the XDP_DRV
performance that provides 11.6 Mpps for rxdrop, and should work on any
driver implementing full XDP support. Now to the problematic part:
XDP_DRV_ZC. The txpush (TX only) benchmark shows decent results at
21.8 Mpps and is better than it was for V4, even though we have spent
no time optimizing the code in AF_XDP. (We did that in AF_PACKET V4.)
But the RX performance is cut in half, which is not good. The reason
for this is, for the most part, the new buffer allocator, which
is used for RX ZC only (at this point, see todo section). If you take
a look at the XDP baseline numbers, introducing the buffer pool
allocator drops the performance by around 30% or 10 Mpps which is
obviously not acceptable. We clearly need to give this code some
overdue performance love. But the open question is how much
overhead it will produce in the end and if this will be
acceptable. Another thing to note is that V4 provided 33.7 Mpps for
rxdrop, but with AF_XDP we are quite unlikely to get above the
XDP_DROP number of 31.3, since we are reusing the XDP infrastructure
and driver code on the RX side. So in the end, the AF_XDP XDP_DRV_ZC
numbers will likely be lower than the V4 ZC numbers.
 
We based this patch set on net-next commit 91e6dd828425 ("ipmr: Fix
ptrdiff_t print formatting").
 
Challenges: areas we would really appreciate your help on and that we
are having substantial problems with.
 
* We would like to, if possible, use ndo_xdp_xmit and ndo_xdp_flush
  instead of introducing another NDO in the form of
  ndo_xdp_xmit_xsk. The first reason we have not been able to
  accomplish this is that if both paths use ndo_xdp_xmit, they will
  create a race as ndo_xdp_xmit currently does not contain any
  locking. How to implement some type of mutual exclusion here without
  resorting to slowing down the NDO with a lock? The second problem is
  that the XDP_REDIRECT code implicitly assumes that core id = queue
  id. AF_XDP, on the other hand, explicitly specifies a queue id that
  has nothing to do with the core id (the application can run on any
  core). How to reconcile these two views in one ndo? If these two
  problems can be solved, then we would still need to introduce a
  completion callback and a get_packet callback, but this seems to be
  less challenging. This would also make it possible to run TX in the
  XDP_DRV mode (with the default page allocator).
 
* What should the buffer allocator look like and how to make it
  generic enough so it can be used by all NIC vendors? Would be great
  if you could take a look at it and come up with suggestions. As you can
  see from the change log, it took some effort to rewire the i40e code
  to use the buff pool, and we expect the same to be true for many
  other NICs. Ideas on how to introduce multiple allocators into XDP in
  a less intrusive way would be highly appreciated. Another question
  is how to create a buffer pool that incurs very little
  overhead. We do not know if the current one can be optimized to have
  an acceptable overhead as we have not started any optimization
  effort. But we will give it a try during the next week or so to see
  where it leads.
 
* In this RFC, do not use an XDP_REDIRECT action other than
  bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
  allocated buffer will then be sent to a cpu id / queue_pair through
  ndo_xdp_xmit that does not know this has been ZC allocated. It will
  then do a page_free on it and you will get a crash. How to extend
  ndo_xdp_xmit with some free/completion function that could be called
  instead of page_free?  Hopefully, the same solution can be used here
  as in the first problem item in this section.
 
Caveats with this RFC. In contrast to the last section, we believe we
have solutions for these, but we did not have time to implement
them. We chose to show you all the code sooner rather than later, even
though not everything works yet. Sorry.
 
* This RFC is more immature (read, has more bugs) than the AF_PACKET
  V4 RFC. Some known issues are mentioned here; others are unknown.
 
* We have done absolutely no optimization to this RFC. There is
  (hopefully) some substantial low-hanging fruit that we could fix
  once we get to this, to improve XDP_DRV_ZC performance to levels
  that we are not ashamed of and also bring the i40e driver to the
  same performance levels it had before our changes, which is a must.
 
* There is a race in the TX XSK clean up code in the i40e driver that
  triggers a WARN_ON_ONCE. Clearly a bug that needs to be fixed. It
  can be triggered by performing ifdown/ifup when the application is
  running, or when changing the number of queues of the device
  underneath the hood of the application. As a workaround, please
  refrain from doing these two things without restarting the
  application, as not all buffers will be returned in the TX
  path. This bug can also be triggered when killing the application,
  but has no negative effect in this case as the process will never
  execute again.
 
* Before this RFC, ndo_xdp_xmit triggered by an XDP_REDIRECT to a NIC
  never modified the page count, so the redirect code could assume
  that the page would still be valid after the NDO call. With the
  introduction of the xsk_rcv path that is called as a result of an
  XDP_REDIRECT to an AF_XDP socket, the page count will be decreased
  if the page is copied out to user space, since we have no use for it
  anymore. Our somewhat blunt solution to this is to make sure in the
  i40e driver that the refcount is never under two. Note, though, that
  with the introduction of the buffer pool, this problem
  disappears. This also means that XDP_DRV will not work out of the
  box with a Niantic NIC, since it also needs this modification to
  work. One question that we have is what should the semantics of
  ndo_xdp_xmit be? Can we always assume that the page count will never
  be changed by any netdevice that implements this NDO, or
  should we remove this assumption to gain more device implementation
  flexibility?
 
To do:
 
* Optimize performance. No optimization whatsoever was performed on
  this RFC, in contrast to the previous one for AF_PACKET V4.
 
* Loadable kernel module support.
 
* Polling has not been implemented yet.
 
* Optimize the user space sample application. It is simple but naive
  at this point. The one for AF_PACKET V4 had a number of
  optimizations that have not been introduced in the AF_XDP version.
 
* Implement a way to pick the XDP_DRV mode even if XDP_DRV_ZC is
  available. Would be nice to have for the sample application too.
 
* Introduce a notifier chain for queue changes (caused by ethtool for
  example). This would get rid of the error callback that we have at
  this point.
 
* Use one NAPI context for RX and another one for TX in i40e. This
  would make it possible to run RX on one core and TX on another for
  better performance. Today, they need to share a single core since
  they share NAPI context.
 
* Get rid of packet arrays (PA) and convert them to the buffer pool
  allocator by transferring the necessary PA functionality into the
  buffer pool. This has only been done for RX in ZC mode, while all
  the other modes are still using packet arrays. Clearly, having two
  structures with largely the same information is not a good thing.
 
* Support for AF_XDP sockets without an XDP program loaded. In this
  case all the traffic on a queue should go up to user space.
 
* Support shared packet buffers
 
* Support for packets spanning multiple frames
 
Thanks: Björn and Magnus

Björn Töpel (16):
  xsk: AF_XDP sockets buildable skeleton
  xsk: add user memory registration sockopt
  xsk: added XDP_{R,T}X_RING sockopt and supporting structures
  bpf: added bpf_xdpsk_redirect
  net: wire up xsk support in the XDP_REDIRECT path
  i40e: add support for XDP_REDIRECT
  samples/bpf: added xdpsock program
  xsk: add iterator functions to xsk_ring
  i40e: introduce external allocator support
  i40e: implemented page recycling buff_pool
  i40e: start using recycling buff_pool
  i40e: separated buff_pool interface from i40e implementation
  xsk: introduce xsk_buff_pool
  xdp: added buff_pool support to struct xdp_buff
  xsk: add support for zero copy Rx
  i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx

Magnus Karlsson (8):
  xsk: add bind support and introduce Rx functionality
  xsk: introduce Tx functionality
  netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
  netdevice: added ndo for transmitting a packet from an XDP socket
  xsk: add support for zero copy Tx
  i40e: introduced a clean_tx callback function
  i40e: introduced Tx completion callbacks
  i40e: Tx support for zero copy allocator

 drivers/net/ethernet/intel/i40e/Makefile         |    3 +-
 drivers/net/ethernet/intel/i40e/i40e.h           |   24 +
 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c |  580 +++++++++++
 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h |   15 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c   |    1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c      |  541 +++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c      |  906 +++++++++--------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h      |  119 ++-
 include/linux/buff_pool.h                        |  136 +++
 include/linux/filter.h                           |    3 +-
 include/linux/netdevice.h                        |   25 +
 include/linux/socket.h                           |    5 +-
 include/net/xdp.h                                |    1 +
 include/net/xdp_sock.h                           |   60 ++
 include/uapi/linux/bpf.h                         |    6 +-
 include/uapi/linux/if_xdp.h                      |   72 ++
 net/Kconfig                                      |    1 +
 net/Makefile                                     |    1 +
 net/core/dev.c                                   |   28 +-
 net/core/filter.c                                |   88 +-
 net/core/sock.c                                  |   12 +-
 net/xdp/Kconfig                                  |    7 +
 net/xdp/Makefile                                 |    1 +
 net/xdp/xsk.c                                    | 1142 ++++++++++++++++++++++
 net/xdp/xsk.h                                    |   31 +
 net/xdp/xsk_buff.h                               |  161 +++
 net/xdp/xsk_buff_pool.c                          |  225 +++++
 net/xdp/xsk_buff_pool.h                          |   17 +
 net/xdp/xsk_packet_array.c                       |   62 ++
 net/xdp/xsk_packet_array.h                       |  399 ++++++++
 net/xdp/xsk_ring.c                               |   61 ++
 net/xdp/xsk_ring.h                               |  419 ++++++++
 net/xdp/xsk_user_queue.h                         |   24 +
 samples/bpf/Makefile                             |    4 +
 samples/bpf/xdpsock_kern.c                       |   11 +
 samples/bpf/xdpsock_queue.h                      |   62 ++
 samples/bpf/xdpsock_user.c                       |  642 ++++++++++++
 security/selinux/hooks.c                         |    4 +-
 security/selinux/include/classmap.h              |    4 +-
 tools/testing/selftests/bpf/bpf_helpers.h        |    2 +
 40 files changed, 5408 insertions(+), 497 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
 create mode 100644 include/linux/buff_pool.h
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk.h
 create mode 100644 net/xdp/xsk_buff.h
 create mode 100644 net/xdp/xsk_buff_pool.c
 create mode 100644 net/xdp/xsk_buff_pool.h
 create mode 100644 net/xdp/xsk_packet_array.c
 create mode 100644 net/xdp/xsk_packet_array.h
 create mode 100644 net/xdp/xsk_ring.c
 create mode 100644 net/xdp/xsk_ring.h
 create mode 100644 net/xdp/xsk_user_queue.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_queue.h
 create mode 100644 samples/bpf/xdpsock_user.c

-- 
2.14.1


* [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 02/24] xsk: add user memory registration sockopt Björn Töpel
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Buildable skeleton. Move on, nothing to see.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/socket.h              |   5 +-
 include/uapi/linux/if_xdp.h         |  32 +++++++++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/sock.c                     |  12 ++--
 net/xdp/Kconfig                     |   7 ++
 net/xdp/Makefile                    |   1 +
 net/xdp/xsk.c                       | 133 ++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.h                       |  18 +++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 11 files changed, 211 insertions(+), 7 deletions(-)
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk.h

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a8c60c..ada0102ff8db 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -207,8 +207,9 @@ struct ucred {
 				 * PF_SMC protocol family that
 				 * reuses AF_INET address family
 				 */
+#define AF_XDP		44	/* XDP sockets			*/
 
-#define AF_MAX		44	/* For now.. */
+#define AF_MAX		45	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -257,6 +258,7 @@ struct ucred {
 #define PF_KCM		AF_KCM
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
+#define PF_XDP		AF_XDP
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -337,6 +339,7 @@ struct ucred {
 #define SOL_NFC		280
 #define SOL_KCM		281
 #define SOL_TLS		282
+#define SOL_XDP		283
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
new file mode 100644
index 000000000000..cd09232e16c1
--- /dev/null
+++ b/include/uapi/linux/if_xdp.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * if_xdp: XDP socket user-space interface
+ *
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _LINUX_IF_XDP_H
+#define _LINUX_IF_XDP_H
+
+#include <linux/types.h>
+
+struct sockaddr_xdp {
+	__u16	sxdp_family;
+	__u32	sxdp_ifindex;
+	__u32	sxdp_queue_id;
+};
+
+/* XDP socket options */
+#define XDP_MEM_REG	1
+#define XDP_RX_RING	2
+#define XDP_TX_RING	3
+
+#endif /* _LINUX_IF_XDP_H */
diff --git a/net/Kconfig b/net/Kconfig
index 37ec8e67af57..03e5c64b411d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -59,6 +59,7 @@ source "net/tls/Kconfig"
 source "net/xfrm/Kconfig"
 source "net/iucv/Kconfig"
 source "net/smc/Kconfig"
+source "net/xdp/Kconfig"
 
 config INET
 	bool "TCP/IP networking"
diff --git a/net/Makefile b/net/Makefile
index 14fede520840..9df8e6f827f8 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -86,3 +86,4 @@ obj-y				+= l3mdev/
 endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
+obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
diff --git a/net/core/sock.c b/net/core/sock.c
index abf4cbff99b2..4d29430f4671 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -226,7 +226,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX];
   x "AF_RXRPC" ,	x "AF_ISDN"     ,	x "AF_PHONET"   , \
   x "AF_IEEE802154",	x "AF_CAIF"	,	x "AF_ALG"      , \
   x "AF_NFC"   ,	x "AF_VSOCK"    ,	x "AF_KCM"      , \
-  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_MAX"
+  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_XDP"	, \
+  x "AF_MAX"
 
 static const char *const af_family_key_strings[AF_MAX+1] = {
 	_sock_locks("sk_lock-")
@@ -262,7 +263,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
   "rlock-AF_RXRPC" , "rlock-AF_ISDN"     , "rlock-AF_PHONET"   ,
   "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG"      ,
   "rlock-AF_NFC"   , "rlock-AF_VSOCK"    , "rlock-AF_KCM"      ,
-  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_MAX"
+  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_XDP"      ,
+  "rlock-AF_MAX"
 };
 static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_UNSPEC", "wlock-AF_UNIX"     , "wlock-AF_INET"     ,
@@ -279,7 +281,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_RXRPC" , "wlock-AF_ISDN"     , "wlock-AF_PHONET"   ,
   "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG"      ,
   "wlock-AF_NFC"   , "wlock-AF_VSOCK"    , "wlock-AF_KCM"      ,
-  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_MAX"
+  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_XDP"      ,
+  "wlock-AF_MAX"
 };
 static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_UNSPEC", "elock-AF_UNIX"     , "elock-AF_INET"     ,
@@ -296,7 +299,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_RXRPC" , "elock-AF_ISDN"     , "elock-AF_PHONET"   ,
   "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG"      ,
   "elock-AF_NFC"   , "elock-AF_VSOCK"    , "elock-AF_KCM"      ,
-  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_MAX"
+  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_XDP"      ,
+  "elock-AF_MAX"
 };
 
 /*
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
new file mode 100644
index 000000000000..90e4a7152854
--- /dev/null
+++ b/net/xdp/Kconfig
@@ -0,0 +1,7 @@
+config XDP_SOCKETS
+	bool "XDP sockets"
+	depends on BPF_SYSCALL
+	default n
+	help
+	  XDP sockets allows a channel between XDP programs and
+	  userspace applications.
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
new file mode 100644
index 000000000000..0c7631f21586
--- /dev/null
+++ b/net/xdp/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
new file mode 100644
index 000000000000..2d7c08a50c60
--- /dev/null
+++ b/net/xdp/xsk.c
@@ -0,0 +1,133 @@
+/*
+ * XDP sockets
+ *
+ * AF_XDP sockets allows a channel between XDP programs and userspace
+ * applications.
+ *
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
+
+#include <linux/if_xdp.h>
+#include <linux/init.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+
+#include "xsk.h"
+
+struct xdp_sock {
+	/* struct sock must be the first member of struct xdp_sock */
+	struct sock sk;
+};
+
+static int xsk_release(struct socket *sock)
+{
+	return 0;
+}
+
+static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	return -EOPNOTSUPP;
+}
+
+static unsigned int xsk_poll(struct file *file, struct socket *sock,
+			     struct poll_table_struct *wait)
+{
+	return -EOPNOTSUPP;
+}
+
+static int xsk_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	return -ENOPROTOOPT;
+}
+
+static int xsk_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	return -EOPNOTSUPP;
+}
+
+static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
+{
+	return -EOPNOTSUPP;
+}
+
+static int xsk_mmap(struct file *file, struct socket *sock,
+		    struct vm_area_struct *vma)
+{
+	return -EOPNOTSUPP;
+}
+
+static struct proto xsk_proto = {
+	.name =		"XDP",
+	.owner =	THIS_MODULE,
+	.obj_size =	sizeof(struct xdp_sock),
+};
+
+static const struct proto_ops xsk_proto_ops = {
+	.family =	PF_XDP,
+	.owner =	THIS_MODULE,
+	.release =	xsk_release,
+	.bind =		xsk_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname, /* XXX do we need this? */
+	.poll =		xsk_poll,
+	.ioctl =	sock_no_ioctl, /* XXX do we need this? */
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	xsk_setsockopt,
+	.getsockopt =	xsk_getsockopt,
+	/* XXX make sure we don't rely on any ioctl/{get,set}sockopt that would require CONFIG_COMPAT! */
+	.sendmsg =	xsk_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		xsk_mmap,
+	.sendpage =	sock_no_sendpage,
+	/* the rest vvv, OK to be missing implementation -- checked against NULL. */
+};
+
+static int xsk_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct net_proto_family xsk_family_ops = {
+	.family = PF_XDP,
+	.create = xsk_create,
+	.owner	= THIS_MODULE,
+};
+
+/* XXX Do we need any namespace support? _pernet_subsys and friends */
+static int __init xsk_init(void)
+{
+	int err;
+
+	err = proto_register(&xsk_proto, 0 /* no slab */);
+	if (err)
+		goto out;
+
+	err = sock_register(&xsk_family_ops);
+	if (err)
+		goto out_proto;
+
+	return 0;
+
+out_proto:
+	proto_unregister(&xsk_proto);
+out:
+	return err;
+}
+
+fs_initcall(xsk_init);
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
new file mode 100644
index 000000000000..441f8d00a9d5
--- /dev/null
+++ b/net/xdp/xsk.h
@@ -0,0 +1,18 @@
+/*
+ *  XDP sockets
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDPSOCK_H
+#define _LINUX_XDPSOCK_H
+
+#endif /* _LINUX_XDPSOCK_H */
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 8644d864e3c1..b6b959c5efb3 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1436,7 +1436,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 			return SECCLASS_QIPCRTR_SOCKET;
 		case PF_SMC:
 			return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+		case PF_XDP:
+			return SECCLASS_XDP_SOCKET;
+#if PF_MAX > 45
 #error New address family defined, please update this function.
 #endif
 		}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index acdee7795297..e2044cd358bb 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -240,9 +240,11 @@ struct security_class_mapping secclass_map[] = {
 	  { "manage_subnet", NULL } },
 	{ "bpf",
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
+	{ "xdp_socket",
+	  { COMMON_SOCK_PERMS, NULL } },
 	{ NULL }
   };
 
-#if PF_MAX > 44
+#if PF_MAX > 45
 #error New address family defined, please update secclass_map.
 #endif
-- 
2.14.1


* [RFC PATCH 02/24] xsk: add user memory registration sockopt
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-02-07 16:00   ` Willem de Bruijn
  2018-01-31 13:53 ` [RFC PATCH 03/24] xsk: added XDP_{R,T}X_RING sockopt and supporting structures Björn Töpel
                   ` (26 subsequent siblings)
  28 siblings, 1 reply; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The XDP_MEM_REG socket option allows a process to register a window of
user space memory with the kernel. This memory will later be used as
the frame data buffer area.
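
A small user space sketch of the registration (error handling omitted;
the SOL_XDP value is the one added earlier in the series, and the area
size is arbitrary). Note the constraints enforced by the code below:
the area must be page aligned and frame_size must be a power of two
between 2048 bytes and PAGE_SIZE:

  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/if_xdp.h>

  #ifndef SOL_XDP
  #define SOL_XDP 283
  #endif

  static int reg_umem(int xsk_fd)
  {
      size_t len = 16 * 1024 * 1024;	/* arbitrary example size */
      struct xdp_mr_req req = { .frame_size = 2048, .data_headroom = 0 };
      void *area;

      /* Must be page aligned; frame_size must be a power of two in
       * [2048, PAGE_SIZE]; data_headroom is rounded up to 64 bytes.
       */
      if (posix_memalign(&area, getpagesize(), len))
          return -1;

      req.addr = (__u64)(unsigned long)area;
      req.len = len;

      return setsockopt(xsk_fd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
  }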

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/if_xdp.h |   7 ++
 net/xdp/xsk.c               | 294 +++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk.h               |  19 ++-
 3 files changed, 316 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index cd09232e16c1..3f8c90c708b4 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -29,4 +29,11 @@ struct sockaddr_xdp {
 #define XDP_RX_RING	2
 #define XDP_TX_RING	3
 
+struct xdp_mr_req {
+	__u64	addr;           /* Start of packet data area */
+	__u64	len;            /* Length of packet data area */
+	__u32	frame_size;     /* Frame size */
+	__u32	data_headroom;  /* Frame head room */
+};
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2d7c08a50c60..333ce1450cc7 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -19,18 +19,235 @@
 
 #include <linux/if_xdp.h>
 #include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
 #include <linux/socket.h>
 #include <net/sock.h>
 
 #include "xsk.h"
 
+#define XSK_UMEM_MIN_FRAME_SIZE 2048
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
+	struct xsk_umem *umem;
 };
 
+static struct xdp_sock *xdp_sk(struct sock *sk)
+{
+	return (struct xdp_sock *)sk;
+}
+
+static void xsk_umem_unpin_pages(struct xsk_umem *umem)
+{
+	unsigned int i;
+
+	if (umem->pgs) {
+		for (i = 0; i < umem->npgs; i++) {
+			struct page *page = umem->pgs[i];
+
+			set_page_dirty_lock(page);
+			put_page(page);
+		}
+
+		kfree(umem->pgs);
+		umem->pgs = NULL;
+	}
+}
+
+static void xsk_umem_destroy(struct xsk_umem *umem)
+{
+	struct mm_struct *mm;
+	struct task_struct *task;
+	unsigned long diff;
+
+	if (!umem)
+		return;
+
+	xsk_umem_unpin_pages(umem);
+
+	task = get_pid_task(umem->pid, PIDTYPE_PID);
+	put_pid(umem->pid);
+	if (!task)
+		goto out;
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	diff = umem->size >> PAGE_SHIFT;
+
+	down_write(&mm->mmap_sem);
+	mm->pinned_vm -= diff;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+out:
+	kfree(umem);
+}
+
+static struct xsk_umem *xsk_umem_create(u64 addr, u64 size, u32 frame_size,
+					u32 data_headroom)
+{
+	struct xsk_umem *umem;
+	unsigned int nframes;
+	int size_chk;
+
+	if (frame_size < XSK_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+		/* Strictly speaking we could support this, if:
+		 * - huge pages, or*
+		 * - using an IOMMU, or
+		 * - making sure the memory area is consecutive
+		 * but for now, we simply say "computer says no".
+		 */
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (!is_power_of_2(frame_size))
+		return ERR_PTR(-EINVAL);
+
+	if (!PAGE_ALIGNED(addr)) {
+		/* Memory area has to be page size aligned. For
+		 * simplicity, this might change.
+		 */
+		return ERR_PTR(-EINVAL);
+	}
+
+	if ((addr + size) < addr)
+		return ERR_PTR(-EINVAL);
+
+	nframes = size / frame_size;
+	if (nframes == 0)
+		return ERR_PTR(-EINVAL);
+
+	data_headroom =	ALIGN(data_headroom, 64);
+
+	size_chk = frame_size - data_headroom - XSK_KERNEL_HEADROOM;
+	if (size_chk < 0)
+		return ERR_PTR(-EINVAL);
+
+	umem = kzalloc(sizeof(*umem), GFP_KERNEL);
+	if (!umem)
+		return ERR_PTR(-ENOMEM);
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+	umem->size = (size_t)size;
+	umem->address = (unsigned long)addr;
+	umem->frame_size = frame_size;
+	umem->nframes = nframes;
+	umem->data_headroom = data_headroom;
+	umem->pgs = NULL;
+
+	return umem;
+}
+
+static int xsk_umem_pin_pages(struct xsk_umem *umem)
+{
+	unsigned int gup_flags = FOLL_WRITE;
+	long npgs;
+	int err;
+
+	/* XXX Fix so that we don't always pin.
+	 * "copy to user" from interrupt context, but how?
+	 */
+	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_ATOMIC);
+	if (!umem->pgs)
+		return -ENOMEM;
+
+	npgs = get_user_pages(umem->address, umem->npgs,
+			      gup_flags, &umem->pgs[0], NULL);
+	if (npgs != umem->npgs) {
+		if (npgs >= 0) {
+			umem->npgs = npgs;
+			err = -ENOMEM;
+			goto out_pin;
+		}
+		err = npgs;
+		goto out_pgs;
+	}
+
+	return 0;
+
+out_pin:
+	xsk_umem_unpin_pages(umem);
+out_pgs:
+	kfree(umem->pgs);
+	umem->pgs = NULL;
+
+	return err;
+}
+
+static struct xsk_umem *xsk_mem_reg(u64 addr, u64 size, u32 frame_size,
+				    u32 data_headroom)
+{
+	unsigned long lock_limit, locked, npages;
+	int ret = 0;
+	struct xsk_umem *umem;
+
+	if (!can_do_mlock())
+		return ERR_PTR(-EPERM);
+
+	umem = xsk_umem_create(addr, size, frame_size, data_headroom);
+	if (IS_ERR(umem))
+		return umem;
+
+	npages = PAGE_ALIGN(umem->nframes * umem->frame_size) >> PAGE_SHIFT;
+
+	down_write(&current->mm->mmap_sem);
+
+	locked = npages + current->mm->pinned_vm;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (npages == 0 || npages > UINT_MAX) {
+		ret = -EINVAL;
+		goto out;
+	}
+	umem->npgs = npages;
+
+	ret = xsk_umem_pin_pages(umem);
+
+out:
+	if (ret < 0) {
+		put_pid(umem->pid);
+		kfree(umem);
+	} else {
+		current->mm->pinned_vm = locked;
+	}
+
+	up_write(&current->mm->mmap_sem);
+
+	return ret < 0 ? ERR_PTR(ret) : umem;
+}
+
 static int xsk_release(struct socket *sock)
 {
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net *net;
+
+	if (!sk)
+		return 0;
+
+	net = sock_net(sk);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, sk->sk_prot, -1);
+	local_bh_enable();
+
+	xsk_umem_destroy(xs->umem);
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+
+	sk_refcnt_debug_release(sk);
+	sock_put(sk);
+
 	return 0;
 }
 
@@ -48,6 +265,43 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
 static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, unsigned int optlen)
 {
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case XDP_MEM_REG:
+	{
+		struct xdp_mr_req req;
+		struct xsk_umem *umem;
+
+		if (optlen < sizeof(req))
+			return -EINVAL;
+		if (copy_from_user(&req, optval, sizeof(req)))
+			return -EFAULT;
+
+		umem = xsk_mem_reg(req.addr, req.len, req.frame_size,
+				   req.data_headroom);
+		if (IS_ERR(umem))
+			return PTR_ERR(umem);
+
+		lock_sock(sk);
+		if (xs->umem) { /* XXX create and check afterwards... really? */
+			release_sock(sk);
+			xsk_umem_destroy(umem);
+			return -EBUSY;
+		}
+		xs->umem = umem;
+		release_sock(sk);
+
+		return 0;
+	}
+	default:
+		break;
+	}
+
 	return -ENOPROTOOPT;
 }
 
@@ -97,10 +351,48 @@ static const struct proto_ops xsk_proto_ops = {
 	/* the rest vvv, OK to be missing implementation -- checked against NULL. */
 };
 
+static void xsk_destruct(struct sock *sk)
+{
+	if (!sock_flag(sk, SOCK_DEAD))
+		return;
+
+	sk_refcnt_debug_dec(sk);
+}
+
 static int xsk_create(struct net *net, struct socket *sock, int protocol,
 		      int kern)
 {
-	return -EOPNOTSUPP;
+	struct sock *sk;
+
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
+		return -EPERM;
+	if (sock->type != SOCK_RAW)
+		return -ESOCKTNOSUPPORT;
+
+	/* XXX Require ETH_P_IP? Something else? */
+	if (protocol)
+		return -EPROTONOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+
+	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
+	if (!sk)
+		return -ENOBUFS;
+
+	sock->ops = &xsk_proto_ops;
+
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_XDP;
+
+	sk->sk_destruct = xsk_destruct;
+	sk_refcnt_debug_inc(sk);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, &xsk_proto, 1);
+	local_bh_enable();
+
+	return 0;
 }
 
 static const struct net_proto_family xsk_family_ops = {
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index 441f8d00a9d5..71559374645b 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -12,7 +12,20 @@
  * more details.
  */
 
-#ifndef _LINUX_XDPSOCK_H
-#define _LINUX_XDPSOCK_H
+#ifndef _LINUX_XSK_H
+#define _LINUX_XSK_H
 
-#endif /* _LINUX_XDPSOCK_H */
+#define XSK_KERNEL_HEADROOM 256 /* Headrom for XDP */
+
+struct xsk_umem {
+	struct pid *pid;
+	struct page **pgs;
+	unsigned long address;
+	size_t size;
+	u32 npgs;
+	u32 frame_size;
+	u32 nframes;
+	u32 data_headroom;
+};
+
+#endif /* _LINUX_XSK_H */
-- 
2.14.1


* [RFC PATCH 03/24] xsk: added XDP_{R,T}X_RING sockopt and supporting structures
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 02/24] xsk: add user memory registration sockopt Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 04/24] xsk: add bind support and introduce Rx functionality Björn Töpel
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit contains setup code for the shared user/kernel rings. The
rings are used for passing ownership of frame data buffers via
descriptors between the kernel and the user space process.

We're also introducing some additional structures:

 * xsk_packet_array: A batching/caching wrapper on top of the
                     descriptor ring.
 * xsk_buff: The xsk_buff is an entry in the user-registered frame
             data area. It can be seen as a decorated descriptor entry.
 * xsk_buff_info: Container of xsk_buffs.
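
On the user space side, the rings are mmap()ed via the pgoff constants
introduced below. A rough sketch of just the mapping step follows; the
exact in-memory ring layout is handled by samples/bpf/xdpsock_queue.h
later in the series:

  #include <sys/mman.h>
  #include <linux/if_xdp.h>

  /* Map one of the descriptor rings of a configured AF_XDP socket.
   * 'size' must match the ring size set up with XDP_{R,T}X_RING and
   * pgoff is XDP_PGOFF_RX_RING or XDP_PGOFF_TX_RING.
   */
  static void *xsk_map_ring(int fd, size_t size, off_t pgoff)
  {
      void *ring = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, pgoff);

      return ring == MAP_FAILED ? NULL : ring;
  }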

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  33 ++++
 net/xdp/Makefile            |   2 +-
 net/xdp/xsk.c               | 127 +++++++++++++-
 net/xdp/xsk_buff.h          | 161 ++++++++++++++++++
 net/xdp/xsk_packet_array.c  |  62 +++++++
 net/xdp/xsk_packet_array.h  | 394 ++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_ring.c          |  60 +++++++
 net/xdp/xsk_ring.h          | 307 ++++++++++++++++++++++++++++++++++
 net/xdp/xsk_user_queue.h    |  24 +++
 9 files changed, 1168 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp/xsk_buff.h
 create mode 100644 net/xdp/xsk_packet_array.c
 create mode 100644 net/xdp/xsk_packet_array.h
 create mode 100644 net/xdp/xsk_ring.c
 create mode 100644 net/xdp/xsk_ring.h
 create mode 100644 net/xdp/xsk_user_queue.h

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 3f8c90c708b4..3a10df302a1e 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -36,4 +36,37 @@ struct xdp_mr_req {
 	__u32	data_headroom;  /* Frame head room */
 };
 
+struct xdp_ring_req {
+	__u32   mr_fd;      /* FD of packet buffer area registered
+			     * with XDP_MEM_REG
+			     */
+	__u32   desc_nr;    /* Number of descriptors in ring */
+};
+
+/* Pgoff for mmaping the rings */
+#define XDP_PGOFF_RX_RING 0
+#define XDP_PGOFF_TX_RING 0x80000000
+
+/* XDP user space ring structure */
+#define XDP_DESC_KERNEL 0x0080 /* The descriptor is owned by the kernel */
+#define XDP_PKT_CONT    1      /* The packet continues in the next descriptor */
+
+struct xdp_desc {
+	__u32 idx;
+	__u32 len;
+	__u16 offset;
+	__u8  error; /* an errno */
+	__u8  flags;
+	__u8  padding[4];
+};
+
+struct xdp_queue {
+	struct xdp_desc *ring;
+
+	__u32 avail_idx;
+	__u32 last_used_idx;
+	__u32 num_free;
+	__u32 ring_mask;
+};
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index 0c7631f21586..b9d5d6b8823c 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xsk_ring.o xsk_packet_array.o
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 333ce1450cc7..34294ac2f75f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -23,15 +23,30 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/socket.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
 #include <net/sock.h>
 
 #include "xsk.h"
+#include "xsk_buff.h"
+#include "xsk_ring.h"
 
 #define XSK_UMEM_MIN_FRAME_SIZE 2048
 
+struct xsk_info {
+	struct xsk_queue *q;
+	struct xsk_umem *umem;
+	struct socket *mrsock;
+	struct xsk_buff_info *buff_info;
+};
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
+	struct xsk_info rx;
+	struct xsk_info tx;
 	struct xsk_umem *umem;
 };
 
@@ -225,6 +240,81 @@ static struct xsk_umem *xsk_mem_reg(u64 addr, u64 size, u32 frame_size,
 	return ret < 0 ? ERR_PTR(ret) : umem;
 }
 
+static struct socket *xsk_umem_sock_get(int fd)
+{
+	struct socket *sock;
+	int err;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return ERR_PTR(err);
+
+	/* Parameter checking */
+	if (sock->sk->sk_family != PF_XDP) {
+		err = -ESOCKTNOSUPPORT;
+		goto out;
+	}
+
+	if (!xdp_sk(sock->sk)->umem) {
+		err = -ESOCKTNOSUPPORT;
+		goto out;
+	}
+
+	return sock;
+out:
+	sockfd_put(sock);
+	return ERR_PTR(err);
+}
+
+static int xsk_init_ring(struct sock *sk, int mr_fd, u32 desc_nr,
+			 struct xsk_info *info)
+{
+	struct xsk_umem *umem;
+	struct socket *mrsock;
+
+	if (desc_nr == 0)
+		return -EINVAL;
+
+	mrsock = xsk_umem_sock_get(mr_fd);
+	if (IS_ERR(mrsock))
+		return PTR_ERR(mrsock);
+	umem = xdp_sk(mrsock->sk)->umem;
+
+	/* Check if umem is from this socket, if so do not make
+	 * circular references.
+	 */
+	lock_sock(sk);
+	if (sk->sk_socket == mrsock)
+		sockfd_put(mrsock);
+
+	info->q = xskq_create(desc_nr);
+	if (!info->q)
+		goto out_queue;
+
+	info->umem = umem;
+	info->mrsock = mrsock;
+	release_sock(sk);
+	return 0;
+
+out_queue:
+	release_sock(sk);
+	return -ENOMEM;
+}
+
+static int xsk_init_rx_ring(struct sock *sk, int mr_fd, u32 desc_nr)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	return xsk_init_ring(sk, mr_fd, desc_nr, &xs->rx);
+}
+
+static int xsk_init_tx_ring(struct sock *sk, int mr_fd, u32 desc_nr)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	return xsk_init_ring(sk, mr_fd, desc_nr, &xs->tx);
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -241,6 +331,8 @@ static int xsk_release(struct socket *sock)
 	local_bh_enable();
 
 	xsk_umem_destroy(xs->umem);
+	xskq_destroy(xs->rx.q);
+	xskq_destroy(xs->tx.q);
 
 	sock_orphan(sk);
 	sock->sk = NULL;
@@ -298,6 +390,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 		return 0;
 	}
+	case XDP_RX_RING:
+	case XDP_TX_RING:
+	{
+		struct xdp_ring_req req;
+
+		if (optlen < sizeof(req))
+			return -EINVAL;
+		if (copy_from_user(&req, optval, sizeof(req)))
+			return -EFAULT;
+
+		if (optname == XDP_TX_RING)
+			return xsk_init_tx_ring(sk, req.mr_fd, req.desc_nr);
+
+		return xsk_init_rx_ring(sk, req.mr_fd, req.desc_nr);
+	}
 	default:
 		break;
 	}
@@ -319,7 +426,25 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 static int xsk_mmap(struct file *file, struct socket *sock,
 		    struct vm_area_struct *vma)
 {
-	return -EOPNOTSUPP;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct xsk_queue *q;
+	unsigned long pfn;
+
+	if (vma->vm_pgoff == XDP_PGOFF_RX_RING)
+		q = xs->rx.q;
+	else if (vma->vm_pgoff == XDP_PGOFF_TX_RING >> PAGE_SHIFT)
+		q = xs->tx.q;
+	else
+		return -EINVAL;
+
+	if (size != xskq_get_ring_size(q))
+		return -EFBIG;
+
+	pfn = virt_to_phys(xskq_get_ring_address(q)) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn,
+			       size, vma->vm_page_prot);
 }
 
 static struct proto xsk_proto = {
diff --git a/net/xdp/xsk_buff.h b/net/xdp/xsk_buff.h
new file mode 100644
index 000000000000..18ead1bc4482
--- /dev/null
+++ b/net/xdp/xsk_buff.h
@@ -0,0 +1,161 @@
+#ifndef XSK_BUFF_H_
+#define XSK_BUFF_H_
+
+#include <linux/types.h> /* dma_addr_t */
+#include <linux/vmalloc.h>
+#include <linux/dma-mapping.h>
+
+#include "xsk.h"
+
+struct xsk_buff {
+	void *data;
+	dma_addr_t dma;
+	unsigned int len; /* XXX really needed? */
+	unsigned int id;
+	unsigned int offset;
+	struct xsk_buff *next;
+};
+
+/* Rx: data + umem->data_headroom + XDP_PACKET_HEADROOM */
+/* Tx: data + desc->offset */
+
+struct xsk_buff_info {
+	struct xsk_umem *umem;
+	struct device *dev;
+	enum dma_data_direction dir;
+	unsigned long attrs;
+	unsigned int rx_headroom;
+	unsigned int buff_len;
+	unsigned int nbuffs;
+	struct xsk_buff buffs[0];
+
+};
+
+static inline int xsk_buff_dma_map(struct xsk_buff_info *info,
+				   struct device *dev,
+				   enum dma_data_direction dir,
+				   unsigned long attrs)
+{
+	struct xsk_buff *b;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	if (info->dev)
+		return -1; /* Already mapped */
+
+	for (i = 0; i < info->nbuffs; i++) {
+		b = &info->buffs[i];
+		dma = dma_map_single_attrs(dev, b->data, b->len, dir, attrs);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		b->dma = dma;
+	}
+
+	info->dev = dev;
+	info->dir = dir;
+	info->attrs = attrs;
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		b = &info->buffs[i];
+		dma_unmap_single_attrs(info->dev, b->dma, b->len,
+				       info->dir, info->attrs);
+		b->dma = 0;
+	}
+
+	return -1;
+}
+
+static inline void xsk_buff_dma_unmap(struct xsk_buff_info *info)
+{
+	struct xsk_buff *b;
+	unsigned int i;
+
+	if (!info->dev)
+		return; /* Nothing mapped! */
+
+	for (i = 0; i < info->nbuffs; i++) {
+		b = &info->buffs[i];
+		dma_unmap_single_attrs(info->dev, b->dma, b->len,
+				       info->dir, info->attrs);
+		b->dma = 0;
+	}
+
+	info->dev = NULL;
+	info->dir = DMA_NONE;
+	info->attrs = 0;
+}
+
+/* --- */
+
+static inline struct xsk_buff *xsk_buff_info_get_buff(
+	struct xsk_buff_info *info,
+	u32 id)
+{
+	/* XXX remove */
+	if (id >= info->nbuffs) {
+		WARN(1, "%s bad id\n", __func__);
+		return NULL;
+	}
+
+	return &info->buffs[id];
+}
+
+static inline unsigned int xsk_buff_info_get_rx_headroom(
+	struct xsk_buff_info *info)
+{
+	return info->rx_headroom;
+}
+
+static inline unsigned int xsk_buff_info_get_buff_len(
+	struct xsk_buff_info *info)
+{
+	return info->buff_len;
+}
+
+static inline struct xsk_buff_info *xsk_buff_info_create(struct xsk_umem *umem)
+{
+	struct xsk_buff_info *buff_info;
+	unsigned int id = 0;
+	void *data, *end;
+	u32 i;
+
+	buff_info = vzalloc(sizeof(*buff_info) +
+			    sizeof(struct xsk_buff) * umem->nframes);
+	if (!buff_info)
+		return NULL;
+
+	buff_info->umem = umem;
+	buff_info->rx_headroom = umem->data_headroom;
+	buff_info->buff_len = umem->frame_size;
+	buff_info->nbuffs = umem->nframes;
+
+	for (i = 0; i < umem->npgs; i++) {
+		data = page_address(umem->pgs[i]);
+		end = data + PAGE_SIZE;
+		while (data < end) {
+			struct xsk_buff *buff = &buff_info->buffs[id];
+
+			buff->data = data;
+			buff->len = buff_info->buff_len;
+			buff->id = id;
+			buff->offset = buff_info->rx_headroom;
+
+			data += buff_info->buff_len;
+			id++;
+		}
+	}
+
+	return buff_info;
+}
+
+static inline void xsk_buff_info_destroy(struct xsk_buff_info *info)
+{
+	xsk_buff_dma_unmap(info);
+	vfree(info);
+}
+
+#endif /* XSK_BUFF_H_ */
diff --git a/net/xdp/xsk_packet_array.c b/net/xdp/xsk_packet_array.c
new file mode 100644
index 000000000000..f1c3fad1e61b
--- /dev/null
+++ b/net/xdp/xsk_packet_array.c
@@ -0,0 +1,62 @@
+/*
+ *  XDP packet arrays
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/slab.h>
+
+#include "xsk_packet_array.h"
+
+/**
+ * xskpa_create - Create new packet array
+ * @q_ops: opaque reference to queue associated with this packet array
+ * @buff_info: buffer info
+ * @elems: number of elements
+ *
+ * Returns a reference to the new packet array or NULL for failure
+ **/
+struct xsk_packet_array *xskpa_create(struct xsk_user_queue *q_ops,
+				      struct xsk_buff_info *buff_info,
+				      size_t elems)
+{
+	struct xsk_packet_array *arr;
+
+	if (!is_power_of_2(elems))
+		return NULL;
+
+	arr = kzalloc(sizeof(*arr) + elems * sizeof(struct xdp_desc),
+		      GFP_KERNEL);
+	if (!arr)
+		return NULL;
+
+	arr->q_ops = q_ops;
+	arr->buff_info = buff_info;
+	arr->mask = elems - 1;
+	return arr;
+}
+
+void xskpa_destroy(struct xsk_packet_array *a)
+{
+	struct xsk_frame_set f;
+
+	if (a) {
+		/* Flush all outstanding requests. */
+		if (xskpa_get_flushable_frame_set(a, &f)) {
+			do {
+				xskf_set_frame(&f, 0, 0, true);
+			} while (xskf_next_frame(&f));
+		}
+
+		WARN_ON_ONCE(xskpa_flush(a));
+		kfree(a);
+	}
+}
diff --git a/net/xdp/xsk_packet_array.h b/net/xdp/xsk_packet_array.h
new file mode 100644
index 000000000000..1f7544dee443
--- /dev/null
+++ b/net/xdp/xsk_packet_array.h
@@ -0,0 +1,394 @@
+/*
+ *  XDP packet arrays
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_PACKET_ARRAY_H
+#define _LINUX_XDP_PACKET_ARRAY_H
+
+#include <linux/dma-direction.h>
+#include <linux/if_xdp.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+
+#include "xsk.h"
+#include "xsk_buff.h"
+#include "xsk_user_queue.h"
+
+/**
+ * struct xsk_packet_array - An array of packets/frames
+ *
+ * @q_ops:
+ * @buff_info:
+ * @start: the first packet that has not been processed
+ * @curr: the packet that is currently being processed
+ * @end: the last packet in the array
+ * @mask: convenience variable for internal operations on the array
+ * @items: the actual descriptors to frames/packets that are in the array
+ **/
+struct xsk_packet_array {
+	struct xsk_user_queue *q_ops;
+	struct xsk_buff_info *buff_info;
+	u32 start;
+	u32 curr;
+	u32 end;
+	u32 mask;
+	struct xdp_desc items[0];
+};
+
+/**
+ * struct xsk_frame_set - A view of a packet array consisting of
+ *			  one or more frames
+ *
+ * @pkt_arr: the packet array this frame set is located in
+ * @start: the first frame that has not been processed
+ * @curr: the frame that is currently being processed
+ * @end: the last frame in the frame set
+ *
+ * This frame set can either be one or more frames or a single packet
+ * consisting of one or more frames. xskf_ functions with packet in the
+ * name return a frame set representing a packet, while the other
+ * xskf_ functions return one or more frames not taking into account if
+ * they constitute a packet or not.
+ **/
+struct xsk_frame_set {
+	struct xsk_packet_array *pkt_arr;
+	u32 start;
+	u32 curr;
+	u32 end;
+};
+
+static inline struct xsk_user_queue *xsk_user_queue(struct xsk_packet_array *a)
+{
+	return a->q_ops;
+}
+
+static inline struct xdp_desc *xskf_get_desc(struct xsk_frame_set *p)
+{
+	return &p->pkt_arr->items[p->curr & p->pkt_arr->mask];
+}
+
+/**
+ * xskf_reset - Start to traverse the frames in the set from the beginning
+ * @p: pointer to frame set
+ **/
+static inline void xskf_reset(struct xsk_frame_set *p)
+{
+	p->curr = p->start;
+}
+
+static inline u32 xskf_get_frame_id(struct xsk_frame_set *p)
+{
+	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].idx;
+}
+
+static inline void xskf_set_error(struct xsk_frame_set *p, int errno)
+{
+	p->pkt_arr->items[p->curr & p->pkt_arr->mask].error = errno;
+}
+
+static inline u32 xskf_get_frame_len(struct xsk_frame_set *p)
+{
+	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].len;
+}
+
+/**
+ * xskf_set_frame - Sets the properties of a frame
+ * @p: pointer to frame set
+ * @len: the length in bytes of the data in the frame
+ * @offset: offset to start of data in frame
+ * @is_eop: Set if this is the last frame of the packet
+ **/
+static inline void xskf_set_frame(struct xsk_frame_set *p, u32 len, u16 offset,
+				  bool is_eop)
+{
+	struct xdp_desc *d =
+		&p->pkt_arr->items[p->curr & p->pkt_arr->mask];
+
+	d->len = len;
+	d->offset = offset;
+	if (!is_eop)
+		d->flags |= XDP_PKT_CONT;
+}
+
+static inline void xskf_set_frame_no_offset(struct xsk_frame_set *p,
+					    u32 len, bool is_eop)
+{
+	struct xdp_desc *d =
+		&p->pkt_arr->items[p->curr & p->pkt_arr->mask];
+
+	d->len = len;
+	if (!is_eop)
+		d->flags |= XDP_PKT_CONT;
+}
+
+/**
+ * xskf_get_data - Gets a pointer to the start of the packet
+ *
+ * @p: Pointer to the frame set
+ *
+ * Returns a pointer to the start of the data that the current descriptor
+ * is pointing to
+ **/
+static inline void *xskf_get_data(struct xsk_frame_set *p)
+{
+	struct xdp_desc *desc = xskf_get_desc(p);
+	struct xsk_buff *buff;
+
+	buff = xsk_buff_info_get_buff(p->pkt_arr->buff_info, desc->idx);
+
+	return buff->data + desc->offset;
+}
+
+static inline u32 xskf_get_data_offset(struct xsk_frame_set *p)
+{
+	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].offset;
+}
+
+/**
+ * xskf_next_frame - Go to next frame in frame set
+ * @p: pointer to frame set
+ *
+ * Returns true if there is another frame in the frame set.
+ * Advances curr pointer.
+ **/
+static inline bool xskf_next_frame(struct xsk_frame_set *p)
+{
+	if (p->curr + 1 == p->end)
+		return false;
+
+	p->curr++;
+	return true;
+}
+
+/**
+ * xskf_get_packet_len - Length of packet
+ * @p: pointer to packet
+ *
+ * Returns the length of the packet in bytes.
+ * Resets curr pointer of packet.
+ **/
+static inline u32 xskf_get_packet_len(struct xsk_frame_set *p)
+{
+	u32 len = 0;
+
+	xskf_reset(p);
+
+	do {
+		len += xskf_get_frame_len(p);
+	} while (xskf_next_frame(p));
+
+	return len;
+}
+
+/**
+ * xskf_packet_completed - Mark packet as completed
+ * @p: pointer to packet
+ *
+ * Resets curr pointer of packet.
+ **/
+static inline void xskf_packet_completed(struct xsk_frame_set *p)
+{
+	xskf_reset(p);
+
+	do {
+		p->pkt_arr->items[p->curr & p->pkt_arr->mask].flags |=
+			XSK_FRAME_COMPLETED;
+	} while (xskf_next_frame(p));
+}
+
+/**
+ * xskpa_flush_completed - Flushes only frames marked as completed
+ * @a: pointer to packet array
+ *
+ * Returns 0 for success and -1 for failure
+ **/
+static inline int xskpa_flush_completed(struct xsk_packet_array *a)
+{
+	u32 avail = a->curr - a->start;
+	int ret;
+
+	if (avail == 0)
+		return 0; /* nothing to flush */
+
+	ret = xsk_user_queue(a)->enqueue_completed(a, avail);
+	if (ret < 0)
+		return -1;
+
+	a->start += ret;
+	return 0;
+}
+
+/**
+ * xskpa_next_packet - Get next packet in array and advance curr pointer
+ * @a: pointer to packet array
+ * @p: supplied pointer to packet structure that is filled in by function
+ *
+ * Returns true if there is a packet, false otherwise. Packet returned in *p.
+ **/
+static inline bool xskpa_next_packet(struct xsk_packet_array *a,
+				     struct xsk_frame_set *p)
+{
+	u32 avail = a->end - a->curr;
+
+	if (avail == 0)
+		return false; /* empty */
+
+	p->pkt_arr = a;
+	p->start = a->curr;
+	p->curr = a->curr;
+	p->end = a->curr;
+
+	/* XXX Sanity check for too-many-frames packets? */
+	while (a->items[p->end++ & a->mask].flags & XDP_PKT_CONT) {
+		avail--;
+		if (avail == 0)
+			return false;
+	}
+
+	a->curr += (p->end - p->start);
+	return true;
+}
+
+/**
+ * xskpa_populate - Populate an array with packets from associated queue
+ * @a: pointer to packet array
+ **/
+static inline void xskpa_populate(struct xsk_packet_array *a)
+{
+	u32 cnt, free = a->mask + 1 - (a->end - a->start);
+
+	if (free == 0)
+		return; /* no space! */
+
+	cnt = xsk_user_queue(a)->dequeue(a, free);
+	a->end += cnt;
+}
+
+/**
+ * xskpa_next_frame - Get next frame in array and advance curr pointer
+ * @a: pointer to packet array
+ * @p: supplied pointer to packet structure that is filled in by function
+ *
+ * Returns true if there is a frame, false otherwise. Frame returned in *p.
+ **/
+static inline bool xskpa_next_frame(struct xsk_packet_array *a,
+				    struct xsk_frame_set *p)
+{
+	u32 avail = a->end - a->curr;
+
+	if (avail == 0)
+		return false; /* empty */
+
+	p->pkt_arr = a;
+	p->start = a->curr;
+	p->curr = a->curr;
+	p->end = ++a->curr;
+
+	return true;
+}
+
+/**
+ * xskpa_next_frame_populate - Get next frame and populate array if empty
+ * @a: pointer to packet array
+ * @p: supplied pointer to packet structure that is filled in by function
+ *
+ * Returns true if there is a frame, false otherwise. Frame returned in *p.
+ **/
+static inline bool xskpa_next_frame_populate(struct xsk_packet_array *a,
+					     struct xsk_frame_set *p)
+{
+	bool more_frames;
+
+	more_frames = xskpa_next_frame(a, p);
+	if (!more_frames) {
+		xskpa_populate(a);
+		more_frames = xskpa_next_frame(a, p);
+	}
+
+	return more_frames;
+}
+
+/**
+ * xskpa_get_flushable_frame_set - Create a frame set of the flushable region
+ * @a: pointer to packet array
+ * @p: frame set
+ *
+ * Returns true for success and false for failure
+ **/
+static inline bool xskpa_get_flushable_frame_set(struct xsk_packet_array *a,
+						 struct xsk_frame_set *p)
+{
+	u32 curr = READ_ONCE(a->curr);
+	u32 avail = curr - a->start;
+
+	if (avail == 0)
+		return false; /* empty */
+
+	p->pkt_arr = a;
+	p->start = a->start;
+	p->curr = a->start;
+	p->end = curr;
+
+	return true;
+}
+
+static inline int __xskpa_flush(struct xsk_packet_array *a, u32 npackets)
+{
+	int ret;
+
+	if (npackets == 0)
+		return 0; /* nothing to flush */
+
+	ret = xsk_user_queue(a)->enqueue(a, npackets);
+	if (ret < 0)
+		return ret;
+
+	a->start += npackets;
+	return 0;
+}
+
+/**
+ * xskpa_flush - Flush processed packets to associated queue
+ * @a: pointer to packet array
+ *
+ * Returns 0 for success and -errno for failure
+ **/
+static inline int xskpa_flush(struct xsk_packet_array *a)
+{
+	u32 curr = READ_ONCE(a->curr);
+	u32 avail = curr - a->start;
+
+	return __xskpa_flush(a, avail);
+}
+
+/**
+ * xskpa_flush_n - Flush N processed packets to associated queue
+ * @a: pointer to packet array
+ * @npackets: number of packets to flush
+ *
+ * Returns 0 for success and -errno for failure
+ **/
+static inline int xskpa_flush_n(struct xsk_packet_array *a, u32 npackets)
+{
+	if (npackets > a->curr - a->start)
+		return -ENOSPC;
+
+	return __xskpa_flush(a, npackets);
+}
+
+struct xsk_packet_array *xskpa_create(struct xsk_user_queue *q_ops,
+				      struct xsk_buff_info *buff_info,
+				      size_t elems);
+void xskpa_destroy(struct xsk_packet_array *a);
+
+#endif /* _LINUX_XDP_PACKET_ARRAY_H */
diff --git a/net/xdp/xsk_ring.c b/net/xdp/xsk_ring.c
new file mode 100644
index 000000000000..11b590506ddf
--- /dev/null
+++ b/net/xdp/xsk_ring.c
@@ -0,0 +1,60 @@
+/*
+ *  XDP user-space ring structure
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/slab.h>
+
+#include "xsk_ring.h"
+
+/**
+ * xskq_create - Create an XDP queue
+ *
+ * @nentries: Number of descriptor entries in the queue
+ *
+ * Returns a reference to the newly created queue, or NULL if the
+ * allocation fails.
+ **/
+struct xsk_queue *xskq_create(u32 nentries)
+{
+	struct xsk_queue *q;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return NULL;
+
+	q->ring = kcalloc(nentries, sizeof(*q->ring), GFP_KERNEL);
+	if (!q->ring) {
+		kfree(q);
+		return NULL;
+	}
+
+	q->queue_ops.enqueue = xskq_enqueue_from_array;
+	q->queue_ops.enqueue_completed = xskq_enqueue_completed_from_array;
+	q->queue_ops.dequeue = xskq_dequeue_to_array;
+	q->used_idx = 0;
+	q->last_avail_idx = 0;
+	q->ring_mask = nentries - 1;
+	q->num_free = 0;
+	q->nentries = nentries;
+
+	return q;
+}
+
+void xskq_destroy(struct xsk_queue *q)
+{
+	if (!q)
+		return;
+
+	kfree(q->ring);
+	kfree(q);
+}
diff --git a/net/xdp/xsk_ring.h b/net/xdp/xsk_ring.h
new file mode 100644
index 000000000000..c9d61195ab2d
--- /dev/null
+++ b/net/xdp/xsk_ring.h
@@ -0,0 +1,307 @@
+/*
+ *  XDP user-space ring structure
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_RING_H
+#define _LINUX_XDP_RING_H
+
+#include <linux/types.h>
+#include <linux/if_xdp.h>
+
+#include "xsk.h"
+#include "xsk_buff.h"
+#include "xsk_packet_array.h"
+
+struct xsk_queue {
+	/* struct xsk_user_queue has to be first */
+	struct xsk_user_queue queue_ops;
+	struct xdp_desc *ring;
+
+	u32 used_idx;
+	u32 last_avail_idx;
+	u32 ring_mask;
+	u32 num_free;
+
+	u32 nentries;
+	struct xsk_buff_info *buff_info;
+	enum xsk_validation validation;
+};
+
+static inline unsigned int xsk_get_data_headroom(struct xsk_umem *umem)
+{
+	return umem->data_headroom + XDP_KERNEL_HEADROOM;
+}
+
+/**
+ * xskq_is_valid_entry - Is the entry valid?
+ *
+ * @q: Pointer to the queue the descriptor resides in
+ * @d: Pointer to the descriptor to examine
+ *
+ * The type of validation performed is determined by q->validation.
+ * Returns true if the entry is valid, otherwise false
+ **/
+static inline bool xskq_is_valid_entry(struct xsk_queue *q,
+				       struct xdp_desc *d)
+{
+	unsigned int buff_len;
+
+	if (q->validation == XSK_VALIDATION_NONE)
+		return true;
+
+	if (unlikely(d->idx >= q->buff_info->nbuffs)) {
+		d->error = EBADF;
+		return false;
+	}
+
+	if (q->validation == XSK_VALIDATION_RX) {
+		d->offset = xsk_buff_info_get_rx_headroom(q->buff_info);
+		return true;
+	}
+
+	buff_len = xsk_buff_info_get_buff_len(q->buff_info);
+	/* XSK_VALIDATION_TX */
+	if (unlikely(d->len > buff_len || d->len == 0 || d->offset > buff_len ||
+		     d->offset + d->len > buff_len)) {
+		d->error = EBADF;
+		return false;
+	}
+
+	return true;
+}
+
+/**
+ * xskq_nb_avail - Returns the number of available entries
+ *
+ * @q: Pointer to the queue to examine
+ * @dcnt: Max number of entries to check
+ *
+ * Returns the number of entries available in the queue, up to dcnt
+ **/
+static inline int xskq_nb_avail(struct xsk_queue *q, int dcnt)
+{
+	unsigned int idx, last_avail_idx = q->last_avail_idx;
+	int i, entries = 0;
+
+	for (i = 0; i < dcnt; i++) {
+		idx = (last_avail_idx++) & q->ring_mask;
+		if (!(q->ring[idx].flags & XDP_DESC_KERNEL))
+			break;
+		entries++;
+	}
+
+	return entries;
+}
+
+/**
+ * xskq_enqueue - Enqueue entries to the queue
+ *
+ * @q: Pointer to the queue to enqueue entries to
+ * @d: Pointer to the descriptors to enqueue
+ * @dcnt: Number of entries to enqueue
+ *
+ * Returns 0 for success or a negative errno on failure
+ **/
+static inline int xskq_enqueue(struct xsk_queue *q,
+			       const struct xdp_desc *d, int dcnt)
+{
+	unsigned int used_idx = q->used_idx;
+	int i;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	q->num_free -= dcnt;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int idx = (used_idx++) & q->ring_mask;
+
+		q->ring[idx].idx = d[i].idx;
+		q->ring[idx].len = d[i].len;
+		q->ring[idx].offset = d[i].offset;
+		q->ring[idx].error = d[i].error;
+	}
+
+	/* Order flags and data */
+	smp_wmb();
+
+	for (i = dcnt - 1; i >= 0; i--) {
+		unsigned int idx = (q->used_idx + i) & q->ring_mask;
+
+		q->ring[idx].flags = d[i].flags & ~XDP_DESC_KERNEL;
+	}
+	q->used_idx += dcnt;
+
+	return 0;
+}
+
+/**
+ * xskq_enqueue_from_array - Enqueue entries from packet array to the queue
+ *
+ * @a: Pointer to the packet array to enqueue from
+ * @dcnt: Max number of entries to enqueue
+ *
+ * Returns 0 for success or a negative errno on failure
+ **/
+static inline int xskq_enqueue_from_array(struct xsk_packet_array *a,
+					  u32 dcnt)
+{
+	struct xsk_queue *q = (struct xsk_queue *)a->q_ops;
+	unsigned int used_idx = q->used_idx;
+	struct xdp_desc *d = a->items;
+	int i;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	q->num_free -= dcnt;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int idx = (used_idx++) & q->ring_mask;
+		unsigned int didx = (a->start + i) & a->mask;
+
+		q->ring[idx].idx = d[didx].idx;
+		q->ring[idx].len = d[didx].len;
+		q->ring[idx].offset = d[didx].offset;
+		q->ring[idx].error = d[didx].error;
+	}
+
+	/* Order flags and data */
+	smp_wmb();
+
+	for (i = dcnt - 1; i >= 0; i--) {
+		unsigned int idx = (q->used_idx + i) & q->ring_mask;
+		unsigned int didx = (a->start + i) & a->mask;
+
+		q->ring[idx].flags = d[didx].flags & ~XDP_DESC_KERNEL;
+	}
+	q->used_idx += dcnt;
+
+	return 0;
+}
+
+/**
+ * xskq_enqueue_completed_from_array - Enqueue only completed entries
+ *				       from packet array
+ *
+ * @a: Pointer to the packet array to enqueue from
+ * @dcnt: Max number of entries to enqueue
+ *
+ * Returns the number of entries successfully enqueued or a negative errno
+ * on failure.
+ **/
+static inline int xskq_enqueue_completed_from_array(struct xsk_packet_array *a,
+						    u32 dcnt)
+{
+	struct xsk_queue *q = (struct xsk_queue *)a->q_ops;
+	unsigned int used_idx = q->used_idx;
+	struct xdp_desc *d = a->items;
+	int i, j;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int didx = (a->start + i) & a->mask;
+
+		if (d[didx].flags & XSK_FRAME_COMPLETED) {
+			unsigned int idx = (used_idx++) & q->ring_mask;
+
+			q->ring[idx].idx = d[didx].idx;
+			q->ring[idx].len = d[didx].len;
+			q->ring[idx].offset = d[didx].offset;
+			q->ring[idx].error = d[didx].error;
+		} else {
+			break;
+		}
+	}
+
+	if (i == 0)
+		return 0;
+
+	/* Order flags and data */
+	smp_wmb();
+
+	for (j = i - 1; j >= 0; j--) {
+		unsigned int idx = (q->used_idx + j) & q->ring_mask;
+		unsigned int didx = (a->start + j) & a->mask;
+
+		q->ring[idx].flags = d[didx].flags & ~XDP_DESC_KERNEL;
+	}
+	q->num_free -= i;
+	q->used_idx += i;
+
+	return i;
+}
+
+/**
+ * xskq_dequeue_to_array - Dequeue entries from the queue to a packet array
+ *
+ * @a: Pointer to the packet array to dequeue entries into
+ * @dcnt: Max number of entries to dequeue
+ *
+ * Returns the number of entries dequeued. Invalid entries will be
+ * discarded.
+ **/
+static inline int xskq_dequeue_to_array(struct xsk_packet_array *a, u32 dcnt)
+{
+	struct xdp_desc *d = a->items;
+	int i, entries, valid_entries = 0;
+	struct xsk_queue *q = (struct xsk_queue *)a->q_ops;
+	u32 start = a->end;
+
+	entries = xskq_nb_avail(q, dcnt);
+	q->num_free += entries;
+
+	/* Order flags and data */
+	smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		unsigned int d_idx = start & a->mask;
+		unsigned int idx;
+
+		idx = (q->last_avail_idx++) & q->ring_mask;
+		d[d_idx] = q->ring[idx];
+		if (!xskq_is_valid_entry(q, &d[d_idx])) {
+			WARN_ON_ONCE(xskq_enqueue(q, &d[d_idx], 1));
+			continue;
+		}
+
+		start++;
+		valid_entries++;
+	}
+	return valid_entries;
+}
+
+static inline u32 xskq_get_ring_size(struct xsk_queue *q)
+{
+	return q->nentries * sizeof(*q->ring);
+}
+
+static inline char *xskq_get_ring_address(struct xsk_queue *q)
+{
+	return (char *)q->ring;
+}
+
+static inline void xskq_set_buff_info(struct xsk_queue *q,
+				      struct xsk_buff_info *buff_info,
+				      enum xsk_validation validation)
+{
+	q->buff_info = buff_info;
+	q->validation = validation;
+}
+
+struct xsk_queue *xskq_create(u32 nentries);
+void xskq_destroy(struct xsk_queue *q_ops);
+
+#endif /* _LINUX_XDP_RING_H */
diff --git a/net/xdp/xsk_user_queue.h b/net/xdp/xsk_user_queue.h
new file mode 100644
index 000000000000..c072f854d693
--- /dev/null
+++ b/net/xdp/xsk_user_queue.h
@@ -0,0 +1,24 @@
+#ifndef XSK_USER_QUEUE_H_
+#define XSK_USER_QUEUE_H_
+
+#define XDP_KERNEL_HEADROOM 256 /* Headroom for XDP */
+
+#define XSK_FRAME_COMPLETED XDP_DESC_KERNEL
+
+enum xsk_validation {
+	XSK_VALIDATION_NONE,	  /* No validation is performed */
+	XSK_VALIDATION_RX,	  /* Only address to packet buffer validated */
+	XSK_VALIDATION_TX	  /* Full descriptor is validated */
+};
+
+struct xsk_packet_array;
+
+struct xsk_user_queue {
+	int (*enqueue)(struct xsk_packet_array *pa, u32 cnt);
+	int (*enqueue_completed)(struct xsk_packet_array *pa, u32 cnt);
+	int (*dequeue)(struct xsk_packet_array *pa, u32 cnt);
+	u32 (*get_ring_size)(struct xsk_user_queue *q);
+	char *(*get_ring_address)(struct xsk_user_queue *q);
+};
+
+#endif /* XSK_USER_QUEUE_H_ */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 04/24] xsk: add bind support and introduce Rx functionality
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (2 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 03/24] xsk: added XDP_{R,T}X_RING sockopt and supporting structures Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect Björn Töpel
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here the bind syscall is implemented. Also, two frame receive
functions are introduced: xsk_rcv and xsk_generic_rcv. The latter is
used for the XDP_SKB path, and the former for the XDP_DRV path.

Later commits will wire up the receive functions.
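
For orientation, a minimal sketch of how a caller in the XDP_DRV path
could use the two exported functions. The function name below is made
up; the real wiring, which batches the flush at the end of the napi
poll, is added in a later patch of this series:

static int redirect_to_xsk(struct xdp_buff *xdp)
{
	struct xdp_sock *xsk;

	/* Passing NULL makes xsk_rcv() look up the socket bound to
	 * the device and queue id found in xdp->rxq.
	 */
	xsk = xsk_rcv(NULL, xdp);
	if (IS_ERR(xsk))
		return PTR_ERR(xsk);

	/* Make the received frame visible to user space. */
	xsk_flush(xsk);
	return 0;
}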

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h |   3 +
 net/xdp/xsk.c             | 211 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 210 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4c77f39ebd65..36cc7e92bd8e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -688,6 +688,9 @@ struct netdev_rx_queue {
 	struct kobject			kobj;
 	struct net_device		*dev;
 	struct xdp_rxq_info		xdp_rxq;
+#ifdef CONFIG_XDP_SOCKETS
+	struct xdp_sock __rcu           *xs;
+#endif
 } ____cacheline_aligned_in_smp;
 
 /*
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 34294ac2f75f..db918e31079b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -34,8 +34,11 @@
 #include "xsk_ring.h"
 
 #define XSK_UMEM_MIN_FRAME_SIZE 2048
+#define XSK_ARRAY_SIZE 512
 
 struct xsk_info {
+	struct xsk_packet_array *pa;
+	spinlock_t pa_lock;
 	struct xsk_queue *q;
 	struct xsk_umem *umem;
 	struct socket *mrsock;
@@ -47,7 +50,10 @@ struct xdp_sock {
 	struct sock sk;
 	struct xsk_info rx;
 	struct xsk_info tx;
+	struct net_device *dev;
 	struct xsk_umem *umem;
+	u32 ifindex;
+	u16 queue_id;
 };
 
 static struct xdp_sock *xdp_sk(struct sock *sk)
@@ -330,9 +336,21 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
-	xsk_umem_destroy(xs->umem);
-	xskq_destroy(xs->rx.q);
-	xskq_destroy(xs->tx.q);
+	if (xs->dev) {
+		struct xdp_sock *xs_prev;
+
+		xs_prev = xs->dev->_rx[xs->queue_id].xs;
+		rcu_assign_pointer(xs->dev->_rx[xs->queue_id].xs, NULL);
+
+		/* Wait for driver to stop using the xdp socket. */
+		synchronize_net();
+
+		xskpa_destroy(xs->rx.pa);
+		xsk_umem_destroy(xs_prev->umem);
+		xskq_destroy(xs_prev->rx.q);
+		kobject_put(&xs_prev->dev->_rx[xs->queue_id].kobj);
+		dev_put(xs_prev->dev);
+	}
 
 	sock_orphan(sk);
 	sock->sk = NULL;
@@ -345,8 +363,193 @@ static int xsk_release(struct socket *sock)
 
 static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 {
-	return -EOPNOTSUPP;
+	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev_curr;
+	struct net_device *dev;
+	int err = 0;
+
+	if (addr_len < sizeof(struct sockaddr_xdp))
+		return -EINVAL;
+	if (sxdp->sxdp_family != AF_XDP)
+		return -EINVAL;
+
+	lock_sock(sk);
+	dev_curr = xs->dev;
+	dev = dev_get_by_index_rcu(sock_net(sk), sxdp->sxdp_ifindex);
+	if (!dev) {
+		err = -ENODEV;
+		goto out_unlock;
+	}
+	dev_hold(dev);
+
+	if (dev_curr && dev_curr != dev) {
+		/* XXX Needs rebind code here */
+		err = -EBUSY;
+		goto out_unlock;
+	}
+
+	if (!xs->rx.q || !xs->tx.q) {
+		/* XXX For now require Tx and Rx */
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_queue_id > dev->num_rx_queues) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+	kobject_get(&dev->_rx[sxdp->sxdp_queue_id].kobj);
+
+	xs->dev = dev;
+	xs->ifindex = sxdp->sxdp_ifindex;
+	xs->queue_id = sxdp->sxdp_queue_id;
+	spin_lock_init(&xs->rx.pa_lock);
+
+	/* Rx */
+	xs->rx.buff_info = xsk_buff_info_create(xs->rx.umem);
+	if (!xs->rx.buff_info) {
+		err = -ENOMEM;
+		goto out_unlock;
+	}
+	xskq_set_buff_info(xs->rx.q, xs->rx.buff_info, XSK_VALIDATION_RX);
+
+	/* Rx packet array is used for copy semantics... */
+	xs->rx.pa = xskpa_create((struct xsk_user_queue *)xs->rx.q,
+				 xs->rx.buff_info, XSK_ARRAY_SIZE);
+	if (!xs->rx.pa) {
+		err = -ENOMEM;
+		goto out_rx_pa;
+	}
+
+	rcu_assign_pointer(dev->_rx[sxdp->sxdp_queue_id].xs, xs);
+
+	goto out_unlock;
+
+out_rx_pa:
+	xsk_buff_info_destroy(xs->rx.buff_info);
+	xs->rx.buff_info = NULL;
+out_unlock:
+	if (err)
+		dev_put(dev);
+	release_sock(sk);
+	if (dev_curr)
+		dev_put(dev_curr);
+	return err;
+}
+
+static inline struct xdp_sock *lookup_xsk(struct net_device *dev,
+					  unsigned int queue_id)
+{
+	if (unlikely(queue_id > dev->num_rx_queues))
+		return NULL;
+
+	return rcu_dereference(dev->_rx[queue_id].xs);
+}
+
+int xsk_generic_rcv(struct xdp_buff *xdp)
+{
+	u32 len = xdp->data_end - xdp->data;
+	struct xsk_frame_set p;
+	struct xdp_sock *xsk;
+	bool ok;
+
+	rcu_read_lock();
+	xsk = lookup_xsk(xdp->rxq->dev, xdp->rxq->queue_index);
+	if (unlikely(!xsk)) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	spin_lock(&xsk->rx.pa_lock);
+	ok = xskpa_next_frame_populate(xsk->rx.pa, &p);
+	spin_unlock(&xsk->rx.pa_lock);
+
+	if (!ok) {
+		rcu_read_unlock();
+		return -ENOSPC;
+	}
+
+	memcpy(xskf_get_data(&p), xdp->data, len);
+	xskf_set_frame_no_offset(&p, len, true);
+	spin_lock(&xsk->rx.pa_lock);
+	xskpa_flush(xsk->rx.pa);
+	spin_unlock(&xsk->rx.pa_lock);
+	rcu_read_unlock();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xsk_generic_rcv);
+
+struct xdp_sock *xsk_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp)
+{
+	u32 len = xdp->data_end - xdp->data;
+	struct xsk_frame_set p;
+
+	rcu_read_lock();
+	if (!xsk)
+		xsk = lookup_xsk(xdp->rxq->dev, xdp->rxq->queue_index);
+	if (unlikely(!xsk)) {
+		rcu_read_unlock();
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (!xskpa_next_frame_populate(xsk->rx.pa, &p)) {
+		rcu_read_unlock();
+		return ERR_PTR(-ENOSPC);
+	}
+
+	memcpy(xskf_get_data(&p), xdp->data, len);
+	xskf_set_frame_no_offset(&p, len, true);
+	rcu_read_unlock();
+
+	/* We assume that the semantic of xdp_do_redirect is such that
+	 * ndo_xdp_xmit will decrease the refcount of the page when it
+	 * is done with the page. Thus, if we want to guarantee the
+	 * existence of the page in the calling driver, we need to
+	 * bump the refcount. Unclear what the correct semantic is
+	 * supposed to be.
+	 */
+	page_frag_free(xdp->data);
+
+	return xsk;
+}
+EXPORT_SYMBOL_GPL(xsk_rcv);
+
+int xsk_zc_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp)
+{
+	u32 offset = xdp->data - xdp->data_hard_start;
+	u32 len = xdp->data_end - xdp->data;
+	struct xsk_frame_set p;
+
+	/* We do not need any locking here since we are guaranteed
+	 * a single producer and a single consumer.
+	 */
+	if (xskpa_next_frame_populate(xsk->rx.pa, &p)) {
+		xskf_set_frame(&p, len, offset, true);
+		return 0;
+	}
+
+	/* No user-space buffer to put the packet in. */
+	return -ENOSPC;
+}
+EXPORT_SYMBOL_GPL(xsk_zc_rcv);
+
+void xsk_flush(struct xdp_sock *xsk)
+{
+	rcu_read_lock();
+	if (!xsk)
+		xsk = lookup_xsk(xsk->dev, xsk->queue_id);
+	if (unlikely(!xsk)) {
+		rcu_read_unlock();
+		return;
+	}
+
+	WARN_ON_ONCE(xskpa_flush(xsk->rx.pa));
+	rcu_read_unlock();
 }
+EXPORT_SYMBOL_GPL(xsk_flush);
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (3 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 04/24] xsk: add bind support and introduce Rx functionality Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-02-05 13:42   ` Jesper Dangaard Brouer
  2018-01-31 13:53 ` [RFC PATCH 06/24] net: wire up xsk support in the XDP_REDIRECT path Björn Töpel
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The bpf_xdpsk_redirect call redirects the XDP context to the XDP
socket bound to the receiving queue (if any).
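
A minimal XDP program using the helper looks as follows (the same
one-liner is added as a sample later in the series):

#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* Send every frame on this queue to the bound AF_XDP socket. */
	return bpf_xdpsk_redirect();
}

char _license[] SEC("license") = "GPL";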

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/bpf.h                  |  6 +++++-
 net/core/filter.c                         | 24 ++++++++++++++++++++++++
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc375126..18f9ee7cb529 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -696,6 +696,9 @@ union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_xdpsk_redirect()
+ *	Return: XDP_REDIRECT
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -757,7 +760,8 @@ union bpf_attr {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(xdpsk_redirect),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 08ab4c65a998..aedf57489cb5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1809,6 +1809,8 @@ struct redirect_info {
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
 	unsigned long   map_owner;
+	bool to_xsk;
+	/* XXX cache xsk socket here, to avoid lookup? */
 };
 
 static DEFINE_PER_CPU(struct redirect_info, redirect_info);
@@ -2707,6 +2709,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 	ri->ifindex = 0;
 	ri->map = NULL;
 	ri->map_owner = 0;
+	ri->to_xsk = false;
 
 	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
 		err = -EFAULT;
@@ -2817,6 +2820,25 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_0(bpf_xdpsk_redirect)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	/* XXX Would it be better to check for socket existence here,
+	 * and XDP_ABORTED on failure? Also, then we can populate xsk
+	 * in ri, and don't have to do the lookup multiple times.
+	 */
+	ri->to_xsk = true;
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdpsk_redirect_proto = {
+	.func		= bpf_xdpsk_redirect,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+};
+
 bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -3544,6 +3566,8 @@ xdp_func_proto(enum bpf_func_id func_id)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_xdpsk_redirect:
+		return &bpf_xdpsk_redirect_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index dde2c11d7771..5898ad7a8e40 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -86,6 +86,8 @@ static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
 	(void *) BPF_FUNC_perf_prog_read_value;
 static int (*bpf_override_return)(void *ctx, unsigned long rc) =
 	(void *) BPF_FUNC_override_return;
+static int (*bpf_xdpsk_redirect)(void) =
+	(void *) BPF_FUNC_xdpsk_redirect;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 06/24] net: wire up xsk support in the XDP_REDIRECT path
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (4 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 07/24] xsk: introduce Tx functionality Björn Töpel
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

In this commit we add support for XDP programs to redirect frames to a
bound AF_XDP socket.
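
Roughly, the resulting flow in a driver's napi poll looks like the
sketch below; the surrounding function is illustrative only, and the
actual i40e wiring comes in a later patch:

static void rx_poll_sketch(struct net_device *netdev,
			   struct bpf_prog *xdp_prog, struct xdp_buff *xdp)
{
	u32 act = bpf_prog_run_xdp(xdp_prog, xdp);

	if (act == XDP_REDIRECT) {
		/* bpf_xdpsk_redirect() set ri->xsk, so xdp_do_redirect()
		 * hands the frame to xsk_rcv(), which copies it into the
		 * bound socket's Rx packet array. A non-zero return means
		 * the frame should be dropped.
		 */
		xdp_do_redirect(netdev, xdp, xdp_prog);
	}

	/* At the end of the napi poll, flush maps and the xsk socket. */
	xdp_do_flush_map();
}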

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h |  2 +-
 include/net/xdp_sock.h | 28 ++++++++++++++++++++
 net/core/dev.c         | 28 +++++++++++---------
 net/core/filter.c      | 72 ++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 111 insertions(+), 19 deletions(-)
 create mode 100644 include/net/xdp_sock.h

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 276932d75975..43cacfe2cc2a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -747,7 +747,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
  * This does not appear to be a real limitation for existing software.
  */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *prog);
+			    struct xdp_buff *xdp, struct bpf_prog *prog);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
new file mode 100644
index 000000000000..132489fe0e70
--- /dev/null
+++ b/include/net/xdp_sock.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_AF_XDP_SOCK_H
+#define _LINUX_AF_XDP_SOCK_H
+
+struct xdp_sock;
+struct xdp_buff;
+
+#ifdef CONFIG_XDP_SOCKETS
+int xsk_generic_rcv(struct xdp_buff *xdp);
+struct xdp_sock *xsk_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp);
+void xsk_flush(struct xdp_sock *xsk);
+#else
+static inline int xsk_generic_rcv(struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline struct xdp_sock *xsk_rcv(struct xdp_sock *xsk,
+				       struct xdp_buff *xdp)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+
+static inline void xsk_flush(struct xdp_sock *xsk)
+{
+}
+#endif /* CONFIG_XDP_SOCKETS */
+
+#endif /* _LINUX_AF_XDP_SOCK_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index dda9d7b9a840..94d2950fc33d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3951,11 +3951,11 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 }
 
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
 				     struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	u32 metalen, act = XDP_DROP;
-	struct xdp_buff xdp;
 	void *orig_data;
 	int hlen, off;
 	u32 mac_len;
@@ -3991,18 +3991,18 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
 	hlen = skb_headlen(skb) + mac_len;
-	xdp.data = skb->data - mac_len;
-	xdp.data_meta = xdp.data;
-	xdp.data_end = xdp.data + hlen;
-	xdp.data_hard_start = skb->data - skb_headroom(skb);
-	orig_data = xdp.data;
+	xdp->data = skb->data - mac_len;
+	xdp->data_meta = xdp->data;
+	xdp->data_end = xdp->data + hlen;
+	xdp->data_hard_start = skb->data - skb_headroom(skb);
+	orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
-	xdp.rxq = &rxqueue->xdp_rxq;
+	xdp->rxq = &rxqueue->xdp_rxq;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	off = xdp.data - orig_data;
+	off = xdp->data - orig_data;
 	if (off > 0)
 		__skb_pull(skb, off);
 	else if (off < 0)
@@ -4015,7 +4015,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, mac_len);
 		break;
 	case XDP_PASS:
-		metalen = xdp.data - xdp.data_meta;
+		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);
 		break;
@@ -4065,17 +4065,19 @@ static struct static_key generic_xdp_needed __read_mostly;
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 {
 	if (xdp_prog) {
-		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		struct xdp_buff xdp;
+		u32 act;
 		int err;
 
+		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
 				err = xdp_do_generic_redirect(skb->dev, skb,
-							      xdp_prog);
+							      &xdp, xdp_prog);
 				if (err)
 					goto out_redir;
-			/* fallthru to submit skb */
+				break;
 			case XDP_TX:
 				generic_xdp_tx(skb, xdp_prog);
 				break;
diff --git a/net/core/filter.c b/net/core/filter.c
index aedf57489cb5..eab47173bc9e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
 #include <net/busy_poll.h>
 #include <net/tcp.h>
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -1809,8 +1810,8 @@ struct redirect_info {
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
 	unsigned long   map_owner;
-	bool to_xsk;
-	/* XXX cache xsk socket here, to avoid lookup? */
+	bool xsk;
+	struct xdp_sock *xsk_to_flush;
 };
 
 static DEFINE_PER_CPU(struct redirect_info, redirect_info);
@@ -2575,6 +2576,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 void xdp_do_flush_map(void)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct xdp_sock *xsk = ri->xsk_to_flush;
 	struct bpf_map *map = ri->map_to_flush;
 
 	ri->map_to_flush = NULL;
@@ -2590,6 +2592,10 @@ void xdp_do_flush_map(void)
 			break;
 		}
 	}
+
+	ri->xsk_to_flush = NULL;
+	if (xsk)
+		xsk_flush(xsk);
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush_map);
 
@@ -2611,6 +2617,29 @@ static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
 	return (unsigned long)xdp_prog->aux != aux;
 }
 
+static int xdp_do_xsk_redirect(struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct xdp_sock *xsk;
+
+	ri->ifindex = 0;
+	ri->map = NULL;
+	ri->map_owner = 0;
+	ri->xsk = false;
+
+	xsk = xsk_rcv(ri->xsk_to_flush, xdp);
+	if (IS_ERR(xsk)) {
+		_trace_xdp_redirect_err(xdp->rxq->dev, xdp_prog, -1,
+					PTR_ERR(xsk));
+		return PTR_ERR(xsk);
+	}
+
+	ri->xsk_to_flush = xsk;
+	_trace_xdp_redirect(xdp->rxq->dev, xdp_prog, -1);
+
+	return 0;
+}
+
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 			       struct bpf_prog *xdp_prog)
 {
@@ -2624,6 +2653,7 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 	ri->ifindex = 0;
 	ri->map = NULL;
 	ri->map_owner = 0;
+	ri->xsk = false;
 
 	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
 		err = -EFAULT;
@@ -2659,6 +2689,9 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	u32 index = ri->ifindex;
 	int err;
 
+	if (ri->xsk)
+		return xdp_do_xsk_redirect(xdp, xdp_prog);
+
 	if (ri->map)
 		return xdp_do_redirect_map(dev, xdp, xdp_prog);
 
@@ -2681,6 +2714,30 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int xdp_do_generic_xsk_redirect(struct sk_buff *skb,
+				       struct xdp_buff *xdp,
+				       struct bpf_prog *xdp_prog)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	int err;
+
+	ri->ifindex = 0;
+	ri->map = NULL;
+	ri->map_owner = 0;
+	ri->xsk = false;
+
+	err = xsk_generic_rcv(xdp);
+	if (err) {
+		_trace_xdp_redirect_err(xdp->rxq->dev, xdp_prog, -1, err);
+		return err;
+	}
+
+	consume_skb(skb);
+	_trace_xdp_redirect(xdp->rxq->dev, xdp_prog, -1);  /* XXX fix tracing to support xsk */
+
+	return 0;
+}
+
 static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 {
 	unsigned int len;
@@ -2709,7 +2766,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 	ri->ifindex = 0;
 	ri->map = NULL;
 	ri->map_owner = 0;
-	ri->to_xsk = false;
+	ri->xsk = false;
 
 	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
 		err = -EFAULT;
@@ -2733,6 +2790,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 	}
 
 	_trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+	generic_xdp_tx(skb, xdp_prog);
 	return 0;
 err:
 	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
@@ -2740,13 +2798,16 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 }
 
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *xdp_prog)
+			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	u32 index = ri->ifindex;
 	struct net_device *fwd;
 	int err = 0;
 
+	if (ri->xsk)
+		return xdp_do_generic_xsk_redirect(skb, xdp, xdp_prog);
+
 	if (ri->map)
 		return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
 
@@ -2762,6 +2823,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 	_trace_xdp_redirect(dev, xdp_prog, index);
+	generic_xdp_tx(skb, xdp_prog);
 	return 0;
 err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
@@ -2828,7 +2890,7 @@ BPF_CALL_0(bpf_xdpsk_redirect)
 	 * and XDP_ABORTED on failure? Also, then we can populate xsk
 	 * in ri, and don't have to do the lookup multiple times.
 	 */
-	ri->to_xsk = true;
+	ri->xsk = true;
 
 	return XDP_REDIRECT;
 }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 07/24] xsk: introduce Tx functionality
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (5 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 06/24] net: wire up xsk support in the XDP_REDIRECT path Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 08/24] i40e: add support for XDP_REDIRECT Björn Töpel
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

The xsk socket can now send frames, in addition to receiving them.
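
From user space, transmission is kicked with a plain sendmsg() once
descriptors have been placed on the Tx ring. A minimal sketch follows;
blocking operation is not implemented yet, so MSG_DONTWAIT is required:

#include <sys/socket.h>

/* Tell the kernel to start transmitting the descriptors that are
 * already on the Tx ring of the AF_XDP socket behind xsk_fd.
 */
static int kick_tx(int xsk_fd)
{
	struct msghdr msg = {};

	return sendmsg(xsk_fd, &msg, MSG_DONTWAIT);
}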

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk.c | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 190 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index db918e31079b..f372c3288301 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -52,6 +52,8 @@ struct xdp_sock {
 	struct xsk_info tx;
 	struct net_device *dev;
 	struct xsk_umem *umem;
+	/* Protects multiple processes from entering sendmsg */
+	struct mutex tx_mutex;
 	u32 ifindex;
 	u16 queue_id;
 };
@@ -346,8 +348,10 @@ static int xsk_release(struct socket *sock)
 		synchronize_net();
 
 		xskpa_destroy(xs->rx.pa);
+		xskpa_destroy(xs->tx.pa);
 		xsk_umem_destroy(xs_prev->umem);
 		xskq_destroy(xs_prev->rx.q);
+		xskq_destroy(xs_prev->tx.q);
 		kobject_put(&xs_prev->dev->_rx[xs->queue_id].kobj);
 		dev_put(xs_prev->dev);
 	}
@@ -406,6 +410,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->ifindex = sxdp->sxdp_ifindex;
 	xs->queue_id = sxdp->sxdp_queue_id;
 	spin_lock_init(&xs->rx.pa_lock);
+	spin_lock_init(&xs->tx.pa_lock);
+	mutex_init(&xs->tx_mutex);
 
 	/* Rx */
 	xs->rx.buff_info = xsk_buff_info_create(xs->rx.umem);
@@ -423,10 +429,31 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_rx_pa;
 	}
 
+	/* Tx */
+	xs->tx.buff_info = xsk_buff_info_create(xs->tx.umem);
+	if (!xs->tx.buff_info) {
+		err = -ENOMEM;
+		goto out_tx_bi;
+	}
+	xskq_set_buff_info(xs->tx.q, xs->tx.buff_info, XSK_VALIDATION_TX);
+
+	xs->tx.pa = xskpa_create((struct xsk_user_queue *)xs->tx.q,
+				 xs->tx.buff_info, XSK_ARRAY_SIZE);
+	if (!xs->tx.pa) {
+		err = -ENOMEM;
+		goto out_tx_pa;
+	}
+
 	rcu_assign_pointer(dev->_rx[sxdp->sxdp_queue_id].xs, xs);
 
 	goto out_unlock;
 
+out_tx_pa:
+	xsk_buff_info_destroy(xs->tx.buff_info);
+	xs->tx.buff_info = NULL;
+out_tx_bi:
+	xskpa_destroy(xs->rx.pa);
+	xs->rx.pa = NULL;
 out_rx_pa:
 	xsk_buff_info_destroy(xs->rx.buff_info);
 	xs->rx.buff_info = NULL;
@@ -621,9 +648,171 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 	return -EOPNOTSUPP;
 }
 
+void xsk_tx_completion(struct net_device *dev, u16 queue_index,
+		       unsigned int npackets)
+{
+	unsigned long flags;
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	xs = lookup_xsk(dev, queue_index);
+	if (unlikely(!xs)) {
+		rcu_read_unlock();
+		return;
+	}
+
+	spin_lock_irqsave(&xs->tx.pa_lock, flags);
+	WARN_ON_ONCE(xskpa_flush_n(xs->tx.pa, npackets));
+	spin_unlock_irqrestore(&xs->tx.pa_lock, flags);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(xsk_tx_completion);
+
+static void xsk_destruct_skb(struct sk_buff *skb)
+{
+	u64 idx = (u64)skb_shinfo(skb)->destructor_arg;
+	struct xsk_frame_set p = {.start = idx,
+				  .curr = idx,
+				  .end = idx + 1};
+	struct xdp_sock *xs;
+	unsigned long flags;
+
+	rcu_read_lock();
+	xs = lookup_xsk(skb->dev, skb_get_queue_mapping(skb));
+	if (unlikely(!xs)) {
+		rcu_read_unlock();
+		return;
+	}
+
+	p.pkt_arr = xs->tx.pa;
+	xskf_packet_completed(&p);
+	spin_lock_irqsave(&xs->tx.pa_lock, flags);
+	WARN_ON_ONCE(xskpa_flush_completed(xs->tx.pa));
+	spin_unlock_irqrestore(&xs->tx.pa_lock, flags);
+	rcu_read_unlock();
+
+	sock_wfree(skb);
+}
+
+static int xsk_xmit_skb(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	struct sk_buff *orig_skb = skb;
+	struct netdev_queue *txq;
+	int ret = NETDEV_TX_BUSY;
+	bool again = false;
+
+	if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
+		goto drop;
+
+	skb = validate_xmit_skb_list(skb, dev, &again);
+	if (skb != orig_skb)
+		return NET_XMIT_DROP;
+
+	txq = skb_get_tx_queue(dev, skb);
+
+	local_bh_disable();
+
+	HARD_TX_LOCK(dev, txq, smp_processor_id());
+	if (!netif_xmit_frozen_or_drv_stopped(txq))
+		ret = netdev_start_xmit(skb, dev, txq, false);
+	HARD_TX_UNLOCK(dev, txq);
+
+	local_bh_enable();
+
+	if (!dev_xmit_complete(ret))
+		goto out_err;
+
+	return ret;
+drop:
+	atomic_long_inc(&dev->tx_dropped);
+out_err:
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+}
+
+static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
+			    size_t total_len)
+{
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct xsk_frame_set p;
+	struct sk_buff *skb;
+	unsigned long flags;
+	int err = 0;
+
+	if (need_wait)
+		/* Not implemented yet. */
+		return -EINVAL;
+
+	mutex_lock(&xs->tx_mutex);
+	spin_lock_irqsave(&xs->tx.pa_lock, flags);
+	xskpa_populate(xs->tx.pa);
+	spin_unlock_irqrestore(&xs->tx.pa_lock, flags);
+
+	while (xskpa_next_packet(xs->tx.pa, &p)) {
+		u32 len = xskf_get_packet_len(&p);
+
+		if (unlikely(len > xs->dev->mtu)) {
+			err = -EMSGSIZE;
+			goto out_err;
+		}
+
+		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		if (unlikely(!skb)) {
+			err = -EAGAIN;
+			goto out_err;
+		}
+
+		/* XXX Use fragments for the data here */
+		skb_put(skb, len);
+		err = skb_store_bits(skb, 0, xskf_get_data(&p), len);
+		if (unlikely(err))
+			goto out_skb;
+
+		skb->dev = xs->dev;
+		skb->priority = sk->sk_priority;
+		skb->mark = sk->sk_mark;
+		skb_set_queue_mapping(skb, xs->queue_id);
+		skb_shinfo(skb)->destructor_arg =
+			(void *)(long)xskf_get_frame_id(&p);
+		skb->destructor = xsk_destruct_skb;
+
+		err = xsk_xmit_skb(skb);
+		/* Ignore NET_XMIT_CN as packet might have been sent */
+		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
+			err = -EAGAIN;
+			break;
+		}
+	}
+
+	mutex_unlock(&xs->tx_mutex);
+	return err;
+
+out_skb:
+	kfree_skb(skb);
+out_err:
+	xskf_set_error(&p, -err);
+	xskf_packet_completed(&p);
+	spin_lock_irqsave(&xs->tx.pa_lock, flags);
+	WARN_ON_ONCE(xskpa_flush_completed(xs->tx.pa));
+	spin_unlock_irqrestore(&xs->tx.pa_lock, flags);
+	mutex_unlock(&xs->tx_mutex);
+
+	return err;
+}
+
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
-	return -EOPNOTSUPP;
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (unlikely(!xs->dev))
+		return -ENXIO;
+	if (unlikely(!(xs->dev->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	return xsk_generic_xmit(sk, m, total_len);
 }
 
 static int xsk_mmap(struct file *file, struct socket *sock,
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 08/24] i40e: add support for XDP_REDIRECT
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (6 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 07/24] xsk: introduce Tx functionality Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 09/24] samples/bpf: added xdpsock program Björn Töpel
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The driver now acts upon the XDP_REDIRECT return action. Two new ndos
are implemented, ndo_xdp_xmit and ndo_xdp_flush.

The XDP_REDIRECT action enables an XDP program to redirect frames to other
netdevs. The target redirect/forward netdev might release the XDP data
page within the ndo_xdp_xmit function (triggered by xdp_do_redirect),
which meant that the i40e page count logic had to be tweaked.

An example: i40e_clean_rx_irq is entered, and one rx_buffer is pulled
from the hardware descriptor ring. Say that the actual page refcount
is 1. XDP is enabled, and the redirect action is triggered. The target
netdev ndo_xdp_xmit decreases the page refcount, resulting in the page
being freed. The prior assumption was that the function owned the page
until i40e_put_rx_buffer was called, increasing the refcount again.

Now, we don't allow a refcount less than 2. Another option would be
calling xdp_do_redirect *after* i40e_put_rx_buffer, but that would
require additional conditionals.
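
To illustrate the accounting (simplified; the numbers follow the change
to i40e_alloc_mapped_page below):

/* Before: alloc_page()                 refcount = 1, pagecnt_bias = 1
 *         redirect target frees page   refcount = 0, page is freed
 *                                      while the driver still uses it
 *
 * After:  alloc_page()                 refcount = 1
 *         page_ref_add(USHRT_MAX - 1)  refcount = USHRT_MAX,
 *                                      pagecnt_bias = USHRT_MAX
 *         redirect target frees page   refcount = USHRT_MAX - 1, the
 *                                      page survives; both counters are
 *                                      replenished once pagecnt_bias
 *                                      drops to 1
 */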

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  2 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 86 ++++++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  3 +
 3 files changed, 76 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f95ce9b5e4fb..09efb9dd09f3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11687,6 +11687,8 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bridge_getlink	= i40e_ndo_bridge_getlink,
 	.ndo_bridge_setlink	= i40e_ndo_bridge_setlink,
 	.ndo_bpf		= i40e_xdp,
+	.ndo_xdp_xmit		= i40e_xdp_xmit,
+	.ndo_xdp_flush		= i40e_xdp_flush,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf070..f0feae92a34a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1435,8 +1435,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
 	bi->page = page;
 	bi->page_offset = i40e_rx_offset(rx_ring);
 
-	/* initialize pagecnt_bias to 1 representing we fully own page */
-	bi->pagecnt_bias = 1;
+	page_ref_add(page, USHRT_MAX - 1);
+	bi->pagecnt_bias = USHRT_MAX;
 
 	return true;
 }
@@ -1802,8 +1802,8 @@ static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer *rx_buffer)
 	 * the pagecnt_bias and page count so that we fully restock the
 	 * number of references the driver holds.
 	 */
-	if (unlikely(!pagecnt_bias)) {
-		page_ref_add(page, USHRT_MAX);
+	if (unlikely(pagecnt_bias == 1)) {
+		page_ref_add(page, USHRT_MAX - 1);
 		rx_buffer->pagecnt_bias = USHRT_MAX;
 	}
 
@@ -2061,7 +2061,7 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
 static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 				    struct xdp_buff *xdp)
 {
-	int result = I40E_XDP_PASS;
+	int err, result = I40E_XDP_PASS;
 	struct i40e_ring *xdp_ring;
 	struct bpf_prog *xdp_prog;
 	u32 act;
@@ -2080,6 +2080,13 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
 		result = i40e_xmit_xdp_ring(xdp, xdp_ring);
 		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		if (!err)
+			result = I40E_XDP_TX;
+		else
+			result = I40E_XDP_CONSUMED;
+		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -2115,6 +2122,16 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+
+	writel(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2249,16 +2266,9 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	}
 
 	if (xdp_xmit) {
-		struct i40e_ring *xdp_ring;
-
-		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-
-		/* Force memory writes to complete before letting h/w
-		 * know there are new descriptors to fetch.
-		 */
-		wmb();
-
-		writel(xdp_ring->next_to_use, xdp_ring->tail);
+		i40e_xdp_ring_update_tail(
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index]);
+		xdp_do_flush_map();
 	}
 
 	rx_ring->skb = skb;
@@ -3509,3 +3519,49 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 
 	return i40e_xmit_frame_ring(skb, tx_ring);
 }
+
+/**
+ * i40e_xdp_xmit - Implements ndo_xdp_xmit
+ * @dev: netdev
+ * @xdp: XDP buffer
+ *
+ * Returns Zero if sent, else an error code
+ **/
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	unsigned int queue_index = smp_processor_id();
+	struct i40e_vsi *vsi = np->vsi;
+	int err;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -EINVAL;
+
+	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+	if (err != I40E_XDP_TX)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
+ * i40e_xdp_flush - Implements ndo_xdp_flush
+ * @dev: netdev
+ **/
+void i40e_xdp_flush(struct net_device *dev)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	unsigned int queue_index = smp_processor_id();
+	struct i40e_vsi *vsi = np->vsi;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return;
+
+	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
+		return;
+
+	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 701b708628b0..d149ebb8330c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -505,6 +505,9 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring);
 void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+void i40e_xdp_flush(struct net_device *dev);
+
 
 /**
  * i40e_get_head - Retrieve head from head writeback
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 09/24] samples/bpf: added xdpsock program
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (7 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 08/24] i40e: add support for XDP_REDIRECT Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 10/24] netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf Björn Töpel
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Test program for AF_XDP sockets.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/Makefile        |   4 +
 samples/bpf/xdpsock_kern.c  |  11 +
 samples/bpf/xdpsock_queue.h |  62 +++++
 samples/bpf/xdpsock_user.c  | 642 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 719 insertions(+)
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_queue.h
 create mode 100644 samples/bpf/xdpsock_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 64335bb94f9f..9392335bd386 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -43,6 +43,7 @@ hostprogs-y += xdp_redirect_cpu
 hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -93,6 +94,7 @@ xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -144,6 +146,7 @@ always += xdp_monitor_kern.o
 always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -188,6 +191,7 @@ HOSTLOADLIBES_xdp_redirect_cpu += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..bae0c09b4cd7
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,11 @@
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	return bpf_xdpsk_redirect();
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdpsock_queue.h b/samples/bpf/xdpsock_queue.h
new file mode 100644
index 000000000000..2307105e234b
--- /dev/null
+++ b/samples/bpf/xdpsock_queue.h
@@ -0,0 +1,62 @@
+#ifndef __XDPSOCK_QUEUE_H
+#define __XDPSOCK_QUEUE_H
+
+static inline int xq_enq(struct xdp_queue *q,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	unsigned int avail_idx = q->avail_idx;
+	unsigned int i;
+	int j;
+
+	if (q->num_free < ndescs)
+		return -ENOSPC;
+
+	q->num_free -= ndescs;
+
+	for (i = 0; i < ndescs; i++) {
+		unsigned int idx = avail_idx++ & q->ring_mask;
+
+		q->ring[idx].idx	= descs[i].idx;
+		q->ring[idx].len	= descs[i].len;
+		q->ring[idx].offset	= descs[i].offset;
+		q->ring[idx].error	= 0;
+	}
+	smp_wmb();
+
+	for (j = ndescs - 1; j >= 0; j--) {
+		unsigned int idx = (q->avail_idx + j) & q->ring_mask;
+
+		q->ring[idx].flags = descs[j].flags | XDP_DESC_KERNEL;
+	}
+	q->avail_idx += ndescs;
+
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_queue *q,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	unsigned int idx, last_used_idx = q->last_used_idx;
+	int i, entries = 0;
+
+	for (i = 0; i < ndescs; i++) {
+		idx = (last_used_idx++) & q->ring_mask;
+		if (q->ring[idx].flags & XDP_DESC_KERNEL)
+			break;
+		entries++;
+	}
+	q->num_free += entries;
+
+	smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = q->last_used_idx++ & q->ring_mask;
+		descs[i] = q->ring[idx];
+	}
+
+	return entries;
+}
+
+#endif /* __XDPSOCK_QUEUE_H */
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
new file mode 100644
index 000000000000..65912b095205
--- /dev/null
+++ b/samples/bpf/xdpsock_user.c
@@ -0,0 +1,642 @@
+/*
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/ethernet.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define NUM_BUFFERS 131072
+#define DATA_HEADROOM 0
+#define FRAME_SIZE 2048
+#define NUM_DESCS 1024
+#define BATCH_SIZE 16
+
+#define DEBUG_HEXDUMP 0
+
+static unsigned long rx_npkts;
+static unsigned long tx_npkts;
+static unsigned long start_time;
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static __u32 opt_xdp_flags;
+static const char *opt_if = "";
+static int opt_ifindex;
+static int opt_queue;
+
+struct xdp_umem {
+	char *buffer;
+	size_t size;
+	unsigned int frame_size;
+	unsigned int frame_size_log2;
+	unsigned int nframes;
+	int mr_fd;
+};
+
+struct xdp_queue_pair {
+	struct xdp_queue rx;
+	struct xdp_queue tx;
+	int sfd;
+	struct xdp_umem *umem;
+	__u32 outstanding_tx;
+};
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: " #expr ": errno: %d/\"%s\"\n", __FILE__, __func__, __LINE__, errno, strerror(errno)); \
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+#define log2(x) ((unsigned int)(8 * sizeof(unsigned long long) - __builtin_clzll((x)) - 1))
+
+static const char pkt_data[] =
+	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+#include "xdpsock_queue.h"
+
+static inline void *xq_get_data(struct xdp_queue_pair *q, __u32 idx, __u32 off)
+{
+	if (idx >= q->umem->nframes) {
+		fprintf(stderr, "ERROR idx=%u off=%u\n", (unsigned int)idx, (unsigned int)off);
+		lassert(0);
+	}
+
+	return (__u8 *)(q->umem->buffer + (idx << q->umem->frame_size_log2)
+			+ off);
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+#if DEBUG_HEXDUMP
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("length = %zu\n", length);
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame)
+{
+	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
+	return sizeof(pkt_data) - 1;
+}
+
+static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd, size_t nbuffers)
+{
+	struct xdp_mr_req req = { .frame_size = FRAME_SIZE,
+				  .data_headroom = DATA_HEADROOM };
+	struct xdp_umem *umem;
+	void *bufs;
+	int ret;
+
+	ret = posix_memalign((void **)&bufs, getpagesize(),
+			     nbuffers * req.frame_size);
+	lassert(ret == 0);
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+	req.addr = (unsigned long)bufs;
+	req.len = nbuffers * req.frame_size;
+	ret = setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
+	lassert(ret == 0);
+
+	umem->frame_size = FRAME_SIZE;
+	umem->frame_size_log2 = log2(FRAME_SIZE);
+	umem->buffer = bufs;
+	umem->size = nbuffers * req.frame_size;
+	umem->nframes = nbuffers;
+	umem->mr_fd = sfd;
+
+	if (opt_bench == BENCH_TXONLY) {
+		char *pkt = bufs;
+		int i = 0;
+
+		while (i++ < nbuffers) {
+			(void)gen_eth_frame(pkt);
+			pkt += req.frame_size;
+		}
+	}
+
+	return umem;
+}
+
+static struct xdp_queue_pair *xsk_configure(void)
+{
+	struct xdp_queue_pair *xqp;
+	struct sockaddr_xdp sxdp;
+	struct xdp_ring_req req;
+	int sfd, ret, i;
+
+	sfd = socket(PF_XDP, SOCK_RAW, 0);
+	lassert(sfd >= 0);
+
+	xqp = calloc(1, sizeof(*xqp));
+	lassert(xqp);
+
+	xqp->sfd = sfd;
+	xqp->outstanding_tx = 0;
+
+	xqp->umem = xsk_alloc_and_mem_reg_buffers(sfd, NUM_BUFFERS);
+	lassert(xqp->umem);
+
+	req.mr_fd = xqp->umem->mr_fd;
+	req.desc_nr = NUM_DESCS;
+
+	ret = setsockopt(sfd, SOL_XDP, XDP_RX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+
+	ret = setsockopt(sfd, SOL_XDP, XDP_TX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+
+	/* Rx */
+	xqp->rx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
+			    PROT_READ | PROT_WRITE,
+			    MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sfd,
+			    XDP_PGOFF_RX_RING);
+	lassert(xqp->rx.ring != MAP_FAILED);
+
+	xqp->rx.num_free = req.desc_nr;
+	xqp->rx.ring_mask = req.desc_nr - 1;
+
+	for (i = 0; i < (xqp->rx.ring_mask + 1); i++) {
+		struct xdp_desc desc = {.idx = i};
+
+		ret = xq_enq(&xqp->rx, &desc, 1);
+		lassert(ret == 0);
+	}
+
+	/* Tx */
+	xqp->tx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
+			    PROT_READ | PROT_WRITE,
+			    MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sfd,
+			    XDP_PGOFF_TX_RING);
+	lassert(xqp->tx.ring != MAP_FAILED);
+
+	xqp->tx.num_free = req.desc_nr;
+	xqp->tx.ring_mask = req.desc_nr - 1;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = opt_ifindex;
+	sxdp.sxdp_queue_id = opt_queue;
+
+	ret = bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp));
+	lassert(ret == 0);
+
+	return xqp;
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
+	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
+		printf("xdp-skb ");
+	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
+		printf("xdp-drv ");
+	else
+		printf("	");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void dump_stats(void)
+{
+	unsigned long stop_time = get_nsecs();
+	long dt = stop_time - start_time;
+	double rx_pps = rx_npkts * 1000000000. / dt;
+	double tx_pps = tx_npkts * 1000000000. / dt;
+	char *fmt = "%-15s %'-11.0f %'-11lu\n";
+
+	printf("\n");
+	print_benchmark(false);
+	printf("\n");
+
+	printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts", dt / 1000000000.);
+	printf(fmt, "rx", rx_pps, rx_npkts);
+	printf(fmt, "tx", tx_pps, tx_npkts);
+}
+
+static void *poller(void *arg)
+{
+	(void)arg;
+	for (;;) {
+		dump_stats();
+		sleep(1);
+	}
+
+	return NULL;
+}
+
+static void int_exit(int sig)
+{
+	(void)sig;
+	dump_stats();
+	set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
+	exit(EXIT_SUCCESS);
+}
+
+static struct option long_options[] = {
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"interface", required_argument, 0, 'i'},
+	{"queue", required_argument, 0, 'q'},
+	{"xdp-skb", no_argument, 0, 'S'},
+	{"xdp-native", no_argument, 0, 'N'},
+	{0, 0, 0, 0}
+};
+
+static void usage(const char *prog)
+{
+	const char *str =
+		"  Usage: %s [OPTIONS]\n"
+		"  Options:\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"  -q, --queue=n	Use queue n (default 0)\n"
+		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
+		"  -N, --xdp-native=n	Enfore XDP native mode\n"
+		"\n";
+	fprintf(stderr, str, prog);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "rtli:q:SN", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		case 'q':
+			opt_queue = atoi(optarg);
+			break;
+		case 'S':
+			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
+			break;
+		default:
+			usage(basename(argv[0]));
+		}
+	}
+
+	opt_ifindex = if_nametoindex(opt_if);
+	if (!opt_ifindex) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n", opt_if);
+		usage(basename(argv[0]));
+	}
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	for (;;) {
+		ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+		if (ret >= 0 || errno == ENOBUFS)
+			return;
+		if (errno == EAGAIN)
+			continue;
+		lassert(0);
+	}
+}
+
+static inline void complete_tx_l2fwd(struct xdp_queue_pair *q,
+				     struct xdp_desc *descs)
+{
+	unsigned int rcvd;
+	size_t ndescs;
+	int ret;
+
+	if (!q->outstanding_tx)
+		return;
+
+	ndescs = (q->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		q->outstanding_tx;
+
+	/* re-add completed Tx buffers */
+	rcvd = xq_deq(&q->tx, descs, ndescs);
+	if (rcvd > 0) {
+		/* No error checking on TX completion */
+		ret = xq_enq(&q->rx, descs, rcvd);
+		lassert(ret == 0);
+		q->outstanding_tx -= rcvd;
+		tx_npkts += rcvd;
+	}
+}
+
+static inline void complete_tx_only(struct xdp_queue_pair *q,
+				    struct xdp_desc *descs)
+{
+	unsigned int rcvd;
+	size_t ndescs;
+
+	if (!q->outstanding_tx)
+		return;
+
+	ndescs = (q->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		q->outstanding_tx;
+
+	rcvd = xq_deq(&q->tx, descs, ndescs);
+	if (rcvd > 0) {
+		q->outstanding_tx -= rcvd;
+		tx_npkts += rcvd;
+	}
+}
+
+static void rx_drop(struct xdp_queue_pair *xqp)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			rcvd = xq_deq(&xqp->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			__u32 idx = descs[i].idx;
+
+			lassert(idx < NUM_BUFFERS);
+#if DEBUG_HEXDUMP
+			char *pkt;
+			char buf[32];
+
+			pkt = xq_get_data(xqp, idx, descs[i].offset);
+			sprintf(buf, "idx=%d", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		rx_npkts += rcvd;
+
+		ret = xq_enq(&xqp->rx, descs, rcvd);
+		lassert(ret == 0);
+	}
+}
+
+static void gen_tx_descs(struct xdp_desc *descs, unsigned int idx,
+			 unsigned int ndescs)
+{
+	int i;
+
+	for (i = 0; i < ndescs; i++) {
+		descs[i].idx = idx + i;
+		descs[i].len = sizeof(pkt_data) - 1;
+		descs[i].offset = 0;
+		descs[i].flags = 0;
+	}
+}
+
+static void tx_only(struct xdp_queue_pair *xqp)
+{
+	unsigned int idx = 0;
+
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		int ret;
+
+		if (xqp->tx.num_free >= BATCH_SIZE) {
+			gen_tx_descs(descs, idx, BATCH_SIZE);
+			ret = xq_enq(&xqp->tx, descs, BATCH_SIZE);
+			lassert(ret == 0);
+			kick_tx(xqp->sfd);
+
+			xqp->outstanding_tx += BATCH_SIZE;
+			idx += BATCH_SIZE;
+			idx %= NUM_BUFFERS;
+		}
+
+		complete_tx_only(xqp, descs);
+	}
+}
+
+static void l2fwd(struct xdp_queue_pair *xqp)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			complete_tx_l2fwd(xqp, descs);
+
+			rcvd = xq_deq(&xqp->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			char *pkt = xq_get_data(xqp, descs[i].idx,
+						descs[i].offset);
+
+			swap_mac_addresses(pkt);
+#if DEBUG_HEXDUMP
+			char buf[32];
+			__u32 idx = descs[i].idx;
+
+			sprintf(buf, "idx=%d", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		rx_npkts += rcvd;
+
+		ret = xq_enq(&xqp->tx, descs, rcvd);
+		lassert(ret == 0);
+		xqp->outstanding_tx += rcvd;
+		kick_tx(xqp->sfd);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	struct xdp_queue_pair *xqp;
+	char xdp_filename[256];
+	pthread_t pt;
+	int ret;
+
+	parse_command_line(argc, argv);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n", strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(xdp_filename)) {
+		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n", strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	if (set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
+		fprintf(stderr, "ERROR: link set xdp fd failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	xqp = xsk_configure();
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+	signal(SIGABRT, int_exit);
+
+	start_time = get_nsecs();
+
+	setlocale(LC_ALL, "");
+
+	ret = pthread_create(&pt, NULL, poller, NULL);
+	lassert(ret == 0);
+
+	if (opt_bench == BENCH_RXDROP)
+		rx_drop(xqp);
+	else if (opt_bench == BENCH_TXONLY)
+		tx_only(xqp);
+	else
+		l2fwd(xqp);
+
+	return 0;
+}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 10/24] netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (8 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 09/24] samples/bpf: added xdpsock program Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 11/24] netdevice: added ndo for transmitting a packet from an XDP socket Björn Töpel
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, ndo_bpf is extended with two sub-commands: XDP_REGISTER_XSK and
XDP_UNREGISTER_XSK. They are used to support zero copy allocators with
XDP sockets.
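
As a rough sketch (not part of this patch; the foo_* names are
placeholders), a driver's ndo_bpf implementation would dispatch on the
new commands roughly like this:

	static int foo_bpf(struct net_device *dev, struct netdev_bpf *bpf)
	{
		struct foo_priv *priv = netdev_priv(dev);

		switch (bpf->command) {
		case XDP_REGISTER_XSK:
			/* All xsk fields are valid: remember the Tx/Rx
			 * callbacks so this queue can run in zero copy mode.
			 */
			return foo_xsk_register(priv, bpf->xsk.tx_parms,
						bpf->xsk.rx_parms,
						bpf->xsk.queue_id);
		case XDP_UNREGISTER_XSK:
			/* Only queue_id is valid here. */
			return foo_xsk_unregister(priv, bpf->xsk.queue_id);
		default:
			/* XDP_SETUP_PROG and friends handled as before. */
			return -EINVAL;
		}
	}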

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 17 +++++++++++++++++
 include/net/xdp_sock.h    | 30 +++++++++++++++++++++++++++++-
 2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 36cc7e92bd8e..a997649dd5cc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -810,10 +810,18 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_DESTROY,
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
+	/* Registers callbacks with the driver that are used to support
+	 * AF_XDP sockets in zero copy mode.
+	 */
+	XDP_REGISTER_XSK,
+	/* Unregisters an AF_XDP socket in zero copy mode. */
+	XDP_UNREGISTER_XSK,
 };
 
 struct bpf_prog_offload_ops;
 struct netlink_ext_ack;
+struct xsk_tx_parms;
+struct xsk_rx_parms;
 
 struct netdev_bpf {
 	enum bpf_netdev_command command;
@@ -844,6 +852,15 @@ struct netdev_bpf {
 		struct {
 			struct bpf_offloaded_map *offmap;
 		};
+		/* XDP_REGISTER_XSK, XDP_UNREGISTER_XSK
+		 * All fields used for XDP_REGISTER_XSK.
+		 * queue_id is the only field used for XDP_UNREGISTER_XSK.
+		 */
+		struct {
+			struct xsk_tx_parms *tx_parms;
+			struct xsk_rx_parms *rx_parms;
+			u32 queue_id;
+		} xsk;
 	};
 };
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 132489fe0e70..866ea7191217 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -1,8 +1,36 @@
 #ifndef _LINUX_AF_XDP_SOCK_H
 #define _LINUX_AF_XDP_SOCK_H
 
-struct xdp_sock;
+#include <linux/dma-direction.h>
+
+struct buff_pool;
+struct net_device;
 struct xdp_buff;
+struct xdp_sock;
+
+/* These two functions have to be called from the same serializing context,
+ * for example the same NAPI context.
+ * They should not be called for the XDP_SKB path, only XDP_DRV.
+ */
+
+struct xsk_tx_parms {
+	void (*tx_completion)(u32 start, u32 npackets,
+			      unsigned long ctx1, unsigned long ctx2);
+	unsigned long ctx1;
+	unsigned long ctx2;
+	int (*get_tx_packet)(struct net_device *dev, u32 queue_id,
+			     dma_addr_t *dma, void **data, u32 *len,
+			     u32 *offset);
+};
+
+struct xsk_rx_parms {
+	struct buff_pool *buff_pool;
+	int (*dma_map)(struct buff_pool *bp, struct device *dev,
+		       enum dma_data_direction dir,
+		       unsigned long attr);
+	void *error_report_ctx;
+	void (*error_report)(void *ctx, int errno);
+};
 
 #ifdef CONFIG_XDP_SOCKETS
 int xsk_generic_rcv(struct xdp_buff *xdp);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 11/24] netdevice: added ndo for transmitting a packet from an XDP socket
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (9 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 10/24] netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 12/24] xsk: add iterator functions to xsk_ring Björn Töpel
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

The new ndo_xdp_xmit_xsk callback is only used when a zero copy
allocator is in use.
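
A minimal sketch of the intended call site (illustrative only; it
assumes the socket keeps its bound netdev and queue id around as
xs->dev and xs->queue_id):

	static int xsk_zc_xmit(struct xdp_sock *xs)
	{
		struct net_device *dev = xs->dev;

		if (!dev->netdev_ops->ndo_xdp_xmit_xsk)
			return -EOPNOTSUPP;

		/* The driver pulls packets for this queue through the
		 * get_tx_packet() callback registered via XDP_REGISTER_XSK.
		 */
		return dev->netdev_ops->ndo_xdp_xmit_xsk(dev, xs->queue_id);
	}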

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a997649dd5cc..2b196fa8db6a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1178,6 +1178,9 @@ struct dev_ifalias {
  * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
+ * int	(*ndo_xdp_xmit_xsk)(struct net_device *dev, u32 queue_id);
+ *      This function is used to transmit a packet from an XDP socket
+ *      when a zero copy allocator is used.
  * void (*ndo_xdp_flush)(struct net_device *dev);
  *	This function is used to inform the driver to flush a particular
  *	xdp tx queue. Must be called on same CPU as xdp_xmit.
@@ -1367,6 +1370,8 @@ struct net_device_ops {
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
 						struct xdp_buff *xdp);
+	int			(*ndo_xdp_xmit_xsk)(struct net_device *dev,
+						    u32 queue_id);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 12/24] xsk: add iterator functions to xsk_ring
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (10 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 11/24] netdevice: added ndo for transmitting a packet from an XDP socket Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 13/24] i40e: introduce external allocator support Björn Töpel
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Add packet-array-like functionality that acts directly on the
user/kernel shared ring. We'll use this in the zero-copy Rx scenario.

TODO Better naming...
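
For reference, a minimal consumer sketch of the new iterator API (the
example_* helpers are illustrative, not part of the patch; no locking
is shown and the caller is assumed to be the single consumer):

	/* Pull up to budget buffer ids off the user/kernel shared ring. */
	static int example_deq_ids(struct xsk_queue *q, u32 *ids, int budget)
	{
		struct xskq_iter it;
		int n = 0;

		it = xskq_deq_iter(q, budget);
		while (!xskq_iter_end(&it)) {
			ids[n++] = xskq_deq_iter_get_id(q, &it);
			xskq_deq_iter_next(q, &it);
		}
		xskq_deq_iter_done(q, &it);

		return n;
	}

	/* ...and hand completed buffers back, batching the flush. */
	static void example_enq_done(struct xsk_queue *q, u32 id, u32 len,
				     u16 offset)
	{
		xskq_enq_lazy(q, id, len, offset);
	}
	/* xskq_enq_flush(q) is then called once per batch, e.g. per napi poll. */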

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk_ring.c |   3 +-
 net/xdp/xsk_ring.h | 136 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/net/xdp/xsk_ring.c b/net/xdp/xsk_ring.c
index 11b590506ddf..f154ddfabcfc 100644
--- a/net/xdp/xsk_ring.c
+++ b/net/xdp/xsk_ring.c
@@ -41,7 +41,8 @@ struct xsk_queue *xskq_create(u32 nentries)
 	q->queue_ops.enqueue = xskq_enqueue_from_array;
 	q->queue_ops.enqueue_completed = xskq_enqueue_completed_from_array;
 	q->queue_ops.dequeue = xskq_dequeue_to_array;
-	q->used_idx = 0;
+	q->used_idx_head = 0;
+	q->used_idx_tail = 0;
 	q->last_avail_idx = 0;
 	q->ring_mask = nentries - 1;
 	q->num_free = 0;
diff --git a/net/xdp/xsk_ring.h b/net/xdp/xsk_ring.h
index c9d61195ab2d..43c841d55093 100644
--- a/net/xdp/xsk_ring.h
+++ b/net/xdp/xsk_ring.h
@@ -27,7 +27,8 @@ struct xsk_queue {
 	struct xsk_user_queue queue_ops;
 	struct xdp_desc *ring;
 
-	u32 used_idx;
+	u32 used_idx_head;
+	u32 used_idx_tail;
 	u32 last_avail_idx;
 	u32 ring_mask;
 	u32 num_free;
@@ -51,8 +52,7 @@ static inline unsigned int xsk_get_data_headroom(struct xsk_umem *umem)
  *
  * Returns true if the entry is a valid, otherwise false
  **/
-static inline bool xskq_is_valid_entry(struct xsk_queue *q,
-				       struct xdp_desc *d)
+static inline bool xskq_is_valid_entry(struct xsk_queue *q, struct xdp_desc *d)
 {
 	unsigned int buff_len;
 
@@ -115,7 +115,7 @@ static inline int xskq_nb_avail(struct xsk_queue *q, int dcnt)
 static inline int xskq_enqueue(struct xsk_queue *q,
 			       const struct xdp_desc *d, int dcnt)
 {
-	unsigned int used_idx = q->used_idx;
+	unsigned int used_idx = q->used_idx_tail;
 	int i;
 
 	if (q->num_free < dcnt)
@@ -136,11 +136,12 @@ static inline int xskq_enqueue(struct xsk_queue *q,
 	smp_wmb();
 
 	for (i = dcnt - 1; i >= 0; i--) {
-		unsigned int idx = (q->used_idx + i) & q->ring_mask;
+		unsigned int idx = (q->used_idx_tail + i) & q->ring_mask;
 
 		q->ring[idx].flags = d[i].flags & ~XDP_DESC_KERNEL;
 	}
-	q->used_idx += dcnt;
+	q->used_idx_head += dcnt;
+	q->used_idx_tail += dcnt;
 
 	return 0;
 }
@@ -157,7 +158,7 @@ static inline int xskq_enqueue_from_array(struct xsk_packet_array *a,
 					  u32 dcnt)
 {
 	struct xsk_queue *q = (struct xsk_queue *)a->q_ops;
-	unsigned int used_idx = q->used_idx;
+	unsigned int used_idx = q->used_idx_tail;
 	struct xdp_desc *d = a->items;
 	int i;
 
@@ -180,12 +181,13 @@ static inline int xskq_enqueue_from_array(struct xsk_packet_array *a,
 	smp_wmb();
 
 	for (i = dcnt - 1; i >= 0; i--) {
-		unsigned int idx = (q->used_idx + i) & q->ring_mask;
+		unsigned int idx = (q->used_idx_tail + i) & q->ring_mask;
 		unsigned int didx = (a->start + i) & a->mask;
 
 		q->ring[idx].flags = d[didx].flags & ~XDP_DESC_KERNEL;
 	}
-	q->used_idx += dcnt;
+	q->used_idx_tail += dcnt;
+	q->used_idx_head += dcnt;
 
 	return 0;
 }
@@ -204,7 +206,7 @@ static inline int xskq_enqueue_completed_from_array(struct xsk_packet_array *a,
 						    u32 dcnt)
 {
 	struct xsk_queue *q = (struct xsk_queue *)a->q_ops;
-	unsigned int used_idx = q->used_idx;
+	unsigned int used_idx = q->used_idx_tail;
 	struct xdp_desc *d = a->items;
 	int i, j;
 
@@ -233,13 +235,14 @@ static inline int xskq_enqueue_completed_from_array(struct xsk_packet_array *a,
 	smp_wmb();
 
 	for (j = i - 1; j >= 0; j--) {
-		unsigned int idx = (q->used_idx + j) & q->ring_mask;
+		unsigned int idx = (q->used_idx_tail + j) & q->ring_mask;
 		unsigned int didx = (a->start + j) & a->mask;
 
 		q->ring[idx].flags = d[didx].flags & ~XDP_DESC_KERNEL;
 	}
 	q->num_free -= i;
-	q->used_idx += i;
+	q->used_idx_tail += i;
+	q->used_idx_head += i;
 
 	return i;
 }
@@ -301,6 +304,115 @@ static inline void xskq_set_buff_info(struct xsk_queue *q,
 	q->validation = validation;
 }
 
+/* --- */
+
+struct xskq_iter {
+	unsigned int head;
+	unsigned int tail;
+};
+
+static inline bool xskq_iter_end(struct xskq_iter *it)
+{
+	return it->tail == it->head;
+}
+
+static inline void xskq_iter_validate(struct xsk_queue *q, struct xskq_iter *it)
+{
+	while (it->head != it->tail) {
+		unsigned int idx = it->head & q->ring_mask;
+		struct xdp_desc *d, *du;
+
+		d = &q->ring[idx];
+		if (xskq_is_valid_entry(q, d))
+			break;
+
+		/* Slow error path! */
+		du = &q->ring[q->used_idx_tail & q->ring_mask];
+		du->idx = d->idx;
+		du->len = d->len;
+		du->offset = d->offset;
+		du->error = EINVAL;
+
+		q->last_avail_idx++;
+		it->head++;
+
+		smp_wmb();
+
+		du->flags = d->flags & ~XDP_DESC_KERNEL;
+	}
+}
+
+static inline struct xskq_iter xskq_deq_iter(struct xsk_queue *q,
+					     int cnt)
+{
+	struct xskq_iter it;
+
+	it.head = q->last_avail_idx;
+	it.tail = it.head + (unsigned int)xskq_nb_avail(q, cnt);
+
+	smp_rmb();
+
+	xskq_iter_validate(q, &it);
+
+	return it;
+}
+
+static inline void xskq_deq_iter_next(struct xsk_queue *q, struct xskq_iter *it)
+{
+	it->head++;
+	xskq_iter_validate(q, it);
+}
+
+static inline void xskq_deq_iter_done(struct xsk_queue *q, struct xskq_iter *it)
+{
+	int entries = it->head - q->last_avail_idx;
+
+	q->num_free += entries;
+	q->last_avail_idx = it->head;
+}
+
+static inline u32 xskq_deq_iter_get_id(struct xsk_queue *q,
+				       struct xskq_iter *it)
+{
+	return q->ring[it->head & q->ring_mask].idx;
+}
+
+static inline void xskq_return_id(struct xsk_queue *q, u32 id)
+{
+	struct xdp_desc d = { .idx = id };
+
+	WARN(xskq_enqueue(q, &d, 1), "%s failed!\n", __func__);
+}
+
+static inline void xskq_enq_lazy(struct xsk_queue *q,
+				 u32 id, u32 len, u16 offset)
+{
+	unsigned int idx;
+
+	if (q->num_free == 0) {
+		WARN(1, "%s xsk_queue deq/enq out of sync!\n", __func__);
+		return;
+	}
+
+	q->num_free--;
+	idx = (q->used_idx_tail++) & q->ring_mask;
+	q->ring[idx].idx = id;
+	q->ring[idx].len = len;
+	q->ring[idx].offset = offset;
+	q->ring[idx].error = 0;
+}
+
+static inline void xskq_enq_flush(struct xsk_queue *q)
+{
+	smp_wmb();
+
+	while (q->used_idx_head != q->used_idx_tail) {
+		unsigned int idx = (q->used_idx_head++) & q->ring_mask;
+
+		q->ring[idx].flags = 0;
+	}
+}
+
 struct xsk_queue *xskq_create(u32 nentries);
 void xskq_destroy(struct xsk_queue *q_ops);
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 13/24] i40e: introduce external allocator support
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (11 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 12/24] xsk: add iterator functions to xsk_ring Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 14/24] i40e: implemented page recycling buff_pool Björn Töpel
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Here, buff_pool is introduced, which is an allocator/pool for Rx
frames. This commit pulls out the recycling allocator in i40e, starts
using the new buff_pool API, and adds a simple non-recycling page
allocating buff_pool implementation. Future commits will reintroduce a
buff_pool page recycling/flipping implementation.
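
A condensed sketch of how the Rx refill path consumes the new API
after this patch (example_fill_one() is a made-up helper; it mirrors
what the reworked i40e_alloc_rx_buffers() in the diff below does):

	static int example_fill_one(struct i40e_ring *rx_ring,
				    union i40e_rx_desc *rx_desc,
				    struct i40e_rx_buffer *bi)
	{
		unsigned int headroom = rx_ring->rx_buf_hr;
		unsigned long handle;
		dma_addr_t dma;

		if (bpool_alloc(rx_ring->bpool, &handle))
			return -ENOMEM;

		dma = bpool_buff_dma(rx_ring->bpool, handle);
		bpool_buff_dma_sync_dev(rx_ring->bpool, handle, headroom,
					rx_ring->rx_buf_len);

		bi->handle = handle;
		rx_desc->read.pkt_addr = cpu_to_le64(dma + headroom);

		return 0;
	}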

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/Makefile       |   3 +-
 drivers/net/ethernet/intel/i40e/buff_pool.c    | 285 ++++++++++++++
 drivers/net/ethernet/intel/i40e/buff_pool.h    |  70 ++++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c    |  24 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 510 +++++++++----------------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |  64 +---
 7 files changed, 541 insertions(+), 416 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/buff_pool.c
 create mode 100644 drivers/net/ethernet/intel/i40e/buff_pool.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 3da482c3d68d..bfdf9ce3e7f0 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -45,6 +45,7 @@ i40e-objs := i40e_main.o \
 	i40e_txrx.o	\
 	i40e_ptp.o	\
 	i40e_client.o   \
-	i40e_virtchnl_pf.o
+	i40e_virtchnl_pf.o \
+	buff_pool.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/buff_pool.c b/drivers/net/ethernet/intel/i40e/buff_pool.c
new file mode 100644
index 000000000000..8c51f61ca71d
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/buff_pool.c
@@ -0,0 +1,285 @@
+#include "buff_pool.h"
+
+#include "i40e.h"
+#include "i40e_txrx.h"
+
+struct buff_pool_ops {
+	int (*alloc)(void *pool, unsigned long *handle);
+	void (*free)(void *pool, unsigned long handle);
+	unsigned int (*buff_size)(void *pool);
+	unsigned int (*total_buff_size)(void *pool);
+	unsigned int (*buff_headroom)(void *pool);
+	unsigned int (*buff_truesize)(void *pool);
+	void *(*buff_ptr)(void *pool, unsigned long handle);
+	int (*buff_convert_to_page)(void *pool,
+				    unsigned long handle,
+				    struct page **pg, unsigned int *pg_off);
+	dma_addr_t (*buff_dma)(void *pool,
+			       unsigned long handle);
+	void (*buff_dma_sync_cpu)(void *pool,
+				  unsigned long handle,
+				  unsigned int off,
+				  unsigned int size);
+	void (*buff_dma_sync_dev)(void *pool,
+				  unsigned long handle,
+				  unsigned int off,
+				  unsigned int size);
+};
+
+int bpool_alloc(struct buff_pool *pool, unsigned long *handle)
+{
+	return pool->ops->alloc(pool->pool, handle);
+}
+
+void bpool_free(struct buff_pool *pool, unsigned long handle)
+{
+	pool->ops->free(pool->pool, handle);
+}
+
+unsigned int bpool_buff_size(struct buff_pool *pool)
+{
+	return pool->ops->buff_size(pool->pool);
+}
+
+unsigned int bpool_total_buff_size(struct buff_pool *pool)
+{
+	return pool->ops->total_buff_size(pool->pool);
+}
+
+unsigned int bpool_buff_headroom(struct buff_pool *pool)
+{
+	return pool->ops->buff_headroom(pool->pool);
+}
+
+unsigned int bpool_buff_truesize(struct buff_pool *pool)
+{
+	return pool->ops->buff_truesize(pool->pool);
+}
+
+void *bpool_buff_ptr(struct buff_pool *pool, unsigned long handle)
+{
+	return pool->ops->buff_ptr(pool->pool, handle);
+}
+
+int bpool_buff_convert_to_page(struct buff_pool *pool, unsigned long handle,
+			       struct page **pg, unsigned int *pg_off)
+{
+	return pool->ops->buff_convert_to_page(pool->pool, handle, pg, pg_off);
+}
+
+dma_addr_t bpool_buff_dma(struct buff_pool *pool,
+			  unsigned long handle)
+{
+	return pool->ops->buff_dma(pool->pool, handle);
+}
+
+void bpool_buff_dma_sync_cpu(struct buff_pool *pool,
+			     unsigned long handle,
+			     unsigned int off,
+			     unsigned int size)
+{
+	pool->ops->buff_dma_sync_cpu(pool->pool, handle, off, size);
+}
+
+void bpool_buff_dma_sync_dev(struct buff_pool *pool,
+			     unsigned long handle,
+			     unsigned int off,
+			     unsigned int size)
+{
+	pool->ops->buff_dma_sync_dev(pool->pool, handle, off, size);
+}
+
+/* Naive, non-recycling allocator. */
+
+struct i40e_bp_pool {
+	struct device *dev;
+};
+
+struct i40e_bp_header {
+	dma_addr_t dma;
+};
+
+#define I40E_BPHDR_ALIGNED_SIZE ALIGN(sizeof(struct i40e_bp_header),	\
+				     SMP_CACHE_BYTES)
+
+static int i40e_bp_alloc(void *pool, unsigned long *handle)
+{
+	struct i40e_bp_pool *impl = (struct i40e_bp_pool *)pool;
+	struct i40e_bp_header *hdr;
+	struct page *pg;
+	dma_addr_t dma;
+
+	pg = dev_alloc_pages(0);
+	if (unlikely(!pg))
+		return -ENOMEM;
+
+	dma = dma_map_page_attrs(impl->dev, pg, 0,
+				 PAGE_SIZE,
+				 DMA_FROM_DEVICE,
+				 I40E_RX_DMA_ATTR);
+
+	if (dma_mapping_error(impl->dev, dma)) {
+		__free_pages(pg, 0);
+		return -ENOMEM;
+	}
+
+	hdr = (struct i40e_bp_header *)page_address(pg);
+	hdr->dma = dma;
+
+	*handle = (unsigned long)(((void *)hdr) + I40E_BPHDR_ALIGNED_SIZE);
+
+	return 0;
+}
+
+static void i40e_bp_free(void *pool, unsigned long handle)
+{
+	struct i40e_bp_pool *impl = (struct i40e_bp_pool *)pool;
+	struct i40e_bp_header *hdr;
+
+	hdr = (struct i40e_bp_header *)(handle & PAGE_MASK);
+
+	dma_unmap_page_attrs(impl->dev, hdr->dma, PAGE_SIZE,
+			     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+	page_frag_free(hdr);
+}
+
+static unsigned int i40e_bp_buff_size(void *pool)
+{
+	(void)pool;
+	return I40E_RXBUFFER_3072;
+}
+
+static unsigned int i40e_bp_total_buff_size(void *pool)
+{
+	(void)pool;
+	return PAGE_SIZE - I40E_BPHDR_ALIGNED_SIZE -
+		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+}
+
+static unsigned int i40e_bp_buff_headroom(void *pool)
+{
+	(void)pool;
+	return PAGE_SIZE - I40E_BPHDR_ALIGNED_SIZE -
+		SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) -
+		I40E_RXBUFFER_3072;
+}
+
+static unsigned int i40e_bp_buff_truesize(void *pool)
+{
+	(void)pool;
+	return PAGE_SIZE;
+}
+
+static void *i40e_bp_buff_ptr(void *pool, unsigned long handle)
+{
+	return (void *)handle;
+}
+
+static int i40e_bp_buff_convert_to_page(void *pool,
+					unsigned long handle,
+					struct page **pg, unsigned int *pg_off)
+{
+	struct i40e_bp_pool *impl = (struct i40e_bp_pool *)pool;
+	struct i40e_bp_header *hdr;
+
+	hdr = (struct i40e_bp_header *)(handle & PAGE_MASK);
+
+	dma_unmap_page_attrs(impl->dev, hdr->dma, PAGE_SIZE,
+			     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+
+	*pg = virt_to_page(hdr);
+	*pg_off = I40E_BPHDR_ALIGNED_SIZE;
+
+	return 0;
+}
+
+static dma_addr_t i40e_bp_buff_dma(void *pool,
+				   unsigned long handle)
+{
+	struct i40e_bp_header *hdr;
+
+	hdr = (struct i40e_bp_header *)(handle & PAGE_MASK);
+
+	return hdr->dma + I40E_BPHDR_ALIGNED_SIZE;
+}
+
+static void i40e_bp_buff_dma_sync_cpu(void *pool,
+				      unsigned long handle,
+				      unsigned int off,
+				      unsigned int size)
+{
+	struct i40e_bp_pool *impl = (struct i40e_bp_pool *)pool;
+	struct i40e_bp_header *hdr;
+
+	off += I40E_BPHDR_ALIGNED_SIZE;
+
+	hdr = (struct i40e_bp_header *)(handle & PAGE_MASK);
+	dma_sync_single_range_for_cpu(impl->dev, hdr->dma, off, size,
+				      DMA_FROM_DEVICE);
+}
+
+static void i40e_bp_buff_dma_sync_dev(void *pool,
+				      unsigned long handle,
+				      unsigned int off,
+				      unsigned int size)
+{
+	struct i40e_bp_pool *impl = (struct i40e_bp_pool *)pool;
+	struct i40e_bp_header *hdr;
+
+	off += I40E_BPHDR_ALIGNED_SIZE;
+
+	hdr = (struct i40e_bp_header *)(handle & PAGE_MASK);
+	dma_sync_single_range_for_device(impl->dev, hdr->dma, off, size,
+					 DMA_FROM_DEVICE);
+}
+
+struct buff_pool *i40e_buff_pool_create(struct device *dev)
+{
+	struct i40e_bp_pool *pool_impl;
+	struct buff_pool_ops *pool_ops;
+	struct buff_pool *pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	pool_impl = kzalloc(sizeof(*pool_impl), GFP_KERNEL);
+	if (!pool_impl) {
+		kfree(pool);
+		return NULL;
+	}
+
+	pool_ops = kzalloc(sizeof(*pool_ops), GFP_KERNEL);
+	if (!pool_ops) {
+		kfree(pool_impl);
+		kfree(pool);
+		return NULL;
+	}
+
+	pool_ops->alloc = i40e_bp_alloc;
+	pool_ops->free = i40e_bp_free;
+	pool_ops->buff_size = i40e_bp_buff_size;
+	pool_ops->total_buff_size = i40e_bp_total_buff_size;
+	pool_ops->buff_headroom = i40e_bp_buff_headroom;
+	pool_ops->buff_truesize = i40e_bp_buff_truesize;
+	pool_ops->buff_ptr = i40e_bp_buff_ptr;
+	pool_ops->buff_convert_to_page = i40e_bp_buff_convert_to_page;
+	pool_ops->buff_dma = i40e_bp_buff_dma;
+	pool_ops->buff_dma_sync_cpu = i40e_bp_buff_dma_sync_cpu;
+	pool_ops->buff_dma_sync_dev = i40e_bp_buff_dma_sync_dev;
+
+	pool_impl->dev = dev;
+
+	pool->pool = pool_impl;
+	pool->ops = pool_ops;
+
+	return pool;
+}
+
+void i40e_buff_pool_destroy(struct buff_pool *pool)
+{
+	kfree(pool->ops);
+	kfree(pool->pool);
+	kfree(pool);
+}
+
diff --git a/drivers/net/ethernet/intel/i40e/buff_pool.h b/drivers/net/ethernet/intel/i40e/buff_pool.h
new file mode 100644
index 000000000000..933881e14ac0
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/buff_pool.h
@@ -0,0 +1,70 @@
+#ifndef BUFF_POOL_H_
+#define BUFF_POOL_H_
+
+#include <linux/types.h>
+
+struct page;
+struct device;
+
+struct buff_pool_ops;
+
+struct buff_pool {
+	void *pool;
+	struct buff_pool_ops *ops;
+};
+
+/* Allocates a new buffer from the pool */
+int bpool_alloc(struct buff_pool *pool, unsigned long *handle);
+
+/* Returns a buffer originating from the pool, back to the pool */
+void bpool_free(struct buff_pool *pool, unsigned long handle);
+
+/* Returns the size of the buffer, w/o headroom. This is what the pool
+ * creator passed to the constructor.
+ */
+unsigned int bpool_buff_size(struct buff_pool *pool);
+
+/* Returns the size of the buffer, plus additional headroom (if
+ * any).
+ */
+unsigned int bpool_total_buff_size(struct buff_pool *pool);
+
+/* Returns additional headroom (if any) */
+unsigned int bpool_buff_headroom(struct buff_pool *pool);
+
+/* Returns the truesize (as for skbuff) */
+unsigned int bpool_buff_truesize(struct buff_pool *pool);
+
+/* Returns the kernel virtual address to the handle. */
+void *bpool_buff_ptr(struct buff_pool *pool, unsigned long handle);
+
+/* Converts a handle to a page. After a successful call, the handle is
+ * stale and should not be used and should be considered
+ * freed. Callers need to manually clean up the returned page (using
+ * page_free).
+ */
+int bpool_buff_convert_to_page(struct buff_pool *pool, unsigned long handle,
+			       struct page **pg, unsigned int *pg_off);
+
+/* Returns the dma address of a buffer */
+dma_addr_t bpool_buff_dma(struct buff_pool *pool,
+			  unsigned long handle);
+
+/* DMA sync for CPU */
+void bpool_buff_dma_sync_cpu(struct buff_pool *pool,
+			     unsigned long handle,
+			     unsigned int off,
+			     unsigned int size);
+
+/* DMA sync for device */
+void bpool_buff_dma_sync_dev(struct buff_pool *pool,
+			     unsigned long handle,
+			     unsigned int off,
+			     unsigned int size);
+/* ---- */
+
+struct buff_pool *i40e_buff_pool_create(struct device *dev);
+void i40e_buff_pool_destroy(struct buff_pool *pool);
+
+#endif /* BUFF_POOL_H_ */
+
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 2f5bee713fef..505e4bea01fb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1647,7 +1647,6 @@ static int i40e_set_ringparam(struct net_device *netdev,
 			 */
 			rx_rings[i].next_to_use = 0;
 			rx_rings[i].next_to_clean = 0;
-			rx_rings[i].next_to_alloc = 0;
 			/* do a struct copy */
 			*vsi->rx_rings[i] = rx_rings[i];
 		}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 09efb9dd09f3..7e82b7c6c0b7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -39,6 +39,7 @@
  */
 #define CREATE_TRACE_POINTS
 #include "i40e_trace.h"
+#include "buff_pool.h"
 
 const char i40e_driver_name[] = "i40e";
 static const char i40e_driver_string[] =
@@ -3217,7 +3218,9 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	ring->bpool = i40e_buff_pool_create(ring->dev);
+	ring->rx_buf_hr = (u16)bpool_buff_headroom(ring->bpool);
+	ring->rx_buf_len = (u16)bpool_buff_size(ring->bpool);
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3312,20 +3315,8 @@ static int i40e_vsi_configure_rx(struct i40e_vsi *vsi)
 	int err = 0;
 	u16 i;
 
-	if (!vsi->netdev || (vsi->back->flags & I40E_FLAG_LEGACY_RX)) {
-		vsi->max_frame = I40E_MAX_RXBUFFER;
-		vsi->rx_buf_len = I40E_RXBUFFER_2048;
-#if (PAGE_SIZE < 8192)
-	} else if (!I40E_2K_TOO_SMALL_WITH_PADDING &&
-		   (vsi->netdev->mtu <= ETH_DATA_LEN)) {
-		vsi->max_frame = I40E_RXBUFFER_1536 - NET_IP_ALIGN;
-		vsi->rx_buf_len = I40E_RXBUFFER_1536 - NET_IP_ALIGN;
-#endif
-	} else {
-		vsi->max_frame = I40E_MAX_RXBUFFER;
-		vsi->rx_buf_len = (PAGE_SIZE < 8192) ? I40E_RXBUFFER_3072 :
-						       I40E_RXBUFFER_2048;
-	}
+	vsi->max_frame = I40E_MAX_RXBUFFER;
+	vsi->rx_buf_len = I40E_RXBUFFER_3072;
 
 	/* set up individual rings */
 	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
@@ -11601,6 +11592,9 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	bool need_reset;
 	int i;
 
+	/* XXX What's the correct behavior here, when we can have
+	 * different rx_buf_lens per ring?
+	 */
 	/* Don't allow frames that span over multiple buffers */
 	if (frame_size > vsi->rx_buf_len)
 		return -EINVAL;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f0feae92a34a..aa29013acf0c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -31,6 +31,7 @@
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
+#include "buff_pool.h"
 
 static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
 				u32 td_tag)
@@ -1090,32 +1091,6 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc)
 	return false;
 }
 
-/**
- * i40e_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
-			       struct i40e_rx_buffer *old_buff)
-{
-	struct i40e_rx_buffer *new_buff;
-	u16 nta = rx_ring->next_to_alloc;
-
-	new_buff = &rx_ring->rx_bi[nta];
-
-	/* update, and store next to alloc */
-	nta++;
-	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
-	/* transfer page from old buffer to new buffer */
-	new_buff->dma		= old_buff->dma;
-	new_buff->page		= old_buff->page;
-	new_buff->page_offset	= old_buff->page_offset;
-	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
-}
-
 /**
  * i40e_rx_is_programming_status - check for programming status descriptor
  * @qw: qword representing status_error_len in CPU ordering
@@ -1161,12 +1136,8 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
 
 	prefetch(I40E_RX_DESC(rx_ring, ntc));
 
-	/* place unused page back on the ring */
-	i40e_reuse_rx_page(rx_ring, rx_buffer);
-	rx_ring->rx_stats.page_reuse_count++;
-
-	/* clear contents of buffer_info */
-	rx_buffer->page = NULL;
+	bpool_free(rx_ring->bpool, rx_buffer->handle);
+	rx_buffer->handle = 0;
 
 	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
 		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
@@ -1246,28 +1217,17 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	for (i = 0; i < rx_ring->count; i++) {
 		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		if (!rx_bi->page)
+		if (!rx_bi->handle)
 			continue;
 
 		/* Invalidate cache lines that may have been written to by
 		 * device so that we avoid corrupting memory.
 		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     I40E_RX_DMA_ATTR);
+		bpool_buff_dma_sync_cpu(rx_ring->bpool, rx_bi->handle, 0,
+					rx_ring->rx_buf_len);
 
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
-
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+		bpool_free(rx_ring->bpool, rx_bi->handle);
+		rx_bi->handle = 0;
 	}
 
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
@@ -1276,7 +1236,6 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	/* Zero out the descriptor ring */
 	memset(rx_ring->desc, 0, rx_ring->size);
 
-	rx_ring->next_to_alloc = 0;
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 }
@@ -1296,6 +1255,9 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
 
+	i40e_buff_pool_destroy(rx_ring->bpool);
+	rx_ring->bpool = NULL;
+
 	if (rx_ring->desc) {
 		dma_free_coherent(rx_ring->dev, rx_ring->size,
 				  rx_ring->desc, rx_ring->dma);
@@ -1336,7 +1298,6 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 		goto err;
 	}
 
-	rx_ring->next_to_alloc = 0;
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
@@ -1366,9 +1327,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 {
 	rx_ring->next_to_use = val;
 
-	/* update next to alloc since we have filled the ring */
-	rx_ring->next_to_alloc = val;
-
 	/* Force memory writes to complete before letting h/w
 	 * know there are new descriptors to fetch.  (Only
 	 * applicable for weak-ordered memory model archs,
@@ -1378,17 +1336,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	writel(val, rx_ring->tail);
 }
 
-/**
- * i40e_rx_offset - Return expected offset into page to access data
- * @rx_ring: Ring we are requesting offset of
- *
- * Returns the offset value for ring into the data buffer.
- */
-static inline unsigned int i40e_rx_offset(struct i40e_ring *rx_ring)
-{
-	return ring_uses_build_skb(rx_ring) ? I40E_SKB_PAD : 0;
-}
-
 /**
  * i40e_alloc_mapped_page - recycle or make a new page
  * @rx_ring: ring to use
@@ -1400,43 +1347,14 @@ static inline unsigned int i40e_rx_offset(struct i40e_ring *rx_ring)
 static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
 				   struct i40e_rx_buffer *bi)
 {
-	struct page *page = bi->page;
-	dma_addr_t dma;
-
-	/* since we are recycling buffers we should seldom need to alloc */
-	if (likely(page)) {
-		rx_ring->rx_stats.page_reuse_count++;
-		return true;
-	}
-
-	/* alloc new page for storage */
-	page = dev_alloc_pages(i40e_rx_pg_order(rx_ring));
-	if (unlikely(!page)) {
-		rx_ring->rx_stats.alloc_page_failed++;
-		return false;
-	}
-
-	/* map page for use */
-	dma = dma_map_page_attrs(rx_ring->dev, page, 0,
-				 i40e_rx_pg_size(rx_ring),
-				 DMA_FROM_DEVICE,
-				 I40E_RX_DMA_ATTR);
+	unsigned long handle;
+	int err;
 
-	/* if mapping failed free memory back to system since
-	 * there isn't much point in holding memory we can't use
-	 */
-	if (dma_mapping_error(rx_ring->dev, dma)) {
-		__free_pages(page, i40e_rx_pg_order(rx_ring));
-		rx_ring->rx_stats.alloc_page_failed++;
+	err = bpool_alloc(rx_ring->bpool, &handle);
+	if (err)
 		return false;
-	}
-
-	bi->dma = dma;
-	bi->page = page;
-	bi->page_offset = i40e_rx_offset(rx_ring);
 
-	page_ref_add(page, USHRT_MAX - 1);
-	bi->pagecnt_bias = USHRT_MAX;
+	bi->handle = handle;
 
 	return true;
 }
@@ -1480,19 +1398,19 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rx_ring, u16 cleaned_count)
 	bi = &rx_ring->rx_bi[ntu];
 
 	do {
+		unsigned int headroom;
+		dma_addr_t dma;
+
 		if (!i40e_alloc_mapped_page(rx_ring, bi))
 			goto no_buffers;
 
-		/* sync the buffer for use by the device */
-		dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
-						 bi->page_offset,
-						 rx_ring->rx_buf_len,
-						 DMA_FROM_DEVICE);
+		dma = bpool_buff_dma(rx_ring->bpool, bi->handle);
+		headroom = rx_ring->rx_buf_hr;
 
-		/* Refresh the desc even if buffer_addrs didn't change
-		 * because each write-back erases this info.
-		 */
-		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma + bi->page_offset);
+		bpool_buff_dma_sync_dev(rx_ring->bpool, bi->handle,
+					headroom, rx_ring->rx_buf_len);
+
+		rx_desc->read.pkt_addr = cpu_to_le64(dma + headroom);
 
 		rx_desc++;
 		bi++;
@@ -1738,78 +1656,6 @@ static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
 	return false;
 }
 
-/**
- * i40e_page_is_reusable - check if any reuse is possible
- * @page: page struct to check
- *
- * A page is not reusable if it was allocated under low memory
- * conditions, or it's not in the same NUMA node as this CPU.
- */
-static inline bool i40e_page_is_reusable(struct page *page)
-{
-	return (page_to_nid(page) == numa_mem_id()) &&
-		!page_is_pfmemalloc(page);
-}
-
-/**
- * i40e_can_reuse_rx_page - Determine if this page can be reused by
- * the adapter for another receive
- *
- * @rx_buffer: buffer containing the page
- *
- * If page is reusable, rx_buffer->page_offset is adjusted to point to
- * an unused region in the page.
- *
- * For small pages, @truesize will be a constant value, half the size
- * of the memory at page.  We'll attempt to alternate between high and
- * low halves of the page, with one half ready for use by the hardware
- * and the other half being consumed by the stack.  We use the page
- * ref count to determine whether the stack has finished consuming the
- * portion of this page that was passed up with a previous packet.  If
- * the page ref count is >1, we'll assume the "other" half page is
- * still busy, and this page cannot be reused.
- *
- * For larger pages, @truesize will be the actual space used by the
- * received packet (adjusted upward to an even multiple of the cache
- * line size).  This will advance through the page by the amount
- * actually consumed by the received packets while there is still
- * space for a buffer.  Each region of larger pages will be used at
- * most once, after which the page will not be reused.
- *
- * In either case, if the page is reusable its refcount is increased.
- **/
-static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer *rx_buffer)
-{
-	unsigned int pagecnt_bias = rx_buffer->pagecnt_bias;
-	struct page *page = rx_buffer->page;
-
-	/* Is any reuse possible? */
-	if (unlikely(!i40e_page_is_reusable(page)))
-		return false;
-
-#if (PAGE_SIZE < 8192)
-	/* if we are only owner of page we can reuse it */
-	if (unlikely((page_count(page) - pagecnt_bias) > 1))
-		return false;
-#else
-#define I40E_LAST_OFFSET \
-	(SKB_WITH_OVERHEAD(PAGE_SIZE) - I40E_RXBUFFER_2048)
-	if (rx_buffer->page_offset > I40E_LAST_OFFSET)
-		return false;
-#endif
-
-	/* If we have drained the page fragment pool we need to update
-	 * the pagecnt_bias and page count so that we fully restock the
-	 * number of references the driver holds.
-	 */
-	if (unlikely(pagecnt_bias == 1)) {
-		page_ref_add(page, USHRT_MAX - 1);
-		rx_buffer->pagecnt_bias = USHRT_MAX;
-	}
-
-	return true;
-}
-
 /**
  * i40e_add_rx_frag - Add contents of Rx buffer to sk_buff
  * @rx_ring: rx descriptor ring to transact packets on
@@ -1823,25 +1669,24 @@ static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer *rx_buffer)
  * The function will then update the page offset.
  **/
 static void i40e_add_rx_frag(struct i40e_ring *rx_ring,
-			     struct i40e_rx_buffer *rx_buffer,
 			     struct sk_buff *skb,
-			     unsigned int size)
+			     unsigned long handle,
+			     unsigned int size,
+			     unsigned int headroom)
 {
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(size + i40e_rx_offset(rx_ring));
-#endif
+	unsigned int truesize = bpool_buff_truesize(rx_ring->bpool);
+	unsigned int pg_off;
+	struct page *pg;
+	int err;
 
-	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
-			rx_buffer->page_offset, size, truesize);
+	err = bpool_buff_convert_to_page(rx_ring->bpool, handle, &pg, &pg_off);
+	if (err) {
+		bpool_free(rx_ring->bpool, handle);
+		return;
+	}
 
-	/* page is being used so we must update the page offset */
-#if (PAGE_SIZE < 8192)
-	rx_buffer->page_offset ^= truesize;
-#else
-	rx_buffer->page_offset += truesize;
-#endif
+	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, pg, pg_off + headroom,
+			size, truesize);
 }
 
 /**
@@ -1853,22 +1698,16 @@ static void i40e_add_rx_frag(struct i40e_ring *rx_ring,
  * for use by the CPU.
  */
 static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring *rx_ring,
-						 const unsigned int size)
+						 unsigned long *handle,
+						 const unsigned int size,
+						 unsigned int *headroom)
 {
 	struct i40e_rx_buffer *rx_buffer;
 
 	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
-	prefetchw(rx_buffer->page);
-
-	/* we are reusing so sync this buffer for CPU use */
-	dma_sync_single_range_for_cpu(rx_ring->dev,
-				      rx_buffer->dma,
-				      rx_buffer->page_offset,
-				      size,
-				      DMA_FROM_DEVICE);
-
-	/* We have pulled a buffer for use, so decrement pagecnt_bias */
-	rx_buffer->pagecnt_bias--;
+	*handle = rx_buffer->handle;
+	*headroom = rx_ring->rx_buf_hr;
+	bpool_buff_dma_sync_cpu(rx_ring->bpool, *handle, *headroom, size);
 
 	return rx_buffer;
 }
@@ -1884,56 +1723,56 @@ static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring *rx_ring,
  * skb correctly.
  */
 static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
-					  struct i40e_rx_buffer *rx_buffer,
-					  struct xdp_buff *xdp)
+					  unsigned long handle,
+					  unsigned int size,
+					  unsigned int headroom)
 {
-	unsigned int size = xdp->data_end - xdp->data;
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(size);
-#endif
-	unsigned int headlen;
+	unsigned int truesize = bpool_buff_truesize(rx_ring->bpool);
+	unsigned int pg_off, headlen;
 	struct sk_buff *skb;
+	struct page *pg;
+	void *data;
+	int err;
 
+	data = bpool_buff_ptr(rx_ring->bpool, handle) + headroom;
 	/* prefetch first cache line of first page */
-	prefetch(xdp->data);
+	prefetch(data);
 #if L1_CACHE_BYTES < 128
-	prefetch(xdp->data + L1_CACHE_BYTES);
+	prefetch(data + L1_CACHE_BYTES);
 #endif
 
 	/* allocate a skb to store the frags */
 	skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
 			       I40E_RX_HDR_SIZE,
 			       GFP_ATOMIC | __GFP_NOWARN);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		bpool_free(rx_ring->bpool, handle);
 		return NULL;
+	}
 
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > I40E_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, I40E_RX_HDR_SIZE);
+		headlen = eth_get_headlen(data, I40E_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
-	memcpy(__skb_put(skb, headlen), xdp->data,
-	       ALIGN(headlen, sizeof(long)));
+	memcpy(__skb_put(skb, headlen), data, ALIGN(headlen, sizeof(long)));
 
 	/* update all of the pointers */
 	size -= headlen;
 	if (size) {
-		skb_add_rx_frag(skb, 0, rx_buffer->page,
-				rx_buffer->page_offset + headlen,
-				size, truesize);
-
-		/* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
-		rx_buffer->page_offset ^= truesize;
-#else
-		rx_buffer->page_offset += truesize;
-#endif
+		err = bpool_buff_convert_to_page(rx_ring->bpool, handle, &pg,
+						 &pg_off);
+		if (err) {
+			dev_kfree_skb(skb);
+			bpool_free(rx_ring->bpool, handle);
+			return NULL;
+		}
+
+		skb_add_rx_frag(skb, 0, pg, pg_off + headroom + headlen, size,
+				truesize);
 	} else {
-		/* buffer is unused, reset bias back to rx_buffer */
-		rx_buffer->pagecnt_bias++;
+		bpool_free(rx_ring->bpool, handle);
 	}
 
 	return skb;
@@ -1949,70 +1788,45 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
  * to set up the skb correctly and avoid any memcpy overhead.
  */
 static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
-				      struct i40e_rx_buffer *rx_buffer,
-				      struct xdp_buff *xdp)
+				      unsigned long handle,
+				      unsigned int size,
+				      unsigned int headroom)
 {
-	unsigned int size = xdp->data_end - xdp->data;
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
-				SKB_DATA_ALIGN(I40E_SKB_PAD + size);
-#endif
+	void *data, *data_hard_start;
 	struct sk_buff *skb;
+	unsigned int frag_size, pg_off;
+	struct page *pg;
+	int err;
+
+	err = bpool_buff_convert_to_page(rx_ring->bpool, handle, &pg, &pg_off);
+	if (err) {
+		bpool_free(rx_ring->bpool, handle);
+		return NULL;
+	}
 
+	frag_size = bpool_total_buff_size(rx_ring->bpool) +
+		    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	data_hard_start = page_address(pg) + pg_off;
+	data = data_hard_start + headroom;
 	/* prefetch first cache line of first page */
-	prefetch(xdp->data);
+	prefetch(data);
 #if L1_CACHE_BYTES < 128
-	prefetch(xdp->data + L1_CACHE_BYTES);
+	prefetch(data + L1_CACHE_BYTES);
 #endif
 	/* build an skb around the page buffer */
-	skb = build_skb(xdp->data_hard_start, truesize);
-	if (unlikely(!skb))
+	skb = build_skb(data_hard_start, frag_size);
+	if (unlikely(!skb)) {
+		page_frag_free(data);
 		return NULL;
+	}
 
 	/* update pointers within the skb to store the data */
-	skb_reserve(skb, I40E_SKB_PAD);
+	skb_reserve(skb, headroom);
 	__skb_put(skb, size);
 
-	/* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
-	rx_buffer->page_offset ^= truesize;
-#else
-	rx_buffer->page_offset += truesize;
-#endif
-
 	return skb;
 }
 
-/**
- * i40e_put_rx_buffer - Clean up used buffer and either recycle or free
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
- *
- * This function will clean up the contents of the rx_buffer.  It will
- * either recycle the bufer or unmap it and free the associated resources.
- */
-static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
-			       struct i40e_rx_buffer *rx_buffer)
-{
-	if (i40e_can_reuse_rx_page(rx_buffer)) {
-		/* hand second half of page back to the ring */
-		i40e_reuse_rx_page(rx_ring, rx_buffer);
-		rx_ring->rx_stats.page_reuse_count++;
-	} else {
-		/* we are not reusing the buffer so unmap it */
-		dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
-		__page_frag_cache_drain(rx_buffer->page,
-					rx_buffer->pagecnt_bias);
-	}
-
-	/* clear contents of buffer_info */
-	rx_buffer->page = NULL;
-}
-
 /**
  * i40e_is_non_eop - process handling of non-EOP buffers
  * @rx_ring: Rx ring being processed
@@ -2053,17 +1867,43 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
 			      struct i40e_ring *xdp_ring);
 
+static int i40e_xdp_buff_convert_page(struct i40e_ring *rx_ring,
+				      struct xdp_buff *xdp,
+				      unsigned long handle,
+				      unsigned int size,
+				      unsigned int headroom)
+{
+	unsigned int pg_off;
+	struct page *pg;
+	int err;
+
+	err = bpool_buff_convert_to_page(rx_ring->bpool, handle, &pg, &pg_off);
+	if (err)
+		return err;
+
+	xdp->data_hard_start = page_address(pg) + pg_off;
+	xdp->data = xdp->data_hard_start + headroom;
+	xdp_set_data_meta_invalid(xdp);
+	xdp->data_end = xdp->data + size;
+	xdp->rxq = &rx_ring->xdp_rxq;
+
+	return 0;
+}
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
  * @xdp: XDP buffer containing the frame
  **/
 static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
-				    struct xdp_buff *xdp)
+				    unsigned long handle,
+				    unsigned int *size,
+				    unsigned int *headroom)
 {
 	int err, result = I40E_XDP_PASS;
 	struct i40e_ring *xdp_ring;
 	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
 	u32 act;
 
 	rcu_read_lock();
@@ -2072,20 +1912,47 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	xdp.data_hard_start = bpool_buff_ptr(rx_ring->bpool, handle);
+	xdp.data = xdp.data_hard_start + *headroom;
+	xdp_set_data_meta_invalid(&xdp);
+	xdp.data_end = xdp.data + *size;
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	*headroom = xdp.data - xdp.data_hard_start;
+	*size = xdp.data_end - xdp.data;
+
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
+		err = i40e_xdp_buff_convert_page(rx_ring, &xdp, handle, *size,
+						 *headroom);
+		if (err) {
+			result = I40E_XDP_CONSUMED;
+			break;
+		}
+
 		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-		result = i40e_xmit_xdp_ring(xdp, xdp_ring);
+		result = i40e_xmit_xdp_ring(&xdp, xdp_ring);
+		if (result == I40E_XDP_CONSUMED) {
+			page_frag_free(xdp.data);
+			result = I40E_XDP_TX; /* Hmm, here we bump the tail unnecessary, but better flow... */
+		}
 		break;
 	case XDP_REDIRECT:
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			result = I40E_XDP_TX;
-		else
+		err = i40e_xdp_buff_convert_page(rx_ring, &xdp, handle, *size,
+						 *headroom);
+		if (err) {
 			result = I40E_XDP_CONSUMED;
+			break;
+		}
+
+		err = xdp_do_redirect(rx_ring->netdev, &xdp, xdp_prog);
+		result = I40E_XDP_TX;
+		if (err)
+			page_frag_free(xdp.data);
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -2101,27 +1968,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	return ERR_PTR(-result);
 }
 
-/**
- * i40e_rx_buffer_flip - adjusted rx_buffer to point to an unused region
- * @rx_ring: Rx ring
- * @rx_buffer: Rx buffer to adjust
- * @size: Size of adjustment
- **/
-static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
-				struct i40e_rx_buffer *rx_buffer,
-				unsigned int size)
-{
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
-
-	rx_buffer->page_offset ^= truesize;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(i40e_rx_offset(rx_ring) + size);
-
-	rx_buffer->page_offset += truesize;
-#endif
-}
-
 static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 {
 	/* Force memory writes to complete before letting h/w
@@ -2150,14 +1996,12 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	struct sk_buff *skb = rx_ring->skb;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
 	bool failure = false, xdp_xmit = false;
-	struct xdp_buff xdp;
-
-	xdp.rxq = &rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		struct i40e_rx_buffer *rx_buffer;
 		union i40e_rx_desc *rx_desc;
-		unsigned int size;
+		unsigned int size, headroom;
+		unsigned long handle;
 		u16 vlan_tag;
 		u8 rx_ptype;
 		u64 qword;
@@ -2195,45 +2039,35 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 			break;
 
 		i40e_trace(clean_rx_irq, rx_ring, rx_desc, skb);
-		rx_buffer = i40e_get_rx_buffer(rx_ring, size);
+		rx_buffer = i40e_get_rx_buffer(rx_ring, &handle, size,
+					       &headroom);
 
 		/* retrieve a buffer from the ring */
-		if (!skb) {
-			xdp.data = page_address(rx_buffer->page) +
-				   rx_buffer->page_offset;
-			xdp_set_data_meta_invalid(&xdp);
-			xdp.data_hard_start = xdp.data -
-					      i40e_rx_offset(rx_ring);
-			xdp.data_end = xdp.data + size;
-
-			skb = i40e_run_xdp(rx_ring, &xdp);
-		}
+		if (!skb)
+			skb = i40e_run_xdp(rx_ring, handle, &size, &headroom);
 
 		if (IS_ERR(skb)) {
-			if (PTR_ERR(skb) == -I40E_XDP_TX) {
+			if (PTR_ERR(skb) == -I40E_XDP_TX)
 				xdp_xmit = true;
-				i40e_rx_buffer_flip(rx_ring, rx_buffer, size);
-			} else {
-				rx_buffer->pagecnt_bias++;
-			}
+			else
+				bpool_free(rx_ring->bpool, handle);
 			total_rx_bytes += size;
 			total_rx_packets++;
 		} else if (skb) {
-			i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
+			i40e_add_rx_frag(rx_ring, skb, handle, size, headroom);
 		} else if (ring_uses_build_skb(rx_ring)) {
-			skb = i40e_build_skb(rx_ring, rx_buffer, &xdp);
+			skb = i40e_build_skb(rx_ring, handle, size, headroom);
 		} else {
-			skb = i40e_construct_skb(rx_ring, rx_buffer, &xdp);
+			skb = i40e_construct_skb(rx_ring, handle, size,
+						 headroom);
 		}
 
+		rx_buffer->handle = 0;
+
 		/* exit if we failed to retrieve a buffer */
-		if (!skb) {
-			rx_ring->rx_stats.alloc_buff_failed++;
-			rx_buffer->pagecnt_bias++;
+		if (!skb)
 			break;
-		}
 
-		i40e_put_rx_buffer(rx_ring, rx_buffer);
 		cleaned_count++;
 
 		if (i40e_is_non_eop(rx_ring, rx_desc, skb))
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index d149ebb8330c..d8345265db1e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -140,58 +140,6 @@ enum i40e_dyn_idx_t {
 #define I40E_RX_DMA_ATTR \
 	(DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
 
-/* Attempt to maximize the headroom available for incoming frames.  We
- * use a 2K buffer for receives and need 1536/1534 to store the data for
- * the frame.  This leaves us with 512 bytes of room.  From that we need
- * to deduct the space needed for the shared info and the padding needed
- * to IP align the frame.
- *
- * Note: For cache line sizes 256 or larger this value is going to end
- *	 up negative.  In these cases we should fall back to the legacy
- *	 receive path.
- */
-#if (PAGE_SIZE < 8192)
-#define I40E_2K_TOO_SMALL_WITH_PADDING \
-((NET_SKB_PAD + I40E_RXBUFFER_1536) > SKB_WITH_OVERHEAD(I40E_RXBUFFER_2048))
-
-static inline int i40e_compute_pad(int rx_buf_len)
-{
-	int page_size, pad_size;
-
-	page_size = ALIGN(rx_buf_len, PAGE_SIZE / 2);
-	pad_size = SKB_WITH_OVERHEAD(page_size) - rx_buf_len;
-
-	return pad_size;
-}
-
-static inline int i40e_skb_pad(void)
-{
-	int rx_buf_len;
-
-	/* If a 2K buffer cannot handle a standard Ethernet frame then
-	 * optimize padding for a 3K buffer instead of a 1.5K buffer.
-	 *
-	 * For a 3K buffer we need to add enough padding to allow for
-	 * tailroom due to NET_IP_ALIGN possibly shifting us out of
-	 * cache-line alignment.
-	 */
-	if (I40E_2K_TOO_SMALL_WITH_PADDING)
-		rx_buf_len = I40E_RXBUFFER_3072 + SKB_DATA_ALIGN(NET_IP_ALIGN);
-	else
-		rx_buf_len = I40E_RXBUFFER_1536;
-
-	/* if needed make room for NET_IP_ALIGN */
-	rx_buf_len -= NET_IP_ALIGN;
-
-	return i40e_compute_pad(rx_buf_len);
-}
-
-#define I40E_SKB_PAD i40e_skb_pad()
-#else
-#define I40E_2K_TOO_SMALL_WITH_PADDING false
-#define I40E_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-#endif
-
 /**
  * i40e_test_staterr - tests bits in Rx descriptor status and error fields
  * @rx_desc: pointer to receive descriptor (in le64 format)
@@ -312,14 +260,7 @@ struct i40e_tx_buffer {
 };
 
 struct i40e_rx_buffer {
-	dma_addr_t dma;
-	struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
-#else
-	__u16 page_offset;
-#endif
-	__u16 pagecnt_bias;
+	unsigned long handle;
 };
 
 struct i40e_queue_stats {
@@ -387,6 +328,7 @@ struct i40e_ring {
 
 	u16 count;			/* Number of descriptors */
 	u16 reg_idx;			/* HW register index of the ring */
+	u16 rx_buf_hr;
 	u16 rx_buf_len;
 
 	/* used in interrupt processing */
@@ -420,7 +362,6 @@ struct i40e_ring {
 	struct i40e_q_vector *q_vector;	/* Backreference to associated vector */
 
 	struct rcu_head rcu;		/* to avoid race on free */
-	u16 next_to_alloc;
 	struct sk_buff *skb;		/* When i40e_clean_rx_ring_irq() must
 					 * return before it sees the EOP for
 					 * the current packet, we save that skb
@@ -432,6 +373,7 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+	struct buff_pool *bpool;
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 14/24] i40e: implemented page recycling buff_pool
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (12 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 13/24] i40e: introduce external allocator support Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 15/24] i40e: start using " Björn Töpel
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Added a buff_pool implementation that does page recycling.
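
For illustration only, a rough sketch of how an Rx ring is expected to
use the recycling pool; the function name below is made up and error
handling is trimmed:

  static int example_rx_ring_setup(struct i40e_ring *rx_ring, unsigned int mtu)
  {
          unsigned long handle;

          /* one pool per Rx ring, sized to the descriptor count
           * (must be a power of 2)
           */
          rx_ring->bpool = i40e_buff_pool_recycle_create(mtu, true,
                                                         rx_ring->dev,
                                                         rx_ring->count);
          if (!rx_ring->bpool)
                  return -ENOMEM;

          /* buffers are referenced by opaque handles */
          if (bpool_alloc(rx_ring->bpool, &handle))
                  return -ENOMEM;

          /* bpool_buff_dma(rx_ring->bpool, handle) gives the address
           * that goes into the hardware Rx descriptor
           */

          /* freeing the handle puts the page back on the pool's internal
           * ring for reuse instead of releasing it
           */
          bpool_free(rx_ring->bpool, handle);

          return 0;
  }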

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/buff_pool.c | 385 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/buff_pool.h |   6 +
 2 files changed, 391 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/buff_pool.c b/drivers/net/ethernet/intel/i40e/buff_pool.c
index 8c51f61ca71d..42b6cf5042e9 100644
--- a/drivers/net/ethernet/intel/i40e/buff_pool.c
+++ b/drivers/net/ethernet/intel/i40e/buff_pool.c
@@ -283,3 +283,388 @@ void i40e_buff_pool_destroy(struct buff_pool *pool)
 	kfree(pool);
 }
 
+/* Recycling allocator */
+
+struct i40e_bpr_header {
+	dma_addr_t dma;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+	__u32 page_offset;
+#else
+	__u16 page_offset;
+#endif
+	__u16 pagecnt_bias;
+};
+
+struct i40e_bpr_pool {
+	unsigned int buff_tot_len;
+	unsigned int buff_len;
+	unsigned int headroom;
+	unsigned int pg_order;
+	unsigned int pg_size;
+	struct device *dev;
+	unsigned int head;
+	unsigned int tail;
+	unsigned int buffs_size_mask;
+	struct i40e_bpr_header *buffs[0];
+};
+
+#define I40E_BPRHDR_ALIGNED_SIZE ALIGN(sizeof(struct i40e_bpr_header),	\
+				       SMP_CACHE_BYTES)
+
+static int i40e_bpr_alloc(void *pool, unsigned long *handle)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+	struct i40e_bpr_header *hdr;
+	struct page *pg;
+	dma_addr_t dma;
+
+	if (impl->head != impl->tail) {
+		*handle = (unsigned long)impl->buffs[impl->head];
+		impl->head = (impl->head + 1) & impl->buffs_size_mask;
+
+		return 0;
+	}
+
+	pg = dev_alloc_pages(impl->pg_order);
+	if (unlikely(!pg))
+		return -ENOMEM;
+
+	dma = dma_map_page_attrs(impl->dev, pg, 0, impl->pg_size,
+				 DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+
+	if (dma_mapping_error(impl->dev, dma)) {
+		__free_pages(pg, impl->pg_order);
+		return -ENOMEM;
+	}
+
+	hdr = (struct i40e_bpr_header *)page_address(pg);
+	hdr->dma = dma;
+	hdr->page_offset = I40E_BPRHDR_ALIGNED_SIZE;
+	hdr->pagecnt_bias = 1;
+
+	*handle = (unsigned long)hdr;
+
+	return 0;
+}
+
+static void i40e_bpr_free(void *pool, unsigned long handle)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+	struct i40e_bpr_header *hdr;
+	unsigned int tail;
+
+	hdr = (struct i40e_bpr_header *)handle;
+	tail = (impl->tail + 1) & impl->buffs_size_mask;
+	/* Is full? */
+	if (tail == impl->head) {
+		dma_unmap_page_attrs(impl->dev, hdr->dma, impl->pg_size,
+				     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+		__page_frag_cache_drain(virt_to_head_page(hdr),
+					hdr->pagecnt_bias);
+	}
+
+	impl->buffs[impl->tail] = hdr;
+	impl->tail = tail;
+}
+
+static unsigned int i40e_bpr_buff_size(void *pool)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+
+	return impl->buff_len;
+}
+
+static unsigned int i40e_bpr_total_buff_size(void *pool)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+
+	return impl->buff_tot_len;
+}
+
+static unsigned int i40e_bpr_buff_headroom(void *pool)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+
+	return impl->headroom;
+}
+
+static unsigned int i40e_bpr_buff_truesize(void *pool)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+
+	return impl->buff_tot_len;
+}
+
+static void *i40e_bpr_buff_ptr(void *pool, unsigned long handle)
+{
+	struct i40e_bpr_header *hdr;
+
+	hdr = (struct i40e_bpr_header *)handle;
+
+	return ((void *)hdr) + hdr->page_offset;
+}
+
+static bool i40e_page_is_reusable(struct page *page)
+{
+	return (page_to_nid(page) == numa_mem_id()) &&
+		!page_is_pfmemalloc(page);
+}
+
+static bool i40e_can_reuse_page(struct i40e_bpr_header *hdr)
+{
+	unsigned int pagecnt_bias = hdr->pagecnt_bias;
+	struct page *page = virt_to_head_page(hdr);
+
+	if (unlikely(!i40e_page_is_reusable(page)))
+		return false;
+
+#if (PAGE_SIZE < 8192)
+	if (unlikely((page_count(page) - pagecnt_bias) > 1))
+		return false;
+#else
+#define I40E_LAST_OFFSET \
+	(PAGE_SIZE - I40E_RXBUFFER_3072 - I40E_BPRHDR_ALIGNED_SIZE)
+	if (hdr->page_offset > I40E_LAST_OFFSET)
+		return false;
+#endif
+
+	if (unlikely(!pagecnt_bias)) {
+		page_ref_add(page, USHRT_MAX);
+		hdr->pagecnt_bias = USHRT_MAX;
+	}
+
+	return true;
+}
+
+static int i40e_bpr_buff_convert_to_page(void *pool, unsigned long handle,
+					 struct page **pg,
+					 unsigned int *pg_off)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+	struct i40e_bpr_header *hdr;
+	unsigned int tail;
+
+	hdr = (struct i40e_bpr_header *)handle;
+
+	*pg = virt_to_page(hdr);
+	*pg_off = hdr->page_offset;
+
+#if (PAGE_SIZE < 8192)
+	hdr->page_offset ^= impl->buff_tot_len;
+#else
+	hdr->page_offset += impl->buff_tot_len;
+#endif
+	hdr->pagecnt_bias--;
+
+	tail = (impl->tail + 1) & impl->buffs_size_mask;
+	if (i40e_can_reuse_page(hdr) && tail != impl->head) {
+		impl->buffs[impl->tail] = hdr;
+		impl->tail = tail;
+
+		return 0;
+	}
+
+	dma_unmap_page_attrs(impl->dev, hdr->dma, impl->pg_size,
+			     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+	__page_frag_cache_drain(*pg, hdr->pagecnt_bias);
+	return 0;
+}
+
+static dma_addr_t i40e_bpr_buff_dma(void *pool,
+				    unsigned long handle)
+{
+	struct i40e_bpr_header *hdr;
+
+	hdr = (struct i40e_bpr_header *)handle;
+
+	return hdr->dma + hdr->page_offset;
+}
+
+static void i40e_bpr_buff_dma_sync_cpu(void *pool,
+				       unsigned long handle,
+				       unsigned int off,
+				       unsigned int size)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+	dma_addr_t dma;
+
+	dma = i40e_bpr_buff_dma(pool, handle);
+	dma_sync_single_range_for_cpu(impl->dev, dma, off, size,
+				      DMA_FROM_DEVICE);
+}
+
+static void i40e_bpr_buff_dma_sync_dev(void *pool,
+				       unsigned long handle,
+				       unsigned int off,
+				       unsigned int size)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+	dma_addr_t dma;
+
+	dma = i40e_bpr_buff_dma(pool, handle);
+	dma_sync_single_range_for_device(impl->dev, dma, off, size,
+					 DMA_FROM_DEVICE);
+}
+
+static void calc_buffer_size_less_8192(unsigned int mtu, bool reserve_headroom,
+				       unsigned int *buff_tot_len,
+				       unsigned int *buff_len,
+				       unsigned int *headroom,
+				       unsigned int *pg_order)
+{
+	*pg_order = 0;
+
+	if (!reserve_headroom) {
+		*buff_tot_len = (PAGE_SIZE - I40E_BPRHDR_ALIGNED_SIZE) / 2;
+		*buff_len = *buff_tot_len;
+		*headroom = 0;
+
+		return;
+	}
+
+	/* We're relying on page flipping, so make sure that a page
+	 * (with the buff header removed) / 2 is large enough.
+	 */
+	*buff_tot_len = (PAGE_SIZE - I40E_BPRHDR_ALIGNED_SIZE) / 2;
+	if ((NET_SKB_PAD + I40E_RXBUFFER_1536) <=
+	    SKB_WITH_OVERHEAD(*buff_tot_len) && mtu <= ETH_DATA_LEN) {
+		*buff_len = I40E_RXBUFFER_1536;
+		*headroom = SKB_WITH_OVERHEAD(*buff_tot_len) - *buff_len;
+
+		return;
+	}
+
+	*pg_order = 1;
+	*buff_tot_len = ((PAGE_SIZE << 1) - I40E_BPRHDR_ALIGNED_SIZE) / 2;
+	*buff_len = I40E_RXBUFFER_3072;
+	*headroom = SKB_WITH_OVERHEAD(*buff_tot_len) - *buff_len;
+}
+
+static void calc_buffer_size_greater_8192(bool reserve_headroom,
+					  unsigned int *buff_tot_len,
+					  unsigned int *buff_len,
+					  unsigned int *headroom,
+					  unsigned int *pg_order)
+{
+	*pg_order = 0;
+
+	if (!reserve_headroom) {
+		*buff_tot_len = I40E_RXBUFFER_2048;
+		*buff_len = I40E_RXBUFFER_2048;
+		*headroom = 0;
+
+		return;
+	}
+
+	*buff_tot_len = I40E_RXBUFFER_3072;
+	*buff_len = SKB_WITH_OVERHEAD(*buff_tot_len) - NET_SKB_PAD;
+	*buff_len = (*buff_len / 128) * 128; /* 128B align */
+	*headroom = *buff_tot_len - *buff_len;
+}
+
+static void calc_buffer_size(unsigned int mtu, bool reserve_headroom,
+			     unsigned int *buff_tot_len,
+			     unsigned int *buff_len,
+			     unsigned int *headroom,
+			     unsigned int *pg_order)
+{
+	if (PAGE_SIZE < 8192) {
+		calc_buffer_size_less_8192(mtu, reserve_headroom,
+					   buff_tot_len,
+					   buff_len,
+					   headroom,
+					   pg_order);
+
+		return;
+	}
+
+	calc_buffer_size_greater_8192(reserve_headroom, buff_tot_len,
+				      buff_len, headroom, pg_order);
+}
+
+struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
+						bool reserve_headroom,
+						struct device *dev,
+						unsigned int pool_size)
+{
+	struct buff_pool_ops *pool_ops;
+	struct i40e_bpr_pool *impl;
+	struct buff_pool *pool;
+
+	if (!is_power_of_2(pool_size)) {
+		pr_err("%s pool_size (%u) is not power of 2\n", __func__, pool_size);
+
+		return NULL;
+	}
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	pool_ops = kzalloc(sizeof(*pool_ops), GFP_KERNEL);
+	if (!pool_ops) {
+		kfree(pool);
+		return NULL;
+	}
+
+	impl = kzalloc(sizeof(*impl) +
+		       pool_size * sizeof(struct i40e_bpr_header *),
+		       GFP_KERNEL);
+	if (!impl) {
+		kfree(pool_ops);
+		kfree(pool);
+		return NULL;
+	}
+
+	calc_buffer_size(mtu, reserve_headroom,
+			 &impl->buff_tot_len,
+			 &impl->buff_len,
+			 &impl->headroom,
+			 &impl->pg_order);
+
+	impl->buffs_size_mask = pool_size - 1;
+	impl->dev = dev;
+	impl->pg_size = PAGE_SIZE << impl->pg_order;
+
+	pool_ops->alloc = i40e_bpr_alloc;
+	pool_ops->free = i40e_bpr_free;
+	pool_ops->buff_size = i40e_bpr_buff_size;
+	pool_ops->total_buff_size = i40e_bpr_total_buff_size;
+	pool_ops->buff_headroom = i40e_bpr_buff_headroom;
+	pool_ops->buff_truesize = i40e_bpr_buff_truesize;
+	pool_ops->buff_ptr = i40e_bpr_buff_ptr;
+	pool_ops->buff_convert_to_page = i40e_bpr_buff_convert_to_page;
+	pool_ops->buff_dma = i40e_bpr_buff_dma;
+	pool_ops->buff_dma_sync_cpu = i40e_bpr_buff_dma_sync_cpu;
+	pool_ops->buff_dma_sync_dev = i40e_bpr_buff_dma_sync_dev;
+
+	pr_err("%s mtu=%u reserve=%d pool_size=%u buff_tot_len=%u buff_len=%u headroom=%u pg_order=%u pf_size=%u\n",
+	       __func__,
+	       mtu, (int)reserve_headroom, pool_size, impl->buff_tot_len,
+	       impl->buff_len, impl->headroom, impl->pg_order, impl->pg_size);
+
+	pool->pool = impl;
+	pool->ops = pool_ops;
+
+	return pool;
+}
+
+void i40e_buff_pool_recycle_destroy(struct buff_pool *pool)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool->pool;
+	struct i40e_bpr_header *hdr;
+
+	while (impl->head != impl->tail) {
+		hdr = impl->buffs[impl->head];
+		dma_unmap_page_attrs(impl->dev, hdr->dma, impl->pg_size,
+				     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+		__page_frag_cache_drain(virt_to_head_page(hdr),
+					hdr->pagecnt_bias);
+		impl->head = (impl->head + 1) & impl->buffs_size_mask;
+	}
+
+	kfree(pool->ops);
+	kfree(pool->pool);
+	kfree(pool);
+}
+
diff --git a/drivers/net/ethernet/intel/i40e/buff_pool.h b/drivers/net/ethernet/intel/i40e/buff_pool.h
index 933881e14ac0..03897f5ebbff 100644
--- a/drivers/net/ethernet/intel/i40e/buff_pool.h
+++ b/drivers/net/ethernet/intel/i40e/buff_pool.h
@@ -66,5 +66,11 @@ void bpool_buff_dma_sync_dev(struct buff_pool *pool,
 struct buff_pool *i40e_buff_pool_create(struct device *dev);
 void i40e_buff_pool_destroy(struct buff_pool *pool);
 
+struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
+						bool reserve_headroom,
+						struct device *dev,
+						unsigned int pool_size);
+void i40e_buff_pool_recycle_destroy(struct buff_pool *pool);
+
 #endif /* BUFF_POOL_H_ */
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 15/24] i40e: start using recycling buff_pool
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (13 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 14/24] i40e: implemented page recycling buff_pool Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 16/24] i40e: separated buff_pool interface from i40e implementation Björn Töpel
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Here, we start using the newly added buff_pool implementation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 14 +++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  2 +-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7e82b7c6c0b7..79e48840a6bd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3211,6 +3211,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	u16 pf_q = vsi->base_queue + ring->queue_index;
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
+	bool reserve_headroom;
+	unsigned int mtu = 0;
 	i40e_status err = 0;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
@@ -3218,7 +3220,17 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->bpool = i40e_buff_pool_create(ring->dev);
+	reserve_headroom = !vsi->netdev;
+
+	if (vsi->netdev) {
+		mtu = vsi->netdev->mtu;
+		reserve_headroom = !(vsi->back->flags & I40E_FLAG_LEGACY_RX);
+	} else {
+		reserve_headroom = false;
+	}
+	ring->bpool = i40e_buff_pool_recycle_create(mtu, reserve_headroom,
+						    ring->dev,
+						    ring->count);
 	ring->rx_buf_hr = (u16)bpool_buff_headroom(ring->bpool);
 	ring->rx_buf_len = (u16)bpool_buff_size(ring->bpool);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index aa29013acf0c..757cda5ac889 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1255,7 +1255,7 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
 
-	i40e_buff_pool_destroy(rx_ring->bpool);
+	i40e_buff_pool_recycle_destroy(rx_ring->bpool);
 	rx_ring->bpool = NULL;
 
 	if (rx_ring->desc) {
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 16/24] i40e: separated buff_pool interface from i40e implementation
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (14 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 15/24] i40e: start using " Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 17/24] xsk: introduce xsk_buff_pool Björn Töpel
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Moved the buff_pool interface to include/linux, so that buff_pool
implementations can be done outside of the i40e module.
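
As an illustration (not part of the patch), this is roughly the shape a
buff_pool implementation outside of i40e now takes; "my_pool" and the
"my_*" callbacks are hypothetical names, and only two of the ops are
spelled out:

  #include <linux/buff_pool.h>

  struct my_pool {
          struct device *dev;
          /* implementation-private state */
  };

  static int my_alloc(void *pool, unsigned long *handle)
  {
          /* hand out an implementation-specific buffer handle */
          return -ENOMEM;         /* stubbed out in this sketch */
  }

  static void my_destroy(void *pool)
  {
          kfree(pool);
  }

  struct buff_pool *my_pool_create(struct device *dev)
  {
          struct buff_pool_ops *ops;
          struct buff_pool *pool;
          struct my_pool *impl;

          pool = kzalloc(sizeof(*pool), GFP_KERNEL);
          ops = kzalloc(sizeof(*ops), GFP_KERNEL);
          impl = kzalloc(sizeof(*impl), GFP_KERNEL);
          if (!pool || !ops || !impl) {
                  kfree(impl);
                  kfree(ops);
                  kfree(pool);
                  return NULL;
          }

          impl->dev = dev;

          ops->alloc = my_alloc;          /* ...plus the remaining callbacks */
          ops->destroy = my_destroy;      /* called by bpool_destroy() */

          pool->pool = impl;
          pool->ops = ops;

          return pool;
  }

Drivers keep consuming the pool through the bpool_*() wrappers, so they
do not need to know which implementation is behind it.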

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/Makefile           |   2 +-
 drivers/net/ethernet/intel/i40e/buff_pool.h        |  76 -----------
 .../intel/i40e/{buff_pool.c => i40e_buff_pool.c}   | 148 ++++-----------------
 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h   |  15 +++
 drivers/net/ethernet/intel/i40e/i40e_main.c        |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   5 +-
 include/linux/buff_pool.h                          | 136 +++++++++++++++++++
 7 files changed, 186 insertions(+), 199 deletions(-)
 delete mode 100644 drivers/net/ethernet/intel/i40e/buff_pool.h
 rename drivers/net/ethernet/intel/i40e/{buff_pool.c => i40e_buff_pool.c} (82%)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
 create mode 100644 include/linux/buff_pool.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index bfdf9ce3e7f0..bbd7e2babd97 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -46,6 +46,6 @@ i40e-objs := i40e_main.o \
 	i40e_ptp.o	\
 	i40e_client.o   \
 	i40e_virtchnl_pf.o \
-	buff_pool.o
+	i40e_buff_pool.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/buff_pool.h b/drivers/net/ethernet/intel/i40e/buff_pool.h
deleted file mode 100644
index 03897f5ebbff..000000000000
--- a/drivers/net/ethernet/intel/i40e/buff_pool.h
+++ /dev/null
@@ -1,76 +0,0 @@
-#ifndef BUFF_POOL_H_
-#define BUFF_POOL_H_
-
-#include <linux/types.h>
-
-struct page;
-struct device;
-
-struct buff_pool_ops;
-
-struct buff_pool {
-	void *pool;
-	struct buff_pool_ops *ops;
-};
-
-/* Allocates a new buffer from the pool */
-int bpool_alloc(struct buff_pool *pool, unsigned long *handle);
-
-/* Returns a buffer originating from the pool, back to the pool */
-void bpool_free(struct buff_pool *pool, unsigned long handle);
-
-/* Returns the size of the buffer, w/o headroom. This is what the pool
- * creator passed to the constructor.
- */
-unsigned int bpool_buff_size(struct buff_pool *pool);
-
-/* Returns the size of the buffer, plus additional headroom (if
- * any).
- */
-unsigned int bpool_total_buff_size(struct buff_pool *pool);
-
-/* Returns additional headroom (if any) */
-unsigned int bpool_buff_headroom(struct buff_pool *pool);
-
-/* Returns the truesize (as for skbuff) */
-unsigned int bpool_buff_truesize(struct buff_pool *pool);
-
-/* Returns the kernel virtual address to the handle. */
-void *bpool_buff_ptr(struct buff_pool *pool, unsigned long handle);
-
-/* Converts a handle to a page. After a successful call, the handle is
- * stale and should not be used and should be considered
- * freed. Callers need to manually clean up the returned page (using
- * page_free).
- */
-int bpool_buff_convert_to_page(struct buff_pool *pool, unsigned long handle,
-			       struct page **pg, unsigned int *pg_off);
-
-/* Returns the dma address of a buffer */
-dma_addr_t bpool_buff_dma(struct buff_pool *pool,
-			  unsigned long handle);
-
-/* DMA sync for CPU */
-void bpool_buff_dma_sync_cpu(struct buff_pool *pool,
-			     unsigned long handle,
-			     unsigned int off,
-			     unsigned int size);
-
-/* DMA sync for device */
-void bpool_buff_dma_sync_dev(struct buff_pool *pool,
-			     unsigned long handle,
-			     unsigned int off,
-			     unsigned int size);
-/* ---- */
-
-struct buff_pool *i40e_buff_pool_create(struct device *dev);
-void i40e_buff_pool_destroy(struct buff_pool *pool);
-
-struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
-						bool reserve_headroom,
-						struct device *dev,
-						unsigned int pool_size);
-void i40e_buff_pool_recycle_destroy(struct buff_pool *pool);
-
-#endif /* BUFF_POOL_H_ */
-
diff --git a/drivers/net/ethernet/intel/i40e/buff_pool.c b/drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
similarity index 82%
rename from drivers/net/ethernet/intel/i40e/buff_pool.c
rename to drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
index 42b6cf5042e9..d1e13632b6e4 100644
--- a/drivers/net/ethernet/intel/i40e/buff_pool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
@@ -1,94 +1,10 @@
-#include "buff_pool.h"
+#include "i40e_buff_pool.h"
+
+#include <linux/buff_pool.h>
 
 #include "i40e.h"
 #include "i40e_txrx.h"
 
-struct buff_pool_ops {
-	int (*alloc)(void *pool, unsigned long *handle);
-	void (*free)(void *pool, unsigned long handle);
-	unsigned int (*buff_size)(void *pool);
-	unsigned int (*total_buff_size)(void *pool);
-	unsigned int (*buff_headroom)(void *pool);
-	unsigned int (*buff_truesize)(void *pool);
-	void *(*buff_ptr)(void *pool, unsigned long handle);
-	int (*buff_convert_to_page)(void *pool,
-				    unsigned long handle,
-				    struct page **pg, unsigned int *pg_off);
-	dma_addr_t (*buff_dma)(void *pool,
-			       unsigned long handle);
-	void (*buff_dma_sync_cpu)(void *pool,
-				  unsigned long handle,
-				  unsigned int off,
-				  unsigned int size);
-	void (*buff_dma_sync_dev)(void *pool,
-				  unsigned long handle,
-				  unsigned int off,
-				  unsigned int size);
-};
-
-int bpool_alloc(struct buff_pool *pool, unsigned long *handle)
-{
-	return pool->ops->alloc(pool->pool, handle);
-}
-
-void bpool_free(struct buff_pool *pool, unsigned long handle)
-{
-	pool->ops->free(pool->pool, handle);
-}
-
-unsigned int bpool_buff_size(struct buff_pool *pool)
-{
-	return pool->ops->buff_size(pool->pool);
-}
-
-unsigned int bpool_total_buff_size(struct buff_pool *pool)
-{
-	return pool->ops->total_buff_size(pool->pool);
-}
-
-unsigned int bpool_buff_headroom(struct buff_pool *pool)
-{
-	return pool->ops->buff_headroom(pool->pool);
-}
-
-unsigned int bpool_buff_truesize(struct buff_pool *pool)
-{
-	return pool->ops->buff_truesize(pool->pool);
-}
-
-void *bpool_buff_ptr(struct buff_pool *pool, unsigned long handle)
-{
-	return pool->ops->buff_ptr(pool->pool, handle);
-}
-
-int bpool_buff_convert_to_page(struct buff_pool *pool, unsigned long handle,
-			       struct page **pg, unsigned int *pg_off)
-{
-	return pool->ops->buff_convert_to_page(pool->pool, handle, pg, pg_off);
-}
-
-dma_addr_t bpool_buff_dma(struct buff_pool *pool,
-			  unsigned long handle)
-{
-	return pool->ops->buff_dma(pool->pool, handle);
-}
-
-void bpool_buff_dma_sync_cpu(struct buff_pool *pool,
-			     unsigned long handle,
-			     unsigned int off,
-			     unsigned int size)
-{
-	pool->ops->buff_dma_sync_cpu(pool->pool, handle, off, size);
-}
-
-void bpool_buff_dma_sync_dev(struct buff_pool *pool,
-			     unsigned long handle,
-			     unsigned int off,
-			     unsigned int size)
-{
-	pool->ops->buff_dma_sync_dev(pool->pool, handle, off, size);
-}
-
 /* Naive, non-recycling allocator. */
 
 struct i40e_bp_pool {
@@ -233,6 +149,11 @@ static void i40e_bp_buff_dma_sync_dev(void *pool,
 					 DMA_FROM_DEVICE);
 }
 
+static void i40e_bp_destroy(void *pool)
+{
+	kfree(pool);
+}
+
 struct buff_pool *i40e_buff_pool_create(struct device *dev)
 {
 	struct i40e_bp_pool *pool_impl;
@@ -267,6 +188,7 @@ struct buff_pool *i40e_buff_pool_create(struct device *dev)
 	pool_ops->buff_dma = i40e_bp_buff_dma;
 	pool_ops->buff_dma_sync_cpu = i40e_bp_buff_dma_sync_cpu;
 	pool_ops->buff_dma_sync_dev = i40e_bp_buff_dma_sync_dev;
+	pool_ops->destroy = i40e_bp_destroy;
 
 	pool_impl->dev = dev;
 
@@ -276,13 +198,6 @@ struct buff_pool *i40e_buff_pool_create(struct device *dev)
 	return pool;
 }
 
-void i40e_buff_pool_destroy(struct buff_pool *pool)
-{
-	kfree(pool->ops);
-	kfree(pool->pool);
-	kfree(pool);
-}
-
 /* Recycling allocator */
 
 struct i40e_bpr_header {
@@ -470,8 +385,8 @@ static int i40e_bpr_buff_convert_to_page(void *pool, unsigned long handle,
 	return 0;
 }
 
-static dma_addr_t i40e_bpr_buff_dma(void *pool,
-				    unsigned long handle)
+static inline dma_addr_t i40e_bpr_buff_dma(void *pool,
+					   unsigned long handle)
 {
 	struct i40e_bpr_header *hdr;
 
@@ -582,6 +497,23 @@ static void calc_buffer_size(unsigned int mtu, bool reserve_headroom,
 				      buff_len, headroom, pg_order);
 }
 
+static void i40e_bpr_destroy(void *pool)
+{
+	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool;
+	struct i40e_bpr_header *hdr;
+
+	while (impl->head != impl->tail) {
+		hdr = impl->buffs[impl->head];
+		dma_unmap_page_attrs(impl->dev, hdr->dma, impl->pg_size,
+				     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+		__page_frag_cache_drain(virt_to_head_page(hdr),
+					hdr->pagecnt_bias);
+		impl->head = (impl->head + 1) & impl->buffs_size_mask;
+	}
+
+	kfree(impl);
+}
+
 struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
 						bool reserve_headroom,
 						struct device *dev,
@@ -637,11 +569,7 @@ struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
 	pool_ops->buff_dma = i40e_bpr_buff_dma;
 	pool_ops->buff_dma_sync_cpu = i40e_bpr_buff_dma_sync_cpu;
 	pool_ops->buff_dma_sync_dev = i40e_bpr_buff_dma_sync_dev;
-
-	pr_err("%s mtu=%u reserve=%d pool_size=%u buff_tot_len=%u buff_len=%u headroom=%u pg_order=%u pf_size=%u\n",
-	       __func__,
-	       mtu, (int)reserve_headroom, pool_size, impl->buff_tot_len,
-	       impl->buff_len, impl->headroom, impl->pg_order, impl->pg_size);
+	pool_ops->destroy = i40e_bpr_destroy;
 
 	pool->pool = impl;
 	pool->ops = pool_ops;
@@ -649,22 +577,4 @@ struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
 	return pool;
 }
 
-void i40e_buff_pool_recycle_destroy(struct buff_pool *pool)
-{
-	struct i40e_bpr_pool *impl = (struct i40e_bpr_pool *)pool->pool;
-	struct i40e_bpr_header *hdr;
-
-	while (impl->head != impl->tail) {
-		hdr = impl->buffs[impl->head];
-		dma_unmap_page_attrs(impl->dev, hdr->dma, impl->pg_size,
-				     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
-		__page_frag_cache_drain(virt_to_head_page(hdr),
-					hdr->pagecnt_bias);
-		impl->head = (impl->head + 1) & impl->buffs_size_mask;
-	}
-
-	kfree(pool->ops);
-	kfree(pool->pool);
-	kfree(pool);
-}
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_buff_pool.h b/drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
new file mode 100644
index 000000000000..dddd04680c1a
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
@@ -0,0 +1,15 @@
+#ifndef I40E_BUFF_POOL_H_
+#define I40E_BUFF_POOL_H_
+
+#include <linux/types.h>
+
+struct buff_pool;
+struct device;
+
+struct buff_pool *i40e_buff_pool_create(struct device *dev);
+
+struct buff_pool *i40e_buff_pool_recycle_create(unsigned int mtu,
+						bool reserve_headroom,
+						struct device *dev,
+						unsigned int pool_size);
+#endif /* I40E_BUFF_POOL_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 79e48840a6bd..0e1445af6b01 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -28,6 +28,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <linux/buff_pool.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -39,7 +40,7 @@
  */
 #define CREATE_TRACE_POINTS
 #include "i40e_trace.h"
-#include "buff_pool.h"
+#include "i40e_buff_pool.h"
 
 const char i40e_driver_name[] = "i40e";
 static const char i40e_driver_string[] =
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 757cda5ac889..fffc254abd8c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -27,11 +27,12 @@
 #include <linux/prefetch.h>
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
+#include <linux/buff_pool.h>
 #include <net/xdp.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
-#include "buff_pool.h"
+#include "i40e_buff_pool.h"
 
 static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
 				u32 td_tag)
@@ -1255,7 +1256,7 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
 
-	i40e_buff_pool_recycle_destroy(rx_ring->bpool);
+	bpool_destroy(rx_ring->bpool);
 	rx_ring->bpool = NULL;
 
 	if (rx_ring->desc) {
diff --git a/include/linux/buff_pool.h b/include/linux/buff_pool.h
new file mode 100644
index 000000000000..660ca827f4a6
--- /dev/null
+++ b/include/linux/buff_pool.h
@@ -0,0 +1,136 @@
+#ifndef BUFF_POOL_H_
+#define BUFF_POOL_H_
+
+#include <linux/types.h>
+#include <linux/slab.h>
+
+struct page;
+struct device;
+
+struct buff_pool_ops {
+	int (*alloc)(void *pool, unsigned long *handle);
+	void (*free)(void *pool, unsigned long handle);
+	unsigned int (*buff_size)(void *pool);
+	unsigned int (*total_buff_size)(void *pool);
+	unsigned int (*buff_headroom)(void *pool);
+	unsigned int (*buff_truesize)(void *pool);
+	void *(*buff_ptr)(void *pool, unsigned long handle);
+	int (*buff_convert_to_page)(void *pool,
+				    unsigned long handle,
+				    struct page **pg, unsigned int *pg_off);
+	dma_addr_t (*buff_dma)(void *pool,
+			       unsigned long handle);
+	void (*buff_dma_sync_cpu)(void *pool,
+				  unsigned long handle,
+				  unsigned int off,
+				  unsigned int size);
+	void (*buff_dma_sync_dev)(void *pool,
+				  unsigned long handle,
+				  unsigned int off,
+				  unsigned int size);
+	void (*destroy)(void *pool);
+};
+
+struct buff_pool {
+	void *pool;
+	struct buff_pool_ops *ops;
+};
+
+/* Allocates a new buffer from the pool */
+static inline int bpool_alloc(struct buff_pool *pool, unsigned long *handle)
+{
+	return pool->ops->alloc(pool->pool, handle);
+}
+
+/* Returns a buffer originating from the pool, back to the pool */
+static inline void bpool_free(struct buff_pool *pool, unsigned long handle)
+{
+	pool->ops->free(pool->pool, handle);
+}
+
+/* Returns the size of the buffer, w/o headroom. This is what the pool
+ * creator passed to the constructor.
+ */
+static inline unsigned int bpool_buff_size(struct buff_pool *pool)
+{
+	return pool->ops->buff_size(pool->pool);
+}
+
+/* Returns the size of the buffer, plus additional headroom (if
+ * any).
+ */
+static inline unsigned int bpool_total_buff_size(struct buff_pool *pool)
+{
+	return pool->ops->total_buff_size(pool->pool);
+}
+
+/* Returns additional available headroom (if any) */
+static inline unsigned int bpool_buff_headroom(struct buff_pool *pool)
+{
+	return pool->ops->buff_headroom(pool->pool);
+}
+
+/* Returns the truesize (as for skbuff) */
+static inline unsigned int bpool_buff_truesize(struct buff_pool *pool)
+{
+	return pool->ops->buff_truesize(pool->pool);
+}
+
+/* Returns the kernel virtual address to the handle. */
+static inline void *bpool_buff_ptr(struct buff_pool *pool, unsigned long handle)
+{
+	return pool->ops->buff_ptr(pool->pool, handle);
+}
+
+/* Converts a handle to a page. After a successful call, the handle is
+ * stale and should not be used and should be considered
+ * freed. Callers need to manually clean up the returned page (using
+ * page_free).
+ */
+static inline int bpool_buff_convert_to_page(struct buff_pool *pool,
+					     unsigned long handle,
+					     struct page **pg,
+					     unsigned int *pg_off)
+{
+	return pool->ops->buff_convert_to_page(pool->pool, handle, pg, pg_off);
+}
+
+/* Returns the dma address of a buffer */
+static inline dma_addr_t bpool_buff_dma(struct buff_pool *pool,
+					unsigned long handle)
+{
+	return pool->ops->buff_dma(pool->pool, handle);
+}
+
+/* DMA sync for CPU */
+static inline void bpool_buff_dma_sync_cpu(struct buff_pool *pool,
+					   unsigned long handle,
+					   unsigned int off,
+					   unsigned int size)
+{
+	pool->ops->buff_dma_sync_cpu(pool->pool, handle, off, size);
+}
+
+/* DMA sync for device */
+static inline void bpool_buff_dma_sync_dev(struct buff_pool *pool,
+					   unsigned long handle,
+					   unsigned int off,
+					   unsigned int size)
+{
+	pool->ops->buff_dma_sync_dev(pool->pool, handle, off, size);
+}
+
+/* Destroy pool */
+static inline void bpool_destroy(struct buff_pool *pool)
+{
+	if (!pool)
+		return;
+
+	pool->ops->destroy(pool->pool);
+
+	kfree(pool->ops);
+	kfree(pool);
+}
+
+#endif /* BUFF_POOL_H_ */
+
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 17/24] xsk: introduce xsk_buff_pool
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (15 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 16/24] i40e: separated buff_pool interface from i40e implementation Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 18/24] xdp: added buff_pool support to struct xdp_buff Björn Töpel
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The xsk_buff_pool is a buff_pool implementation that uses frames
(buffs) provided by user space instead of pages from the page allocator.
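
For clarity, a small sketch (not from the patch) of what a driver queue
backed by this pool sees; the function name is made up:

  static int example_fill_one(struct i40e_ring *rx_ring)
  {
          unsigned long handle;

          /* pulls frame ids off the fill queue in batches of 32 and
           * fails if user space has not posted any frames
           */
          if (bpool_alloc(rx_ring->bpool, &handle))
                  return -ENOMEM;

          bpool_buff_dma_sync_dev(rx_ring->bpool, handle, 0,
                                  bpool_buff_size(rx_ring->bpool));

          /* bpool_buff_dma(rx_ring->bpool, handle) is the user-space
           * frame's DMA address that goes into the Rx descriptor
           */

          return 0;
  }

Note that buff_convert_to_page() in this implementation allocates a new
page and copies the frame into it, so frames that end up as ordinary
skbs are copied; only the XDP/AF_XDP path stays zero-copy.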

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/Makefile        |   2 +-
 net/xdp/xsk_buff_pool.c | 225 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_buff_pool.h |  17 ++++
 3 files changed, 243 insertions(+), 1 deletion(-)
 create mode 100644 net/xdp/xsk_buff_pool.c
 create mode 100644 net/xdp/xsk_buff_pool.h

diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index b9d5d6b8823c..42727a32490c 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o xsk_ring.o xsk_packet_array.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xsk_ring.o xsk_packet_array.o xsk_buff_pool.o
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
new file mode 100644
index 000000000000..b0760ba2b188
--- /dev/null
+++ b/net/xdp/xsk_buff_pool.c
@@ -0,0 +1,225 @@
+#include "xsk_buff_pool.h"
+
+#include <linux/skbuff.h>
+#include <linux/buff_pool.h>
+
+#include "xsk_packet_array.h" /* XDP_KERNEL_HEADROOM */
+#include "xsk_buff.h"
+#include "xsk_ring.h"
+
+#define BATCH_SIZE 32
+
+static bool xsk_bp_alloc_from_freelist(struct xsk_buff_pool *impl,
+				       unsigned long *handle)
+{
+	struct xsk_buff *buff;
+
+	if (impl->free_list) {
+		buff = impl->free_list;
+		impl->free_list = buff->next;
+		buff->next = NULL;
+		*handle = (unsigned long)buff;
+
+		return true;
+	}
+
+	return false;
+}
+
+static int xsk_bp_alloc(void *pool, unsigned long *handle)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+	struct xsk_buff *buff;
+	struct xskq_iter it;
+	u32 id;
+
+	if (xsk_bp_alloc_from_freelist(impl, handle))
+		return 0;
+
+	it = xskq_deq_iter(impl->q, BATCH_SIZE);
+
+	while (!xskq_iter_end(&it)) {
+		id = xskq_deq_iter_get_id(impl->q, &it);
+		buff = &impl->bi->buffs[id];
+		buff->next = impl->free_list;
+		impl->free_list = buff;
+		xskq_deq_iter_next(impl->q, &it);
+	}
+
+	xskq_deq_iter_done(impl->q, &it);
+
+	if (xsk_bp_alloc_from_freelist(impl, handle))
+		return 0;
+
+	return -ENOMEM;
+}
+
+static void xsk_bp_free(void *pool, unsigned long handle)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+	struct xsk_buff *buff = (struct xsk_buff *)handle;
+
+	buff->next = impl->free_list;
+	impl->free_list = buff;
+}
+
+static unsigned int xsk_bp_buff_size(void *pool)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+
+	return impl->bi->buff_len - impl->bi->rx_headroom -
+		XDP_KERNEL_HEADROOM;
+}
+
+static unsigned int xsk_bp_total_buff_size(void *pool)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+
+	return impl->bi->buff_len - impl->bi->rx_headroom;
+}
+
+static unsigned int xsk_bp_buff_headroom(void *pool)
+{
+	(void)pool;
+
+	return XSK_KERNEL_HEADROOM;
+}
+
+static unsigned int xsk_bp_buff_truesize(void *pool)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+
+	return impl->bi->buff_len;
+}
+
+static void *xsk_bp_buff_ptr(void *pool, unsigned long handle)
+{
+	struct xsk_buff *buff = (struct xsk_buff *)handle;
+
+	(void)pool;
+	return buff->data + buff->offset;
+}
+
+static int xsk_bp_buff_convert_to_page(void *pool,
+				       unsigned long handle,
+				       struct page **pg, unsigned int *pg_off)
+{
+	unsigned int req_len, buff_len, pg_order = 0;
+	void *data;
+
+	buff_len = xsk_bp_total_buff_size(pool);
+	req_len = buff_len + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	/* XXX too sloppy? clean up... */
+	if (req_len > PAGE_SIZE) {
+		pg_order++;
+		if (req_len > (PAGE_SIZE << pg_order))
+			return -ENOMEM;
+	}
+
+	*pg = dev_alloc_pages(pg_order);
+	if (unlikely(!*pg))
+		return -ENOMEM;
+
+	data = page_address(*pg);
+	memcpy(data, xsk_bp_buff_ptr(pool, handle),
+	       xsk_bp_total_buff_size(pool));
+	*pg_off = 0;
+
+	xsk_bp_free(pool, handle);
+
+	return 0;
+}
+
+static dma_addr_t xsk_bp_buff_dma(void *pool,
+				  unsigned long handle)
+{
+	struct xsk_buff *buff = (struct xsk_buff *)handle;
+
+	(void)pool;
+	return buff->dma + buff->offset;
+}
+
+static void xsk_bp_buff_dma_sync_cpu(void *pool,
+				     unsigned long handle,
+				     unsigned int off,
+				     unsigned int size)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+	struct xsk_buff *buff = (struct xsk_buff *)handle;
+
+	dma_sync_single_range_for_cpu(impl->bi->dev, buff->dma,
+				      off, size, impl->bi->dir);
+}
+
+static void xsk_bp_buff_dma_sync_dev(void *pool,
+				     unsigned long handle,
+				     unsigned int off,
+				     unsigned int size)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+	struct xsk_buff *buff = (struct xsk_buff *)handle;
+
+	dma_sync_single_range_for_device(impl->bi->dev, buff->dma,
+					 off, size, impl->bi->dir);
+}
+
+static void xsk_bp_destroy(void *pool)
+{
+	struct xsk_buff_pool *impl = (struct xsk_buff_pool *)pool;
+	struct xsk_buff *buff = impl->free_list;
+
+	while (buff) {
+		xskq_return_id(impl->q, buff->id);
+		buff = buff->next;
+	}
+
+	kfree(impl);
+}
+
+struct buff_pool *xsk_buff_pool_create(struct xsk_buff_info *buff_info,
+				       struct xsk_queue *queue)
+{
+	struct buff_pool_ops *pool_ops;
+	struct xsk_buff_pool *impl;
+	struct buff_pool *pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	pool_ops = kzalloc(sizeof(*pool_ops), GFP_KERNEL);
+	if (!pool_ops) {
+		kfree(pool);
+		return NULL;
+	}
+
+	impl = kzalloc(sizeof(*impl), GFP_KERNEL);
+	if (!impl) {
+		kfree(pool_ops);
+		kfree(pool);
+		return NULL;
+	}
+
+	impl->bi = buff_info;
+	impl->q = queue;
+
+	pool_ops->alloc = xsk_bp_alloc;
+	pool_ops->free = xsk_bp_free;
+	pool_ops->buff_size = xsk_bp_buff_size;
+	pool_ops->total_buff_size = xsk_bp_total_buff_size;
+	pool_ops->buff_headroom = xsk_bp_buff_headroom;
+	pool_ops->buff_truesize = xsk_bp_buff_truesize;
+	pool_ops->buff_ptr = xsk_bp_buff_ptr;
+	pool_ops->buff_convert_to_page = xsk_bp_buff_convert_to_page;
+	pool_ops->buff_dma = xsk_bp_buff_dma;
+	pool_ops->buff_dma_sync_cpu = xsk_bp_buff_dma_sync_cpu;
+	pool_ops->buff_dma_sync_dev = xsk_bp_buff_dma_sync_dev;
+	pool_ops->destroy = xsk_bp_destroy;
+
+	pool->pool = impl;
+	pool->ops = pool_ops;
+
+	return pool;
+}
+
diff --git a/net/xdp/xsk_buff_pool.h b/net/xdp/xsk_buff_pool.h
new file mode 100644
index 000000000000..302c3e40cae4
--- /dev/null
+++ b/net/xdp/xsk_buff_pool.h
@@ -0,0 +1,17 @@
+#ifndef XSK_BUFF_POOL_H_
+#define XSK_BUFF_POOL_H_
+
+struct xsk_buff;
+struct xsk_buff_info;
+struct xsk_queue;
+
+struct xsk_buff_pool {
+	struct xsk_buff *free_list;
+	struct xsk_buff_info *bi;
+	struct xsk_queue *q;
+};
+
+struct buff_pool *xsk_buff_pool_create(struct xsk_buff_info *buff_info,
+				       struct xsk_queue *queue);
+
+#endif /* XSK_BUFF_POOL_H_ */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 18/24] xdp: added buff_pool support to struct xdp_buff
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (16 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 17/24] xsk: introduce xsk_buff_pool Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 19/24] xsk: add support for zero copy Rx Björn Töpel
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Extend struct xdp_buff with a buff_pool handle, and struct xdp_rxq_info
with a buff_pool pointer.
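
A hypothetical example (the function name is made up) of how a driver
fills in the new fields when building an xdp_buff from a pool buffer;
the rxq->bpool assignment would normally be done once at ring setup:

  static void example_init_xdp_buff(struct i40e_ring *rx_ring,
                                    struct xdp_buff *xdp,
                                    unsigned long handle,
                                    unsigned int headroom,
                                    unsigned int size)
  {
          xdp->data_hard_start = bpool_buff_ptr(rx_ring->bpool, handle);
          xdp->data = xdp->data_hard_start + headroom;
          xdp->data_end = xdp->data + size;
          xdp_set_data_meta_invalid(xdp);

          /* new: lets consumers of the xdp_buff find their way back to
           * the originating pool and buffer
           */
          xdp->bp_handle = handle;
          xdp->rxq = &rx_ring->xdp_rxq;
          xdp->rxq->bpool = rx_ring->bpool;
  }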

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h | 1 +
 include/net/xdp.h      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 43cacfe2cc2a..fbf6adb0fabd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -504,6 +504,7 @@ struct xdp_buff {
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
+	unsigned long bp_handle;
 	struct xdp_rxq_info *rxq;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index b2362ddfa694..fee3278e3d52 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -37,6 +37,7 @@ struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
 	u32 reg_state;
+	struct buff_pool *bpool;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 19/24] xsk: add support for zero copy Rx
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (17 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 18/24] xdp: added buff_pool support to struct xdp_buff Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 20/24] xsk: add support for zero copy Tx Björn Töpel
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

In this commit we start making use of the new ndo_bpf sub-commands and
try to enable zero copy Rx when the driver supports it.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
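
Note (not part of the commit message): a minimal sketch of the dispatch
that xsk_rcv() ends up doing after this patch -- a zero copy queue only
publishes a descriptor, while the copy path stages the payload into the
next user-space buffer. The types and helpers below are simplified
stand-ins, not the kernel code.

#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins; the real code works on struct xdp_sock,
 * struct xdp_buff and the xsk packet array / queue helpers.
 */
struct rx_frame {
	void *data;
	unsigned int len;
	bool zero_copy;		/* true when xdp->rxq->bpool is set */
};

/* Zero copy: user space already owns the buffer, so only a descriptor
 * (buffer id, length, offset) is published -- no memcpy().
 */
static int publish_zc_descriptor(const struct rx_frame *f)
{
	(void)f;
	return 0;
}

/* Copy mode: stage the payload into the next free user-space buffer. */
static int copy_into_umem(const struct rx_frame *f, void *umem_buf,
			  unsigned int umem_buf_size)
{
	if (f->len > umem_buf_size)
		return -ENOSPC;
	memcpy(umem_buf, f->data, f->len);
	return 0;
}

static int xsk_rcv_sketch(const struct rx_frame *f, void *umem_buf,
			  unsigned int umem_buf_size)
{
	return f->zero_copy ? publish_zc_descriptor(f)
			    : copy_into_umem(f, umem_buf, umem_buf_size);
}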
 net/xdp/xsk.c | 185 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 145 insertions(+), 40 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f372c3288301..f05ab825d157 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -29,15 +29,21 @@
 #include <linux/netdevice.h>
 #include <net/sock.h>
 
+#include <net/xdp_sock.h>
+#include <linux/buff_pool.h>
+
 #include "xsk.h"
 #include "xsk_buff.h"
 #include "xsk_ring.h"
+#include "xsk_buff_pool.h"
+#include "xsk_packet_array.h"
 
 #define XSK_UMEM_MIN_FRAME_SIZE 2048
 #define XSK_ARRAY_SIZE 512
 
 struct xsk_info {
 	struct xsk_packet_array *pa;
+	struct buff_pool *bp;
 	spinlock_t pa_lock;
 	struct xsk_queue *q;
 	struct xsk_umem *umem;
@@ -56,8 +62,24 @@ struct xdp_sock {
 	struct mutex tx_mutex;
 	u32 ifindex;
 	u16 queue_id;
+	bool zc_mode;
 };
 
+static inline bool xsk_is_zc_cap(struct xdp_sock *xs)
+{
+	return xs->zc_mode;
+}
+
+static void xsk_set_zc_cap(struct xdp_sock *xs)
+{
+	xs->zc_mode = true;
+}
+
+static void xsk_clear_zc_cap(struct xdp_sock *xs)
+{
+	xs->zc_mode = false;
+}
+
 static struct xdp_sock *xdp_sk(struct sock *sk)
 {
 	return (struct xdp_sock *)sk;
@@ -323,6 +345,22 @@ static int xsk_init_tx_ring(struct sock *sk, int mr_fd, u32 desc_nr)
 	return xsk_init_ring(sk, mr_fd, desc_nr, &xs->tx);
 }
 
+static void xsk_disable_zc(struct xdp_sock *xs)
+{
+	struct netdev_bpf bpf = {};
+
+	if (!xsk_is_zc_cap(xs))
+		return;
+
+	bpf.command = XDP_UNREGISTER_XSK;
+	bpf.xsk.queue_id = xs->queue_id;
+
+	rtnl_lock();
+	(void)xs->dev->netdev_ops->ndo_bpf(xs->dev, &bpf);
+	rtnl_unlock();
+	xsk_clear_zc_cap(xs);
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -344,14 +382,22 @@ static int xsk_release(struct socket *sock)
 		xs_prev = xs->dev->_rx[xs->queue_id].xs;
 		rcu_assign_pointer(xs->dev->_rx[xs->queue_id].xs, NULL);
 
+		xsk_disable_zc(xs);
+
 		/* Wait for driver to stop using the xdp socket. */
 		synchronize_net();
 
 		xskpa_destroy(xs->rx.pa);
-		xskpa_destroy(xs->tx.pa);
-		xsk_umem_destroy(xs_prev->umem);
+		bpool_destroy(xs->rx.bp);
 		xskq_destroy(xs_prev->rx.q);
+		xsk_buff_info_destroy(xs->rx.buff_info);
+
+		xskpa_destroy(xs->tx.pa);
 		xskq_destroy(xs_prev->tx.q);
+		xsk_buff_info_destroy(xs->tx.buff_info);
+
+		xsk_umem_destroy(xs_prev->umem);
+
 		kobject_put(&xs_prev->dev->_rx[xs->queue_id].kobj);
 		dev_put(xs_prev->dev);
 	}
@@ -365,6 +411,45 @@ static int xsk_release(struct socket *sock)
 	return 0;
 }
 
+static int xsk_dma_map_pool_cb(struct buff_pool *pool, struct device *dev,
+			       enum dma_data_direction dir,
+			       unsigned long attrs)
+{
+	struct xsk_buff_pool *bp = (struct xsk_buff_pool *)pool->pool;
+
+	return xsk_buff_dma_map(bp->bi, dev, dir, attrs);
+}
+
+static void xsk_error_report(void *ctx, int err)
+{
+	struct xsk_sock *xs = (struct xsk_sock *)ctx;
+}
+
+static void xsk_try_enable_zc(struct xdp_sock *xs)
+{
+	struct xsk_rx_parms rx_parms = {};
+	struct netdev_bpf bpf = {};
+	int err;
+
+	if (!xs->dev->netdev_ops->ndo_bpf)
+		return;
+
+	rx_parms.buff_pool = xs->rx.bp;
+	rx_parms.dma_map = xsk_dma_map_pool_cb;
+	rx_parms.error_report_ctx = xs;
+	rx_parms.error_report = xsk_error_report;
+
+	bpf.command = XDP_REGISTER_XSK;
+	bpf.xsk.rx_parms = &rx_parms;
+	bpf.xsk.queue_id = xs->queue_id;
+
+	rtnl_lock();
+	err = xs->dev->netdev_ops->ndo_bpf(xs->dev, &bpf);
+	rtnl_unlock();
+	if (!err)
+		xsk_set_zc_cap(xs);
+}
+
 static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 {
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
@@ -429,6 +514,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_rx_pa;
 	}
 
+	/* ...and Rx buffer pool is used for zerocopy. */
+	xs->rx.bp = xsk_buff_pool_create(xs->rx.buff_info, xs->rx.q);
+	if (!xs->rx.bp) {
+		err = -ENOMEM;
+		goto out_rx_bp;
+	}
+
 	/* Tx */
 	xs->tx.buff_info = xsk_buff_info_create(xs->tx.umem);
 	if (!xs->tx.buff_info) {
@@ -446,12 +538,17 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 
 	rcu_assign_pointer(dev->_rx[sxdp->sxdp_queue_id].xs, xs);
 
+	xsk_try_enable_zc(xs);
+
 	goto out_unlock;
 
 out_tx_pa:
 	xsk_buff_info_destroy(xs->tx.buff_info);
 	xs->tx.buff_info = NULL;
 out_tx_bi:
+	bpool_destroy(xs->rx.bp);
+	xs->rx.bp = NULL;
+out_rx_bp:
 	xskpa_destroy(xs->rx.pa);
 	xs->rx.pa = NULL;
 out_rx_pa:
@@ -509,27 +606,16 @@ int xsk_generic_rcv(struct xdp_buff *xdp)
 }
 EXPORT_SYMBOL_GPL(xsk_generic_rcv);
 
-struct xdp_sock *xsk_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp)
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	u32 len = xdp->data_end - xdp->data;
 	struct xsk_frame_set p;
 
-	rcu_read_lock();
-	if (!xsk)
-		xsk = lookup_xsk(xdp->rxq->dev, xdp->rxq->queue_index);
-	if (unlikely(!xsk)) {
-		rcu_read_unlock();
-		return ERR_PTR(-EINVAL);
-	}
-
-	if (!xskpa_next_frame_populate(xsk->rx.pa, &p)) {
-		rcu_read_unlock();
-		return ERR_PTR(-ENOSPC);
-	}
+	if (!xskpa_next_frame_populate(xs->rx.pa, &p))
+		return -ENOSPC;
 
 	memcpy(xskf_get_data(&p), xdp->data, len);
 	xskf_set_frame_no_offset(&p, len, true);
-	rcu_read_unlock();
 
 	/* We assume that the semantic of xdp_do_redirect is such that
 	 * ndo_xdp_xmit will decrease the refcount of the page when it
@@ -540,41 +626,60 @@ struct xdp_sock *xsk_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp)
 	 */
 	page_frag_free(xdp->data);
 
-	return xsk;
+	return 0;
 }
-EXPORT_SYMBOL_GPL(xsk_rcv);
 
-int xsk_zc_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp)
+static void __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 offset = xdp->data - xdp->data_hard_start;
-	u32 len = xdp->data_end - xdp->data;
-	struct xsk_frame_set p;
+	struct xsk_buff *b = (struct xsk_buff *)xdp->bp_handle;
 
-	/* We do not need any locking here since we are guaranteed
-	 * a single producer and a single consumer.
-	 */
-	if (xskpa_next_frame_populate(xsk->rx.pa, &p)) {
-		xskf_set_frame(&p, len, offset, true);
-		return 0;
-	}
-
-	/* No user-space buffer to put the packet in. */
-	return -ENOSPC;
+	xskq_enq_lazy(xs->rx.q, b->id, xdp->data_end - xdp->data,
+		      b->offset + (xdp->data - xdp->data_hard_start));
 }
-EXPORT_SYMBOL_GPL(xsk_zc_rcv);
 
-void xsk_flush(struct xdp_sock *xsk)
+struct xdp_sock *xsk_rcv(struct xdp_sock *xsk, struct xdp_buff *xdp)
 {
+	int err = 0;
+
 	rcu_read_lock();
-	if (!xsk)
-		xsk = lookup_xsk(xsk->dev, xsk->queue_id);
-	if (unlikely(!xsk)) {
-		rcu_read_unlock();
-		return;
+
+	if (!xsk) {
+		xsk = lookup_xsk(xdp->rxq->dev, xdp->rxq->queue_index);
+		if (!xsk) {
+			err = -EINVAL;
+			goto out;
+		}
 	}
 
-	WARN_ON_ONCE(xskpa_flush(xsk->rx.pa));
+	/* XXX Ick, this is very hacky. Need a better solution */
+	if (xdp->rxq->bpool)
+		__xsk_rcv_zc(xsk, xdp);
+	else
+		err = __xsk_rcv(xsk, xdp);
+
+out:
 	rcu_read_unlock();
+
+	return err ? ERR_PTR(err) : xsk;
+}
+EXPORT_SYMBOL_GPL(xsk_rcv);
+
+static void __xsk_flush(struct xdp_sock *xs)
+{
+	WARN_ON_ONCE(xskpa_flush(xs->rx.pa));
+}
+
+static void __xsk_flush_zc(struct xdp_sock *xs)
+{
+	xskq_enq_flush(xs->rx.q);
+}
+
+void xsk_flush(struct xdp_sock *xsk)
+{
+	if (xsk_is_zc_cap(xsk))
+		__xsk_flush_zc(xsk);
+	else
+		__xsk_flush(xsk);
 }
 EXPORT_SYMBOL_GPL(xsk_flush);
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 20/24] xsk: add support for zero copy Tx
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (18 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 19/24] xsk: add support for zero copy Rx Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 21/24] i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx Björn Töpel
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, ndo_xdp_xmit_xsk support is wired up for netdevices that
implement the ndo.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
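
Note (not part of the commit message): a simplified, self-contained
sketch of how a driver is expected to consume the new Tx callbacks --
pull packets with get_tx_packet() until none remain, then report the
whole batch through tx_completion(). The type and field names below are
illustrative, and the dma handle argument is left out for brevity; the
real i40e wiring follows in later patches of this series.

#include <stdint.h>

typedef uint32_t u32;

/* Illustrative callback types, modelled on struct xsk_tx_parms. */
typedef int (*get_tx_packet_fn)(void *dev, u32 queue_id,
				void **data, u32 *len, u32 *offset);
typedef void (*tx_completion_fn)(u32 start, u32 npackets,
				 unsigned long ctx1, unsigned long ctx2);

struct tx_ring_sketch {
	void *dev;
	u32 queue_id;
	u32 next_to_use;
	get_tx_packet_fn get_packet;
	tx_completion_fn complete;
	unsigned long ctx1, ctx2;
};

/* Post as many AF_XDP Tx packets as the socket has queued, then report
 * them in one batch. A real driver fills hardware descriptors here and
 * defers the completion until the hardware is actually done.
 */
static void xmit_xsk_sketch(struct tx_ring_sketch *r)
{
	u32 start = r->next_to_use, sent = 0;
	u32 len, offset;
	void *data;

	while (r->get_packet(r->dev, r->queue_id, &data, &len, &offset)) {
		/* ... program one hardware Tx descriptor for (data, len) ... */
		r->next_to_use++;
		sent++;
	}

	if (sent)
		r->complete(start, sent, r->ctx1, r->ctx2);
}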
 include/net/xdp_sock.h     |   4 ++
 net/xdp/xsk.c              | 149 +++++++++++++++++++++++++++++++++++++--------
 net/xdp/xsk_packet_array.h |   5 ++
 3 files changed, 131 insertions(+), 27 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 866ea7191217..3a257eb5108b 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,10 @@ struct xdp_sock;
  */
 
 struct xsk_tx_parms {
+	struct buff_pool *buff_pool;
+	int (*dma_map)(struct buff_pool *bp, struct device *dev,
+		       enum dma_data_direction dir,
+		       unsigned long attr);
 	void (*tx_completion)(u32 start, u32 npackets,
 			      unsigned long ctx1, unsigned long ctx2);
 	unsigned long ctx1;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f05ab825d157..0de3cadc7165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -68,7 +68,7 @@ struct xdp_sock {
 static inline bool xsk_is_zc_cap(struct xdp_sock *xs)
 {
 	return xs->zc_mode;
-}
+};
 
 static void xsk_set_zc_cap(struct xdp_sock *xs)
 {
@@ -85,6 +85,7 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+/* CONFIG */
 static void xsk_umem_unpin_pages(struct xsk_umem *umem)
 {
 	unsigned int i;
@@ -393,6 +394,7 @@ static int xsk_release(struct socket *sock)
 		xsk_buff_info_destroy(xs->rx.buff_info);
 
 		xskpa_destroy(xs->tx.pa);
+		bpool_destroy(xs->tx.bp);
 		xskq_destroy(xs_prev->tx.q);
 		xsk_buff_info_destroy(xs->tx.buff_info);
 
@@ -423,17 +425,96 @@ static int xsk_dma_map_pool_cb(struct buff_pool *pool, struct device *dev,
 static void xsk_error_report(void *ctx, int err)
 {
 	struct xsk_sock *xs = (struct xsk_sock *)ctx;
+
+	(void)xs;
+}
+
+static inline struct xdp_sock *lookup_xsk(struct net_device *dev,
+					  unsigned int queue_id)
+{
+	if (unlikely(queue_id > dev->num_rx_queues))
+		return NULL;
+
+	return rcu_dereference(dev->_rx[queue_id].xs);
+}
+
+/* TX */
+static void xsk_tx_completion(u32 start, u32 npackets,
+			      unsigned long ctx1, unsigned long ctx2)
+{
+	struct net_device *dev = (struct net_device *)ctx1;
+	u32 queue_id = (u32)ctx2;
+	struct xdp_sock *xs;
+
+	(void)start;
+	rcu_read_lock();
+	xs = lookup_xsk(dev, queue_id);
+	if (likely(xs))
+		WARN_ON_ONCE(xskpa_flush_n(xs->tx.pa, npackets));
+
+	rcu_read_unlock();
+}
+
+static int xsk_get_packet(struct net_device *dev, u32 queue_id,
+			  dma_addr_t *dma, void **data, u32 *len,
+			  u32 *offset)
+{
+	struct xsk_frame_set p;
+	struct xdp_sock *xs;
+	int ret = 0;
+
+	rcu_read_lock();
+	xs = lookup_xsk(dev, queue_id);
+	if (unlikely(!xs))
+		goto out;
+
+	if (xskpa_next_frame_populate(xs->tx.pa, &p)) {
+		struct xsk_buff *buff;
+
+		*offset = xskf_get_data_offset(&p);
+		*len = xskf_get_frame_len(&p);
+		*data = xskf_get_data(&p);
+		buff = xsk_buff_info_get_buff(xs->tx.buff_info,
+					      xskf_get_frame_id(&p));
+		WARN_ON_ONCE(!buff);
+		if (!buff)
+			goto out;
+		*dma = buff->dma;
+		ret = 1;
+	}
+
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 static void xsk_try_enable_zc(struct xdp_sock *xs)
 {
 	struct xsk_rx_parms rx_parms = {};
+	struct xsk_tx_parms tx_parms = {};
 	struct netdev_bpf bpf = {};
 	int err;
 
-	if (!xs->dev->netdev_ops->ndo_bpf)
+	if (!xs->dev->netdev_ops->ndo_bpf ||
+	    !xs->dev->netdev_ops->ndo_xdp_xmit_xsk)
 		return;
 
+	/* Until we can attach an XDP program on TX as well,
+	 * egress operates in the same mode (XDP_SKB or XDP_DRV) as set
+	 * by the XDP RX program loading.
+	 * An XDP program needs to be loaded, for now.
+	 */
+	if (xs->dev->netdev_ops->ndo_bpf) {
+		struct netdev_bpf xdp;
+
+		rtnl_lock();
+		__dev_xdp_query(xs->dev, xs->dev->netdev_ops->ndo_bpf, &xdp);
+		rtnl_unlock();
+
+		if (!xdp.prog_attached)
+			return;
+	}
+
 	rx_parms.buff_pool = xs->rx.bp;
 	rx_parms.dma_map = xsk_dma_map_pool_cb;
 	rx_parms.error_report_ctx = xs;
@@ -443,6 +524,14 @@ static void xsk_try_enable_zc(struct xdp_sock *xs)
 	bpf.xsk.rx_parms = &rx_parms;
 	bpf.xsk.queue_id = xs->queue_id;
 
+	tx_parms.buff_pool = xs->tx.bp;
+	tx_parms.dma_map = xsk_dma_map_pool_cb;
+	tx_parms.tx_completion = xsk_tx_completion;
+	tx_parms.ctx1 = (unsigned long)xs->dev;
+	tx_parms.ctx2 = xs->queue_id;
+	tx_parms.get_tx_packet = xsk_get_packet;
+	bpf.xsk.tx_parms = &tx_parms;
+
 	rtnl_lock();
 	err = xs->dev->netdev_ops->ndo_bpf(xs->dev, &bpf);
 	rtnl_unlock();
@@ -536,12 +625,29 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_tx_pa;
 	}
 
+	xs->tx.bp = xsk_buff_pool_create(xs->tx.buff_info, xs->tx.q);
+	if (!xs->tx.bp) {
+		err = -ENOMEM;
+		goto out_tx_bp;
+	}
+
 	rcu_assign_pointer(dev->_rx[sxdp->sxdp_queue_id].xs, xs);
 
 	xsk_try_enable_zc(xs);
+	/* Need to have an XDP program loaded for now. */
+	if (!xsk_is_zc_cap(xs) && !dev->xdp_prog) {
+		err = -ENODATA;
+		goto out_no_xdp_prog;
+	}
 
 	goto out_unlock;
 
+out_no_xdp_prog:
+	xskpa_destroy(xs->tx.pa);
+	xs->tx.pa = NULL;
+out_tx_bp:
+	bpool_destroy(xs->tx.bp);
+	xs->tx.bp = NULL;
 out_tx_pa:
 	xsk_buff_info_destroy(xs->tx.buff_info);
 	xs->tx.buff_info = NULL;
@@ -563,15 +669,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	return err;
 }
 
-static inline struct xdp_sock *lookup_xsk(struct net_device *dev,
-					  unsigned int queue_id)
-{
-	if (unlikely(queue_id > dev->num_rx_queues))
-		return NULL;
-
-	return rcu_dereference(dev->_rx[queue_id].xs);
-}
-
+/* RX */
 int xsk_generic_rcv(struct xdp_buff *xdp)
 {
 	u32 len = xdp->data_end - xdp->data;
@@ -753,25 +851,19 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 	return -EOPNOTSUPP;
 }
 
-void xsk_tx_completion(struct net_device *dev, u16 queue_index,
-		       unsigned int npackets)
+static int xsk_xdp_xmit(struct sock *sk, struct msghdr *m,
+			size_t total_len)
 {
-	unsigned long flags;
-	struct xdp_sock *xs;
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev = xs->dev;
 
-	rcu_read_lock();
-	xs = lookup_xsk(dev, queue_index);
-	if (unlikely(!xs)) {
-		rcu_read_unlock();
-		return;
-	}
+	if (need_wait)
+		/* Not implemented yet. */
+		return -EINVAL;
 
-	spin_lock_irqsave(&xs->tx.pa_lock, flags);
-	WARN_ON_ONCE(xskpa_flush_n(xs->tx.pa, npackets));
-	spin_unlock_irqrestore(&xs->tx.pa_lock, flags);
-	rcu_read_unlock();
+	return dev->netdev_ops->ndo_xdp_xmit_xsk(dev, xs->queue_id);
 }
-EXPORT_SYMBOL_GPL(xsk_tx_completion);
 
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
@@ -917,7 +1009,10 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
 
-	return xsk_generic_xmit(sk, m, total_len);
+	if (!xsk_is_zc_cap(xs))
+		return xsk_generic_xmit(sk, m, total_len);
+
+	return xsk_xdp_xmit(sk, m, total_len);
 }
 
 static int xsk_mmap(struct file *file, struct socket *sock,
diff --git a/net/xdp/xsk_packet_array.h b/net/xdp/xsk_packet_array.h
index 1f7544dee443..53803a1b7281 100644
--- a/net/xdp/xsk_packet_array.h
+++ b/net/xdp/xsk_packet_array.h
@@ -149,6 +149,11 @@ static inline void *xskf_get_data(struct xsk_frame_set *p)
 	return buff->data + desc->offset;
 }
 
+static inline dma_addr_t xskf_get_dma(struct xsk_frame_set *p)
+{
+	return 0;
+}
+
 static inline u32 xskf_get_data_offset(struct xsk_frame_set *p)
 {
 	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].offset;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 21/24] i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (19 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 20/24] xsk: add support for zero copy Tx Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 22/24] i40e: introduced a clean_tx callback function Björn Töpel
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

In this commit we add support for the two xsk ndo_bpf sub-commands
for registering an xsk with the driver.

NB! This commit also contains code for disabling/enabling a queue
pair in i40e. That part should probably be split out from the ndo
implementation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
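
Note (not part of the commit message): a condensed sketch, with
illustrative helper names, of the XDP_REGISTER_XSK flow implemented
here -- DMA-map the buffer pool, quiesce the queue pair if the
interface is running, attach the xsk context, then bring the queue
pair back up. It is a reading aid, not the i40e code.

#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for the driver objects used in this patch. */
struct pool_sketch;
struct xsk_ctx_sketch;

struct qp_sketch {
	bool running;
	struct xsk_ctx_sketch *xsk_ctx;
};

static int dma_map_pool(struct pool_sketch *p) { (void)p; return 0; }
static int qp_disable(struct qp_sketch *qp) { qp->running = false; return 0; }
static int qp_enable(struct qp_sketch *qp) { qp->running = true; return 0; }

/* XDP_REGISTER_XSK, roughly: map the pool, quiesce the queue pair if
 * the interface is up, attach the context, then resume the queue pair.
 */
static int xsk_rx_enable_sketch(struct qp_sketch *qp, struct pool_sketch *p,
				struct xsk_ctx_sketch *ctx, bool if_running)
{
	int err;

	if (qp->xsk_ctx)
		return -EBUSY;	/* queue already has a socket attached */

	err = dma_map_pool(p);
	if (err)
		return err;

	if (if_running) {
		err = qp_disable(qp);
		if (err)
			return err;
	}

	qp->xsk_ctx = ctx;

	return if_running ? qp_enable(qp) : 0;
}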
 drivers/net/ethernet/intel/i40e/i40e.h      |  24 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c | 434 +++++++++++++++++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  17 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  32 ++
 4 files changed, 493 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 46e9f4e0a02c..6452ac5caa76 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -806,6 +806,10 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	struct i40e_xsk_ctx **xsk_ctxs;
+	u16 num_xsk_ctxs;
+	u16 xsk_ctxs_in_use;
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1109,4 +1113,24 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
+
+static inline bool i40e_xsk_attached(struct i40e_ring *rxr)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(rxr->vsi);
+	int qid = rxr->queue_index;
+
+	return rxr->vsi->xsk_ctxs && rxr->vsi->xsk_ctxs[qid] && xdp_on;
+}
+
+static inline struct buff_pool *i40e_xsk_buff_pool(struct i40e_ring *rxr)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(rxr->vsi);
+	int qid = rxr->queue_index;
+
+	if (!rxr->vsi->xsk_ctxs || !rxr->vsi->xsk_ctxs[qid] || !xdp_on)
+		return NULL;
+
+	return rxr->vsi->xsk_ctxs[qid]->buff_pool;
+}
+
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0e1445af6b01..0c1ac8564f77 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -29,6 +29,7 @@
 #include <linux/pci.h>
 #include <linux/bpf.h>
 #include <linux/buff_pool.h>
+#include <net/xdp_sock.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -3211,6 +3212,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	u32 chain_len = vsi->back->hw.func_caps.rx_buf_chain_len;
 	u16 pf_q = vsi->base_queue + ring->queue_index;
 	struct i40e_hw *hw = &vsi->back->hw;
+	struct buff_pool *xsk_buff_pool;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	bool reserve_headroom;
 	unsigned int mtu = 0;
@@ -3229,9 +3231,20 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	} else {
 		reserve_headroom = false;
 	}
-	ring->bpool = i40e_buff_pool_recycle_create(mtu, reserve_headroom,
-						    ring->dev,
-						    ring->count);
+
+	xsk_buff_pool = i40e_xsk_buff_pool(ring);
+	if (xsk_buff_pool) {
+		ring->bpool = xsk_buff_pool;
+		ring->xdp_rxq.bpool = xsk_buff_pool;
+		set_ring_xsk_buff_pool(ring);
+	} else {
+		ring->bpool = i40e_buff_pool_recycle_create(mtu,
+							    reserve_headroom,
+							    ring->dev,
+							    ring->count);
+		ring->xdp_rxq.bpool = NULL;
+		clear_ring_xsk_buff_pool(ring);
+	}
 	ring->rx_buf_hr = (u16)bpool_buff_headroom(ring->bpool);
 	ring->rx_buf_len = (u16)bpool_buff_size(ring->bpool);
 
@@ -9923,6 +9936,25 @@ static void i40e_clear_rss_config_user(struct i40e_vsi *vsi)
 	vsi->rss_lut_user = NULL;
 }
 
+static void i40e_free_xsk_ctxs(struct i40e_vsi *vsi)
+{
+	struct i40e_xsk_ctx *ctx;
+	u16 i;
+
+	if (!vsi->xsk_ctxs)
+		return;
+
+	for (i = 0; i < vsi->num_xsk_ctxs; i++) {
+		ctx = vsi->xsk_ctxs[i];
+		/* ctx free'd by error handle */
+		if (ctx)
+			ctx->err_handler(ctx->err_ctx, -1 /* XXX wat? */);
+	}
+
+	kfree(vsi->xsk_ctxs);
+	vsi->xsk_ctxs = NULL;
+}
+
 /**
  * i40e_vsi_clear - Deallocate the VSI provided
  * @vsi: the VSI being un-configured
@@ -9938,6 +9970,8 @@ static int i40e_vsi_clear(struct i40e_vsi *vsi)
 		goto free_vsi;
 	pf = vsi->back;
 
+	i40e_free_xsk_ctxs(vsi);
+
 	mutex_lock(&pf->switch_mutex);
 	if (!pf->vsi[vsi->idx]) {
 		dev_err(&pf->pdev->dev, "pf->vsi[%d] is NULL, just free vsi[%d](%p,type %d)\n",
@@ -11635,6 +11669,394 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_control_napi(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_control_rings(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_control_rings(vsi, queue_pair,
+					    false /* disable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
+static void i40e_free_xsk_ctxs_if_last(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_ctxs_in_use > 0)
+		return;
+
+	kfree(vsi->xsk_ctxs);
+	vsi->xsk_ctxs = NULL;
+	vsi->num_xsk_ctxs = 0;
+}
+
+static int i40e_alloc_xsk_ctxs(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_ctxs)
+		return 0;
+
+	vsi->num_xsk_ctxs = vsi->alloc_queue_pairs;
+	vsi->xsk_ctxs = kcalloc(vsi->num_xsk_ctxs, sizeof(*vsi->xsk_ctxs),
+				GFP_KERNEL);
+	if (!vsi->xsk_ctxs) {
+		vsi->num_xsk_ctxs = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int i40e_add_xsk_ctx(struct i40e_vsi *vsi,
+			    int queue_id,
+			    struct buff_pool *buff_pool,
+			    void *err_ctx,
+			    void (*err_handler)(void *, int))
+{
+	struct i40e_xsk_ctx *ctx;
+	int err;
+
+	err = i40e_alloc_xsk_ctxs(vsi);
+	if (err)
+		return err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx) {
+		i40e_free_xsk_ctxs_if_last(vsi);
+		return -ENOMEM;
+	}
+
+	vsi->xsk_ctxs_in_use++;
+	ctx->buff_pool = buff_pool;
+	ctx->err_ctx = err_ctx;
+	ctx->err_handler = err_handler;
+
+	vsi->xsk_ctxs[queue_id] = ctx;
+
+	return 0;
+}
+
+static void i40e_remove_xsk_ctx(struct i40e_vsi *vsi, int queue_id)
+{
+	kfree(vsi->xsk_ctxs[queue_id]);
+	vsi->xsk_ctxs[queue_id] = NULL;
+	vsi->xsk_ctxs_in_use--;
+	i40e_free_xsk_ctxs_if_last(vsi);
+}
+
+static int i40e_xsk_enable(struct net_device *netdev, u32 qid,
+			   struct xsk_rx_parms *parms)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_ctxs && vsi->xsk_ctxs[qid])
+		return -EBUSY;
+
+	err = parms->dma_map(parms->buff_pool, &vsi->back->pdev->dev,
+			     DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
+	if (err)
+		return err;
+
+	if_running = netif_running(netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_ctx(vsi, qid, parms->buff_pool,
+			       parms->error_report_ctx, parms->error_report);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_disable(struct net_device *netdev, u32 qid,
+			    struct xsk_rx_parms *parms)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_ctxs || qid >= vsi->num_xsk_ctxs || !vsi->xsk_ctxs[qid])
+		return -EINVAL;
+
+	if_running = netif_running(netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_remove_xsk_ctx(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
@@ -11656,6 +12078,12 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_REGISTER_XSK:
+		return i40e_xsk_enable(dev, xdp->xsk.queue_id,
+				       xdp->xsk.rx_parms);
+	case XDP_UNREGISTER_XSK:
+		return i40e_xsk_disable(dev, xdp->xsk.queue_id,
+					xdp->xsk.rx_parms);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index fffc254abd8c..4fb5bc030df7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1256,8 +1256,11 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
 
-	bpool_destroy(rx_ring->bpool);
+	if (!ring_has_xsk_buff_pool(rx_ring))
+		bpool_destroy(rx_ring->bpool);
+
 	rx_ring->bpool = NULL;
+	clear_ring_xsk_buff_pool(rx_ring);
 
 	if (rx_ring->desc) {
 		dma_free_coherent(rx_ring->dev, rx_ring->size,
@@ -1917,6 +1920,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	xdp.data = xdp.data_hard_start + *headroom;
 	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *size;
+	xdp.bp_handle = handle;
 	xdp.rxq = &rx_ring->xdp_rxq;
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
@@ -1943,17 +1947,8 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 		}
 		break;
 	case XDP_REDIRECT:
-		err = i40e_xdp_buff_convert_page(rx_ring, &xdp, handle, *size,
-						 *headroom);
-		if (err) {
-			result = I40E_XDP_CONSUMED;
-			break;
-		}
-
 		err = xdp_do_redirect(rx_ring->netdev, &xdp, xdp_prog);
-		result = I40E_XDP_TX;
-		if (err)
-			page_frag_free(xdp.data);
+		result = err ? I40E_XDP_CONSUMED : I40E_XDP_TX;
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index d8345265db1e..906a562507a9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -245,6 +245,14 @@ static inline unsigned int i40e_txd_use_count(unsigned int size)
 #define I40E_TX_FLAGS_VLAN_PRIO_SHIFT	29
 #define I40E_TX_FLAGS_VLAN_SHIFT	16
 
+/* Signals completion of a TX packet for an XDP socket. */
+typedef void (*tx_completion_func)(u32 start, u32 npackets,
+				   unsigned long ctx1, unsigned long ctx2);
+/* Returns the next packet to send for an XDP socket. */
+typedef int (*get_tx_packet_func)(struct net_device *dev, u32 queue_id,
+				  dma_addr_t *dma, void **data, u32 *len,
+				  u32 *offset);
+
 struct i40e_tx_buffer {
 	struct i40e_tx_desc *next_to_watch;
 	union {
@@ -291,6 +299,12 @@ enum i40e_ring_state_t {
 	__I40E_RING_STATE_NBITS /* must be last */
 };
 
+struct i40e_xsk_ctx {
+	struct buff_pool *buff_pool;
+	void *err_ctx;
+	void (*err_handler)(void *ctx, int errno);
+};
+
 /* some useful defines for virtchannel interface, which
  * is the only remaining user of header split
  */
@@ -346,6 +360,7 @@ struct i40e_ring {
 #define I40E_TXR_FLAGS_WB_ON_ITR		BIT(0)
 #define I40E_RXR_FLAGS_BUILD_SKB_ENABLED	BIT(1)
 #define I40E_TXR_FLAGS_XDP			BIT(2)
+#define I40E_RXR_FLAGS_XSK_BUFF_POOL		BIT(3)
 
 	/* stats structs */
 	struct i40e_queue_stats	stats;
@@ -374,6 +389,7 @@ struct i40e_ring {
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
 	struct buff_pool *bpool;
+	struct i40e_xsk_ctx *xsk;
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -401,6 +417,21 @@ static inline void set_ring_xdp(struct i40e_ring *ring)
 	ring->flags |= I40E_TXR_FLAGS_XDP;
 }
 
+static inline bool ring_has_xsk_buff_pool(struct i40e_ring *ring)
+{
+	return !!(ring->flags & I40E_RXR_FLAGS_XSK_BUFF_POOL);
+}
+
+static inline void clear_ring_xsk_buff_pool(struct i40e_ring *ring)
+{
+	ring->flags &= ~I40E_RXR_FLAGS_XSK_BUFF_POOL;
+}
+
+static inline void set_ring_xsk_buff_pool(struct i40e_ring *ring)
+{
+	ring->flags |= I40E_RXR_FLAGS_XSK_BUFF_POOL;
+}
+
 enum i40e_latency_range {
 	I40E_LOWEST_LATENCY = 0,
 	I40E_LOW_LATENCY = 1,
@@ -536,4 +567,5 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
 {
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
+
 #endif /* _I40E_TXRX_H_ */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 22/24] i40e: introduced a clean_tx callback function
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (20 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 21/24] i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 23/24] i40e: introduced Tx completion callbacks Björn Töpel
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

This callback will be used by XDP and AF_XDP, so that each path can
have its own clean_tx_irq function.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
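
Note (not part of the commit message): a minimal sketch of the
indirection this patch adds, using simplified stand-in types -- each Tx
ring carries a clean_tx function pointer, SKB and XDP rings get
different cleaners at ring setup time, and the NAPI poll loop calls
through the pointer instead of a hard-coded i40e_clean_tx_irq().

#include <stdbool.h>

/* Simplified stand-ins for struct i40e_vsi and struct i40e_ring. */
struct vsi_sketch;

struct ring_sketch {
	struct vsi_sketch *vsi;
	bool (*clean_tx)(struct vsi_sketch *vsi, struct ring_sketch *ring,
			 int napi_budget);
};

/* SKB-path Tx cleanup. */
static bool clean_tx_skb(struct vsi_sketch *vsi, struct ring_sketch *ring,
			 int budget)
{
	(void)vsi; (void)ring; (void)budget;
	return true;
}

/* XDP/AF_XDP Tx cleanup. */
static bool clean_tx_xdp(struct vsi_sketch *vsi, struct ring_sketch *ring,
			 int budget)
{
	(void)vsi; (void)ring; (void)budget;
	return true;
}

/* Ring setup picks the cleaner once... */
static void setup_rings_sketch(struct ring_sketch *tx, struct ring_sketch *xdp)
{
	tx->clean_tx = clean_tx_skb;
	xdp->clean_tx = clean_tx_xdp;
}

/* ...and NAPI poll just calls through the pointer. */
static bool napi_poll_tx_sketch(struct ring_sketch *ring, int budget)
{
	return ring->clean_tx(ring->vsi, ring, budget);
}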
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 +++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 8 ++++----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h | 4 ++++
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0c1ac8564f77..363077c7157a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10053,6 +10053,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->count = vsi->num_desc;
 		ring->size = 0;
 		ring->dcb_tc = 0;
+		ring->clean_tx = i40e_clean_tx_irq;
 		if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
 			ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
 		ring->tx_itr_setting = pf->tx_itr_default;
@@ -10065,11 +10066,12 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->reg_idx = vsi->base_queue + ring->queue_index;
 		ring->ring_active = false;
 		ring->vsi = vsi;
-		ring->netdev = NULL;
+		ring->netdev = vsi->netdev;
 		ring->dev = &pf->pdev->dev;
 		ring->count = vsi->num_desc;
 		ring->size = 0;
 		ring->dcb_tc = 0;
+		ring->clean_tx = i40e_clean_tx_irq;
 		if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
 			ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
 		set_ring_xdp(ring);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 4fb5bc030df7..932b318b8147 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -681,7 +681,7 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	tx_ring->next_to_use = 0;
 	tx_ring->next_to_clean = 0;
 
-	if (!tx_ring->netdev)
+	if (!tx_ring->netdev || ring_is_xdp(tx_ring))
 		return;
 
 	/* cleanup Tx queue statistics */
@@ -791,8 +791,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget)
 {
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
@@ -2249,7 +2249,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_tx(vsi, ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 906a562507a9..cbb1bd261e6a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -390,6 +390,8 @@ struct i40e_ring {
 	struct xdp_rxq_info xdp_rxq;
 	struct buff_pool *bpool;
 	struct i40e_xsk_ctx *xsk;
+	bool (*clean_tx)(struct i40e_vsi *vsi,
+			 struct i40e_ring *tx_ring, int napi_budget);
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -480,6 +482,8 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
 void i40e_xdp_flush(struct net_device *dev);
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget);
 
 
 /**
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 23/24] i40e: introduced Tx completion callbacks
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (21 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 22/24] i40e: introduced a clean_tx callback function Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-01-31 13:53 ` [RFC PATCH 24/24] i40e: Tx support for zero copy allocator Björn Töpel
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduced a TX completion callback in the clean_tx_irq function.
In order to make it non-intrusive to the SKB path, the XDP path
now has its own function for cleaning up after TX completion.
The XDP cases have been removed from the SKB path, which should
make that path faster.

This is in preparation for the AF_XDP zero copy mode.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
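
Note (not part of the commit message): a compact sketch of the
completion batching done in i40e_xdp_clean_tx_irq() -- consecutive Tx
buffers that share the same completion callback are reported with a
single call, a pending run is flushed whenever the callback changes,
and the final run is flushed at the end. The types below are simplified
stand-ins.

#include <stdint.h>

typedef uint32_t u32;

typedef void (*completion_fn)(u32 start, u32 npackets,
			      unsigned long ctx1, unsigned long ctx2);

/* Simplified stand-in for struct i40e_tx_buffer. */
struct txbuf_sketch {
	completion_fn completion;
	unsigned long ctx1, ctx2;
};

/* Walk the buffers cleaned this round and batch completions per
 * callback.
 */
static void complete_batched(struct txbuf_sketch *bufs, u32 nbufs, u32 start)
{
	struct txbuf_sketch *prev = &bufs[0];
	u32 run = 0, run_start = start;
	u32 i;

	for (i = 0; i < nbufs; i++) {
		struct txbuf_sketch *cur = &bufs[i];

		if (cur->completion != prev->completion) {
			prev->completion(run_start, run,
					 prev->ctx1, prev->ctx2);
			run = 0;
			run_start = start + i;
		}
		run++;
		prev = cur;
	}

	if (run)
		prev->completion(run_start, run, prev->ctx1, prev->ctx2);
}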
 drivers/net/ethernet/intel/i40e/i40e_main.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 182 +++++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   6 +-
 3 files changed, 159 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 363077c7157a..95b8942a31ae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10071,7 +10071,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->count = vsi->num_desc;
 		ring->size = 0;
 		ring->dcb_tc = 0;
-		ring->clean_tx = i40e_clean_tx_irq;
+		ring->clean_tx = i40e_xdp_clean_tx_irq;
 		if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
 			ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
 		set_ring_xdp(ring);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 932b318b8147..7ab49146d15c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -783,6 +783,153 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 
 #define WB_STRIDE 4
 
+static void i40e_update_stats_and_arm_wb(struct i40e_ring *tx_ring,
+					 struct i40e_vsi *vsi,
+					 unsigned int total_packets,
+					 unsigned int total_bytes,
+					 int budget)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && j > 0 &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
+static void i40e_xdp_tx_completion(u32 start, u32 npackets,
+				   unsigned long ctx1, unsigned long ctx2)
+{
+	struct i40e_ring *tx_ring = (struct i40e_ring *)ctx1;
+	struct i40e_tx_buffer *tx_buf;
+	u32 i = 0;
+
+	(void)ctx2;
+	tx_buf = &tx_ring->tx_bi[start];
+	while (i < npackets) {
+		/* free the XDP data */
+		page_frag_free(tx_buf->raw_buf);
+
+		/* unmap skb header data */
+		dma_unmap_single(tx_ring->dev,
+				 dma_unmap_addr(tx_buf, dma),
+				 dma_unmap_len(tx_buf, len),
+				 DMA_TO_DEVICE);
+
+		/* clear tx_buffer data */
+		tx_buf->skb = NULL;
+		dma_unmap_len_set(tx_buf, len, 0);
+
+		/* Next packet */
+		tx_buf++;
+		i++;
+		if (unlikely(i + start == tx_ring->count))
+			tx_buf = tx_ring->tx_bi;
+	}
+}
+
+/**
+ * i40e_xdp_clean_tx_irq - Reclaim resources after transmit completes
+ * @vsi: the VSI we care about
+ * @tx_ring: Tx ring to clean
+ * @napi_budget: Used to determine if we are in netpoll
+ *
+ * Used for XDP packets.
+ * Returns true if there's any budget left (e.g. the clean is finished)
+ **/
+bool i40e_xdp_clean_tx_irq(struct i40e_vsi *vsi,
+			   struct i40e_ring *tx_ring, int napi_budget)
+{
+	u16 i = tx_ring->next_to_clean;
+	struct i40e_tx_buffer *tx_buf;
+	struct i40e_tx_desc *tx_head;
+	struct i40e_tx_desc *tx_desc;
+	unsigned int total_bytes = 0, total_packets = 0;
+	unsigned int budget = vsi->work_limit;
+	struct i40e_tx_buffer *prev_buf;
+	unsigned int packets_completed = 0;
+	u16 start = tx_ring->next_to_clean;
+
+	tx_buf = &tx_ring->tx_bi[i];
+	prev_buf = tx_buf;
+	tx_desc = I40E_TX_DESC(tx_ring, i);
+	i -= tx_ring->count;
+
+	tx_head = I40E_TX_DESC(tx_ring, i40e_get_head(tx_ring));
+
+	do {
+		struct i40e_tx_desc *eop_desc = tx_buf->next_to_watch;
+
+		/* if next_to_watch is not set then there is no work pending */
+		if (!eop_desc)
+			break;
+
+		/* prevent any other reads prior to eop_desc */
+		smp_rmb();
+
+		i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
+		/* we have caught up to head, no work left to do */
+		if (tx_head == tx_desc)
+			break;
+
+		/* clear next_to_watch to prevent false hangs */
+		tx_buf->next_to_watch = NULL;
+
+		/* update the statistics for this packet */
+		total_bytes += tx_buf->bytecount;
+		total_packets += tx_buf->gso_segs;
+
+		if (prev_buf->completion != tx_buf->completion) {
+			prev_buf->completion(start, packets_completed,
+					     prev_buf->ctx1, prev_buf->ctx2);
+			packets_completed = 0;
+			start = i + tx_ring->count - 1;
+		}
+		packets_completed++;
+
+		/* move us one more past the eop_desc for start of next pkt */
+		prev_buf = tx_buf++;
+		tx_desc++;
+		i++;
+		if (unlikely(!i)) {
+			i -= tx_ring->count;
+			tx_buf = tx_ring->tx_bi;
+			tx_desc = I40E_TX_DESC(tx_ring, 0);
+		}
+
+		prefetch(tx_desc);
+
+		/* update budget accounting */
+		budget--;
+	} while (likely(budget));
+
+	if (packets_completed > 0)
+		prev_buf->completion(start, packets_completed,
+				     prev_buf->ctx1, prev_buf->ctx2);
+
+	i += tx_ring->count;
+	tx_ring->next_to_clean = i;
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
+
+	return !!budget;
+}
+
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -829,11 +976,8 @@ bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 		total_bytes += tx_buf->bytecount;
 		total_packets += tx_buf->gso_segs;
 
-		/* free the skb/XDP data */
-		if (ring_is_xdp(tx_ring))
-			page_frag_free(tx_buf->raw_buf);
-		else
-			napi_consume_skb(tx_buf->skb, napi_budget);
+		/* free the skb data */
+		napi_consume_skb(tx_buf->skb, napi_budget);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -887,30 +1031,8 @@ bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
-
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring);
-
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
-
-	if (ring_is_xdp(tx_ring))
-		return !!budget;
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
 
 	/* notify netdev of completed buffers */
 	netdev_tx_completed_queue(txring_txq(tx_ring),
@@ -3182,6 +3304,8 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
 	tx_bi->bytecount = size;
 	tx_bi->gso_segs = 1;
 	tx_bi->raw_buf = xdp->data;
+	tx_bi->completion = i40e_xdp_tx_completion;
+	tx_bi->ctx1 = (unsigned long)xdp_ring;
 
 	/* record length, and DMA address */
 	dma_unmap_len_set(tx_bi, len, size);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index cbb1bd261e6a..6f8ce176e509 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -261,6 +261,9 @@ struct i40e_tx_buffer {
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
+	tx_completion_func completion;
+	unsigned long ctx1;
+	unsigned long ctx2;
 
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
@@ -484,7 +487,8 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
 void i40e_xdp_flush(struct net_device *dev);
 bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 		       struct i40e_ring *tx_ring, int napi_budget);
-
+bool i40e_xdp_clean_tx_irq(struct i40e_vsi *vsi,
+			   struct i40e_ring *tx_ring, int napi_budget);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 24/24] i40e: Tx support for zero copy allocator
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (22 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 23/24] i40e: introduced Tx completion callbacks Björn Töpel
@ 2018-01-31 13:53 ` Björn Töpel
  2018-02-01 16:42 ` [RFC PATCH 00/24] Introducing AF_XDP support Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-01-31 13:53 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Add Tx support for the zero copy allocator, to be used with AF_XDP
sockets.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
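
Note (not part of the commit message): a simplified sketch of the Tx
side of XDP_REGISTER_XSK/XDP_UNREGISTER_XSK as implemented here -- the
socket's get_tx_packet and tx_completion callbacks are stored on the
XDP Tx ring so the transmit path can pull and complete packets, and are
cleared again on unregister. Names and types are illustrative
stand-ins, not the i40e code.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;

/* Illustrative callback types, modelled on struct xsk_tx_parms. */
typedef int (*get_tx_packet_fn)(void *dev, u32 queue_id,
				void **data, u32 *len, u32 *offset);
typedef void (*tx_completion_fn)(u32 start, u32 npackets,
				 unsigned long ctx1, unsigned long ctx2);

/* Simplified stand-in for the Tx-related fields of the XDP ring. */
struct xdp_ring_sketch {
	get_tx_packet_fn get_packet;
	tx_completion_fn complete;
	unsigned long comp_ctx1, comp_ctx2;
};

/* XDP_REGISTER_XSK, Tx side: remember the socket's callbacks on the
 * XDP Tx ring so the transmit path can use them.
 */
static int xsk_tx_enable_sketch(struct xdp_ring_sketch *ring,
				get_tx_packet_fn get_tx_packet,
				tx_completion_fn tx_completion,
				unsigned long ctx1, unsigned long ctx2)
{
	if (!get_tx_packet || !tx_completion)
		return -EINVAL;

	ring->get_packet = get_tx_packet;
	ring->complete = tx_completion;
	ring->comp_ctx1 = ctx1;
	ring->comp_ctx2 = ctx2;
	return 0;
}

/* XDP_UNREGISTER_XSK, Tx side: drop the callbacks again. */
static void xsk_tx_disable_sketch(struct xdp_ring_sketch *ring)
{
	ring->get_packet = NULL;
	ring->complete = NULL;
	ring->comp_ctx1 = 0;
	ring->comp_ctx2 = 0;
}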
 drivers/net/ethernet/intel/i40e/i40e_main.c |  86 ++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 218 +++++++++++++++++++++-------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  18 ++-
 3 files changed, 255 insertions(+), 67 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 95b8942a31ae..85cde6b228e6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10027,6 +10027,22 @@ static void i40e_vsi_clear_rings(struct i40e_vsi *vsi)
 	}
 }
 
+static void i40e_restore_xsk_tx(struct i40e_vsi *vsi, int qid,
+				struct i40e_ring *xdp_ring)
+{
+	if (!vsi->xsk_ctxs)
+		return;
+
+	if (qid < 0 || qid >= vsi->num_xsk_ctxs)
+		return;
+
+	if (!vsi->xsk_ctxs[qid])
+		return;
+
+	xdp_ring->xsk_tx_completion = vsi->xsk_ctxs[qid]->tx_comp;
+	xdp_ring->get_packet = vsi->xsk_ctxs[qid]->get_tx_packet;
+}
+
 /**
  * i40e_alloc_rings - Allocates the Rx and Tx rings for the provided VSI
  * @vsi: the VSI being configured
@@ -10072,10 +10088,13 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->size = 0;
 		ring->dcb_tc = 0;
 		ring->clean_tx = i40e_xdp_clean_tx_irq;
+		ring->xdp_tx_completion.func = i40e_xdp_tx_completion;
+		ring->xdp_tx_completion.ctx1 = (unsigned long)ring;
 		if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
 			ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
 		set_ring_xdp(ring);
 		ring->tx_itr_setting = pf->tx_itr_default;
+		i40e_restore_xsk_tx(vsi, i, ring);
 		vsi->xdp_rings[i] = ring++;
 
 setup_rx:
@@ -11985,8 +12004,54 @@ static void i40e_remove_xsk_ctx(struct i40e_vsi *vsi, int queue_id)
 	i40e_free_xsk_ctxs_if_last(vsi);
 }
 
-static int i40e_xsk_enable(struct net_device *netdev, u32 qid,
-			   struct xsk_rx_parms *parms)
+static int i40e_xsk_tx_enable(struct i40e_vsi *vsi, u32 qid,
+			      struct xsk_tx_parms *parms)
+{
+	struct i40e_ring *xdp_ring = NULL;
+	int err;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (!parms->tx_completion || !parms->get_tx_packet)
+		return -EINVAL;
+
+	err = parms->dma_map(parms->buff_pool, &vsi->back->pdev->dev,
+			     DMA_TO_DEVICE, 0);
+	if (err)
+		return err;
+
+	vsi->xsk_ctxs[qid]->tx_comp.func = parms->tx_completion;
+	vsi->xsk_ctxs[qid]->tx_comp.ctx1 = parms->ctx1;
+	vsi->xsk_ctxs[qid]->tx_comp.ctx2 = parms->ctx2;
+	vsi->xsk_ctxs[qid]->get_tx_packet = parms->get_tx_packet;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		xdp_ring = vsi->tx_rings[qid + vsi->alloc_queue_pairs];
+		xdp_ring->xsk_tx_completion.func = parms->tx_completion;
+		xdp_ring->xsk_tx_completion.ctx1 = parms->ctx1;
+		xdp_ring->xsk_tx_completion.ctx2 = parms->ctx2;
+		xdp_ring->get_packet = parms->get_tx_packet;
+	}
+
+	return 0;
+}
+
+static void i40e_xsk_tx_disable(struct i40e_vsi *vsi, u32 queue_id)
+{
+	struct i40e_ring *xdp_ring;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		xdp_ring = vsi->tx_rings[queue_id + vsi->alloc_queue_pairs];
+		xdp_ring->xsk_tx_completion.func = NULL;
+		xdp_ring->xsk_tx_completion.ctx1 = 0;
+		xdp_ring->xsk_tx_completion.ctx2 = 0;
+		xdp_ring->get_packet = NULL;
+	}
+}
+
+static int i40e_xsk_rx_enable(struct net_device *netdev, u32 qid,
+			      struct xsk_rx_parms *parms)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
@@ -12029,8 +12094,8 @@ static int i40e_xsk_enable(struct net_device *netdev, u32 qid,
 	return 0;
 }
 
-static int i40e_xsk_disable(struct net_device *netdev, u32 qid,
-			    struct xsk_rx_parms *parms)
+static int i40e_xsk_rx_disable(struct net_device *netdev, u32 qid,
+			       struct xsk_rx_parms *parms)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
@@ -12069,6 +12134,7 @@ static int i40e_xdp(struct net_device *dev,
 {
 	struct i40e_netdev_priv *np = netdev_priv(dev);
 	struct i40e_vsi *vsi = np->vsi;
+	int err;
 
 	if (vsi->type != I40E_VSI_MAIN)
 		return -EINVAL;
@@ -12081,11 +12147,14 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
 	case XDP_REGISTER_XSK:
-		return i40e_xsk_enable(dev, xdp->xsk.queue_id,
-				       xdp->xsk.rx_parms);
+		err = i40e_xsk_rx_enable(dev, xdp->xsk.queue_id,
+					 xdp->xsk.rx_parms);
+		return err ?: i40e_xsk_tx_enable(vsi, xdp->xsk.queue_id,
+						 xdp->xsk.tx_parms);
 	case XDP_UNREGISTER_XSK:
-		return i40e_xsk_disable(dev, xdp->xsk.queue_id,
-					xdp->xsk.rx_parms);
+		i40e_xsk_tx_disable(vsi, xdp->xsk.queue_id);
+		return i40e_xsk_rx_disable(dev, xdp->xsk.queue_id,
+					   xdp->xsk.rx_parms);
 	default:
 		return -EINVAL;
 	}
@@ -12125,6 +12194,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bridge_setlink	= i40e_ndo_bridge_setlink,
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
+	.ndo_xdp_xmit_xsk	= i40e_xdp_xmit_xsk,
 	.ndo_xdp_flush		= i40e_xdp_flush,
 };
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 7ab49146d15c..7e9453514df0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -627,16 +627,21 @@ static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
  * @ring:      the ring that owns the buffer
  * @tx_buffer: the buffer to free
  **/
-static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
-					    struct i40e_tx_buffer *tx_buffer)
+static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring, int i)
 {
+	struct i40e_tx_buffer *tx_buffer = &ring->tx_bi[i];
+
 	if (tx_buffer->skb) {
-		if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
+		if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB) {
 			kfree(tx_buffer->raw_buf);
-		else if (ring_is_xdp(ring))
-			page_frag_free(tx_buffer->raw_buf);
-		else
+		} else if (ring_is_xdp(ring)) {
+			struct i40e_tx_completion *comp =
+				tx_buffer->completion;
+
+			comp->func(i, 1, comp->ctx1, comp->ctx2);
+		} else {
 			dev_kfree_skb_any(tx_buffer->skb);
+		}
 		if (dma_unmap_len(tx_buffer, len))
 			dma_unmap_single(ring->dev,
 					 dma_unmap_addr(tx_buffer, dma),
@@ -670,7 +675,7 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 
 	/* Free all the Tx ring sk_buffs */
 	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+		i40e_unmap_and_free_tx_resource(tx_ring, i);
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
@@ -781,6 +786,16 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+
+	writel(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
 #define WB_STRIDE 4
 
 static void i40e_update_stats_and_arm_wb(struct i40e_ring *tx_ring,
@@ -812,8 +827,8 @@ static void i40e_update_stats_and_arm_wb(struct i40e_ring *tx_ring,
 	}
 }
 
-static void i40e_xdp_tx_completion(u32 start, u32 npackets,
-				   unsigned long ctx1, unsigned long ctx2)
+void i40e_xdp_tx_completion(u32 start, u32 npackets,
+			    unsigned long ctx1, unsigned long ctx2)
 {
 	struct i40e_ring *tx_ring = (struct i40e_ring *)ctx1;
 	struct i40e_tx_buffer *tx_buf;
@@ -843,6 +858,72 @@ static void i40e_xdp_tx_completion(u32 start, u32 npackets,
 	}
 }
 
+static bool i40e_xmit_xsk(struct i40e_ring *xdp_ring)
+{
+	struct i40e_vsi *vsi = xdp_ring->vsi;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	bool packets_pending = false;
+	bool packets_sent = false;
+	dma_addr_t dma;
+	void *data;
+	u32 offset;
+	u32 len;
+
+	if (!xdp_ring->get_packet)
+		return 0;
+
+	if (unlikely(!I40E_DESC_UNUSED(xdp_ring))) {
+		xdp_ring->tx_stats.tx_busy++;
+		return false;
+	}
+
+	packets_pending = xdp_ring->get_packet(xdp_ring->netdev,
+					       xdp_ring->queue_index -
+					       vsi->alloc_queue_pairs,
+					       &dma, &data, &len, &offset);
+	while (packets_pending) {
+		packets_sent = true;
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_TO_DEVICE);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+		tx_bi->gso_segs = 1;
+		tx_bi->raw_buf = data;
+		tx_bi->completion = &xdp_ring->xsk_tx_completion;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+							| I40E_TX_DESC_CMD_EOP,
+							  0, len, 0);
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+
+		packets_pending = xdp_ring->get_packet(xdp_ring->netdev,
+						       xdp_ring->queue_index -
+						       vsi->alloc_queue_pairs,
+						       &dma, &data, &len,
+						       &offset);
+		if (unlikely(!I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			break;
+		}
+	}
+
+	/* Request an interrupt for the last frame. */
+	if (packets_sent)
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+
+	i40e_xdp_ring_update_tail(xdp_ring);
+
+	return !packets_pending;
+}
+
 /**
  * i40e_xdp_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -855,7 +936,6 @@ static void i40e_xdp_tx_completion(u32 start, u32 npackets,
 bool i40e_xdp_clean_tx_irq(struct i40e_vsi *vsi,
 			   struct i40e_ring *tx_ring, int napi_budget)
 {
-	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
 	struct i40e_tx_desc *tx_head;
 	struct i40e_tx_desc *tx_desc;
@@ -864,50 +944,43 @@ bool i40e_xdp_clean_tx_irq(struct i40e_vsi *vsi,
 	struct i40e_tx_buffer *prev_buf;
 	unsigned int packets_completed = 0;
 	u16 start = tx_ring->next_to_clean;
+	bool xmit_done;
 
-	tx_buf = &tx_ring->tx_bi[i];
+	tx_buf = &tx_ring->tx_bi[tx_ring->next_to_clean];
 	prev_buf = tx_buf;
-	tx_desc = I40E_TX_DESC(tx_ring, i);
-	i -= tx_ring->count;
 
 	tx_head = I40E_TX_DESC(tx_ring, i40e_get_head(tx_ring));
 
 	do {
-		struct i40e_tx_desc *eop_desc = tx_buf->next_to_watch;
-
-		/* if next_to_watch is not set then there is no work pending */
-		if (!eop_desc)
-			break;
+		tx_desc = I40E_TX_DESC(tx_ring, tx_ring->next_to_clean);
 
-		/* prevent any other reads prior to eop_desc */
-		smp_rmb();
-
-		i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
 		/* we have caught up to head, no work left to do */
 		if (tx_head == tx_desc)
 			break;
 
-		/* clear next_to_watch to prevent false hangs */
-		tx_buf->next_to_watch = NULL;
+		tx_desc->buffer_addr = 0;
+		tx_desc->cmd_type_offset_bsz = 0;
 
 		/* update the statistics for this packet */
 		total_bytes += tx_buf->bytecount;
 		total_packets += tx_buf->gso_segs;
 
 		if (prev_buf->completion != tx_buf->completion) {
-			prev_buf->completion(start, packets_completed,
-					     prev_buf->ctx1, prev_buf->ctx2);
+			struct i40e_tx_completion *comp = prev_buf->completion;
+
+			comp->func(start, packets_completed,
+				   comp->ctx1, comp->ctx2);
 			packets_completed = 0;
-			start = i + tx_ring->count - 1;
+			start = tx_ring->next_to_clean;
 		}
 		packets_completed++;
 
 		/* move us one more past the eop_desc for start of next pkt */
 		prev_buf = tx_buf++;
 		tx_desc++;
-		i++;
-		if (unlikely(!i)) {
-			i -= tx_ring->count;
+		tx_ring->next_to_clean++;
+		if (unlikely(tx_ring->next_to_clean == tx_ring->count)) {
+			tx_ring->next_to_clean = 0;
 			tx_buf = tx_ring->tx_bi;
 			tx_desc = I40E_TX_DESC(tx_ring, 0);
 		}
@@ -918,16 +991,18 @@ bool i40e_xdp_clean_tx_irq(struct i40e_vsi *vsi,
 		budget--;
 	} while (likely(budget));
 
-	if (packets_completed > 0)
-		prev_buf->completion(start, packets_completed,
-				     prev_buf->ctx1, prev_buf->ctx2);
+	if (packets_completed > 0) {
+		struct i40e_tx_completion *comp = prev_buf->completion;
+
+		comp->func(start, packets_completed, comp->ctx1, comp->ctx2);
+	}
 
-	i += tx_ring->count;
-	tx_ring->next_to_clean = i;
 	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
 				     total_bytes, budget);
 
-	return !!budget;
+	xmit_done = i40e_xmit_xsk(tx_ring);
+
+	return !!budget && xmit_done;
 }
 
 /**
@@ -1001,7 +1076,7 @@ bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 				i -= tx_ring->count;
 				tx_buf = tx_ring->tx_bi;
 				tx_desc = I40E_TX_DESC(tx_ring, 0);
-			}
+		}
 
 			/* unmap any remaining paged data */
 			if (dma_unmap_len(tx_buf, len)) {
@@ -2086,16 +2161,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	return ERR_PTR(-result);
 }
 
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
-{
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.
-	 */
-	wmb();
-
-	writel(xdp_ring->next_to_use, xdp_ring->tail);
-}
-
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -3264,7 +3329,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	/* clear dma mappings for failed tx_bi map */
 	for (;;) {
 		tx_bi = &tx_ring->tx_bi[i];
-		i40e_unmap_and_free_tx_resource(tx_ring, tx_bi);
+		i40e_unmap_and_free_tx_resource(tx_ring, i);
 		if (tx_bi == first)
 			break;
 		if (i == 0)
@@ -3304,8 +3369,7 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
 	tx_bi->bytecount = size;
 	tx_bi->gso_segs = 1;
 	tx_bi->raw_buf = xdp->data;
-	tx_bi->completion = i40e_xdp_tx_completion;
-	tx_bi->ctx1 = (unsigned long)xdp_ring;
+	tx_bi->completion = &xdp_ring->xdp_tx_completion;
 
 	/* record length, and DMA address */
 	dma_unmap_len_set(tx_bi, len, size);
@@ -3317,16 +3381,10 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
 						  | I40E_TXD_CMD,
 						  0, size, 0);
 
-	/* Make certain all of the status bits have been updated
-	 * before next_to_watch is written.
-	 */
-	smp_wmb();
-
 	i++;
 	if (i == xdp_ring->count)
 		i = 0;
 
-	tx_bi->next_to_watch = tx_desc;
 	xdp_ring->next_to_use = i;
 
 	return I40E_XDP_TX;
@@ -3501,6 +3559,54 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	return 0;
 }
 
+/**
+ * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
+/**
+ * i40e_xdp_xmit_xsk - Implements ndo_xdp_xmit_xsk
+ * @dev: netdev
+ * @queue_id: queue pair index
+ *
+ * Returns zero if sent, else an error code
+ **/
+int i40e_xdp_xmit_xsk(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *tx_ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -EAGAIN;
+
+	tx_ring = vsi->tx_rings[queue_id + vsi->alloc_queue_pairs];
+
+	if (!i40e_napi_is_scheduled(&tx_ring->q_vector->napi))
+		i40e_force_wb(vsi, tx_ring->q_vector);
+
+	return 0;
+}
+
 /**
  * i40e_xdp_flush - Implements ndo_xdp_flush
  * @dev: netdev
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 6f8ce176e509..81c47ec77ec9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -253,17 +253,21 @@ typedef int (*get_tx_packet_func)(struct net_device *dev, u32 queue_id,
 				  dma_addr_t *dma, void **data, u32 *len,
 				  u32 *offset);
 
+struct i40e_tx_completion {
+	tx_completion_func func;
+	unsigned long ctx1;
+	unsigned long ctx2;
+};
+
 struct i40e_tx_buffer {
 	struct i40e_tx_desc *next_to_watch;
 	union {
 		struct sk_buff *skb;
 		void *raw_buf;
 	};
+	struct i40e_tx_completion *completion;
 	unsigned int bytecount;
 	unsigned short gso_segs;
-	tx_completion_func completion;
-	unsigned long ctx1;
-	unsigned long ctx2;
 
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
@@ -306,6 +310,8 @@ struct i40e_xsk_ctx {
 	struct buff_pool *buff_pool;
 	void *err_ctx;
 	void (*err_handler)(void *ctx, int errno);
+	struct i40e_tx_completion tx_comp;
+	get_tx_packet_func get_tx_packet;
 };
 
 /* some useful defines for virtchannel interface, which
@@ -395,6 +401,9 @@ struct i40e_ring {
 	struct i40e_xsk_ctx *xsk;
 	bool (*clean_tx)(struct i40e_vsi *vsi,
 			 struct i40e_ring *tx_ring, int napi_budget);
+	get_tx_packet_func get_packet;
+	struct i40e_tx_completion xdp_tx_completion;
+	struct i40e_tx_completion xsk_tx_completion;
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -484,11 +493,14 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+int i40e_xdp_xmit_xsk(struct net_device *dev, u32 queue_id);
 void i40e_xdp_flush(struct net_device *dev);
 bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 		       struct i40e_ring *tx_ring, int napi_budget);
 bool i40e_xdp_clean_tx_irq(struct i40e_vsi *vsi,
 			   struct i40e_ring *tx_ring, int napi_budget);
+void i40e_xdp_tx_completion(u32 start, u32 npackets,
+			    unsigned long ctx1, unsigned long ctx2);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (23 preceding siblings ...)
  2018-01-31 13:53 ` [RFC PATCH 24/24] i40e: Tx support for zero copy allocator Björn Töpel
@ 2018-02-01 16:42 ` Jesper Dangaard Brouer
  2018-02-02 10:31 ` Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2018-02-01 16:42 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, willemdebruijn.kernel, daniel, netdev,
	Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, brouer,
	Saeed Mahameed



On Wed, 31 Jan 2018 14:53:32 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:

> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know this has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free?  Hopefully, the same solution can be used here
>   as in the first problem item in this section.

I'm prototyping an extension of ndo_xdp_xmit with a free/completion
function call that looks at the xdp_rxq_info to determine which
allocator type the RX NIC used (the info is per RX queue) and invokes
the appropriate callback.
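
A minimal sketch of the idea (none of these names exist in this RFC or
in net-next; the mem-type field in xdp_rxq_info and the xsk completion
hook below are assumptions, just for illustration):

enum xdp_mem_type {
	MEM_TYPE_PAGE_ORDER0,	/* plain page allocator (XDP_DRV today) */
	MEM_TYPE_PAGE_POOL,	/* page_pool recycle allocator          */
	MEM_TYPE_XSK_ZC,	/* AF_XDP zero-copy umem                */
};

struct xdp_mem_info {
	enum xdp_mem_type type;
	void *allocator;	/* e.g. struct page_pool *              */
};

/* Called from the TX driver's DMA completion instead of a hard-coded
 * page_frag_free(), so the RX-side allocator gets its buffer back.
 */
static void xdp_return_frame(void *data, struct xdp_mem_info *mem)
{
	switch (mem->type) {
	case MEM_TYPE_PAGE_POOL:
		page_pool_put_page(mem->allocator, virt_to_head_page(data));
		break;
	case MEM_TYPE_XSK_ZC:
		xsk_zc_tx_completion(mem->allocator, data); /* hypothetical */
		break;
	case MEM_TYPE_PAGE_ORDER0:
	default:
		page_frag_free(data);
		break;
	}
}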

I dusted off my old page_pool implementation (modifying it to run
outside the page allocator), implemented XDP_REDIRECT for mlx5,
extended xdp_rxq_info, and stored the needed info in ixgbe for DMA TX
completion.  I disabled the mlx5 page cache and used the page_pool
instead.

It worked surprisingly well.  The test is: pktgen into an mlx5
100 Gbit/s NIC, XDP_REDIRECT with the xdp_redirect_map sample, out a
10G ixgbe NIC.

Performance is surprisingly good.  I'm testing DMA TX completion on
ixgbe, which calls "xdp_return_frame", mapped to
page_pool_put_page(pool, page).  Here DMA TX completion runs on CPU#3
and mlx5 RX runs on CPU#0.  (Internally page_pool uses a ptr_ring,
which is what gives the good cross-CPU performance.)
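
To illustrate the cross-CPU path, a rough sketch (assuming the pool
embeds a ptr_ring named "ring"; the real internals may end up looking
different):

/* TX DMA-completion side (CPU#3): hand the page back via the ptr_ring */
static bool page_pool_recycle(struct page_pool *pool, struct page *page)
{
	return ptr_ring_produce(&pool->ring, page) == 0;
}

/* RX alloc side (CPU#0): refill from the ring, page allocator as slow path */
static struct page *page_pool_alloc(struct page_pool *pool, gfp_t gfp)
{
	struct page *page = ptr_ring_consume(&pool->ring);

	return page ? page : alloc_page(gfp);
}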

Show adapter(s) (ixgbe2 mlx5p2) statistics (ONLY that changed!)
Ethtool(ixgbe2  ) stat:    810562253 (    810,562,253) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    864600261 (    864,600,261) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:     13509371 (     13,509,371) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:     13509380 (     13,509,380) <= tx_pkts_nic /sec
Ethtool(mlx5p2  ) stat:     36827369 (     36,827,369) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2  ) stat:   2356953271 (  2,356,953,271) <= rx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:     23313782 (     23,313,782) <= rx_discards_phy /sec
Ethtool(mlx5p2  ) stat:         3019 (          3,019) <= rx_out_of_buffer /sec
Ethtool(mlx5p2  ) stat:     36827395 (     36,827,395) <= rx_packets_phy /sec
Ethtool(mlx5p2  ) stat:   2356924099 (  2,356,924,099) <= rx_prio0_bytes /sec
Ethtool(mlx5p2  ) stat:     13513560 (     13,513,560) <= rx_prio0_packets /sec
Ethtool(mlx5p2  ) stat:    810820253 (    810,820,253) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:     13513672 (     13,513,672) <= rx_vport_unicast_packets /sec

If I only disabled the mlx5 page cache (no page_pool), then
single-flow performance was 6 Mpps, and if I started two flows the
collective performance dropped to 4 Mpps, because we hit the page
allocator lock (further negative scaling occurs).

If I keep the mlx5 cache, I see between 7 and 11 Mpps, which varies
depending on the ixgbe TX-ring size and DMA-completion interrupt
levels.


For AF_XDP, we just register another free/completion callback function.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (24 preceding siblings ...)
  2018-02-01 16:42 ` [RFC PATCH 00/24] Introducing AF_XDP support Jesper Dangaard Brouer
@ 2018-02-02 10:31 ` Jesper Dangaard Brouer
  2018-02-05 15:05 ` Björn Töpel
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2018-02-02 10:31 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, willemdebruijn.kernel, daniel, netdev,
	Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, brouer

On Wed, 31 Jan 2018 14:53:32 +0100
Björn Töpel <bjorn.topel@gmail.com> wrote:

> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>  
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP 31.3 Mpps
> xdp_rxq_info --action XDP_TX   16.7 Mpps
>  
> XDP performance with this RFC i.e. with the buffer allocator:
> XDP_DROP 21.0 Mpps
> XDP_TX   11.9 Mpps
>  
> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5

My numbers from before:
        V4+ZC
rxdrop  35.2 Mpps
txpush  20.7 Mpps
l2fwd   16.9 Mpps

> AF_XDP performance:
> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> rxdrop      3.3        11.6         16.9
> txpush      2.2         NA*         21.8
> l2fwd       1.7         NA*         10.4

The numbers on my system are better than on your system. Compared to
my own earlier results, txpush is almost the same, and the l2fwd gap
is smaller for me.

The surprise is the drop in 'rxdrop' performance.

        XDP_DRV_ZC
rxdrop  22.0 Mpps
txpush  20.9 Mpps
l2fwd   14.2 Mpps


BUT it also seems you have generally slowed down the XDP_DROP results
for i40e:

Before:
 sudo ./xdp_bench01_mem_access_cost --dev i40e1
 XDP_DROP     35878204   35,878,204         no_touch  

After this patchset:
 $ sudo ./xdp_bench01_mem_access_cost --dev i40e1
 XDP_action   pps        pps-human-readable mem      
 XDP_DROP     28992009   28,992,009         no_touch  

And if I read the packet data:
 sudo ./xdp_bench01_mem_access_cost --dev i40e1 --read
 XDP_action   pps        pps-human-readable mem      
 XDP_DROP     25107793   25,107,793         read      


BTW, see you soon in Brussels (FOSDEM18) ...
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ sudo ./xdpsock --rxdrop --interface=i40e1 --queue=11
[...]
i40e1:11 rxdrop 	
                pps         pkts        60.01      
rx              22,040,099  1,322,572,352
tx              0           0          

$ sudo ./xdpsock --txonly --interface=i40e1 --queue=11
[...]
i40e1:11 txonly 	
                pps         pkts        239.03     
rx              0           0          
tx              20,937,885  5,004,790,500


$ sudo ./xdpsock --l2fwd --interface=i40e1 --queue=11
[...]
i40e1:11 l2fwd 	
                pps         pkts        152.02     
rx              14,244,719  2,165,460,044
tx              14,244,718  2,165,459,915



My before results:

 $ sudo ./bench_all.sh
 You might want to change the parameters in ./bench_all.sh
 i40e1 cpu5 duration 30s zc 16
 i40e1 v2 rxdrop    duration 29.27s rx:    62959986pkts @  2150794.94pps
 i40e1 v3 rxdrop    duration 29.18s rx:    68470248pkts @  2346658.86pps
 i40e1 v4 rxdrop    duration 29.45s rx:    68900864pkts @  2339633.99pps
 i40e1 v4 rxdrop zc duration 29.36s rx:  1033722048pkts @ 35206198.62pps

 i40e1 v2 txonly    duration 29.16s tx:  63272640pkts @     2169632.53pps.
 i40e1 v3 txonly    duration 29.14s tx:  62531968pkts @     2145714.21pps.
 i40e1 v4 txonly    duration 29.48s tx:  40587316pkts @     1376761.87pps.
 i40e1 v4 txonly zc duration 29.36s tx: 608794761pkts @    20738953.62pps.

 i40e1 v2 l2fwd    duration 29.19s rx:  57532736pkts @  1970885.56pps
                                   tx   57532672pkts @  1970883.37pps.
 i40e1 v3 l2fwd    duration 29.16s rx:  57675961pkts @  1978149.64pps
                                   tx:  57675897pkts @  1978147.44pps.
 i40e1 v4 l2fwd    duration 29.51s rx:     29732pkts @     1007.58pps
                                   tx:     28708pkts @      972.88pps.
 i40e1 v4 l2fwd zc duration 29.32s rx: 497528256pkts @ 16969091.01pps
                                   tx: 497527296pkts @ 16969058.27pps.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect
  2018-01-31 13:53 ` [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect Björn Töpel
@ 2018-02-05 13:42   ` Jesper Dangaard Brouer
  2018-02-07 21:11     ` Björn Töpel
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2018-02-05 13:42 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, willemdebruijn.kernel, daniel, netdev,
	Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, brouer

On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:

> The bpf_xdpsk_redirect call redirects the XDP context to the XDP
> socket bound to the receiving queue (if any).

As I explained in person at FOSDEM, my suggestion is to use the
bpf-map infrastructure for AF_XDP redirect, but people on this
upstream mailing list also need a chance to validate my idea ;-)

The important thing to keep in mind is how we can still maintain an
SPSC (Single Producer Single Consumer) relationship between an
RX-queue and a userspace consumer process.

This AF_XDP "FOSDEM" patchset, store the "xsk" (xdp_sock) pointer
directly in the net_device (_rx[].netdev_rx_queue.xs) structure.  This
limit each RX-queue to service a single xdp_sock.  It sounds good from
a SPSC pov, but not very flexible.  With a "xdp_sock_map" we can get
the flexibility to select among multiple xdp_sock'ets (via XDP
pre-filter selecting a different map), and still maintain a SPSC
relationship.  The RX-queue will just have several SPSC relationships
with the individual xdp_sock's.

This is true for the AF_XDP copy mode, and it requires fewer driver
changes.  For the AF_XDP zero-copy (ZC) mode, drivers need significant
changes anyhow, and in the ZC case we will have to disallow multiple
xdp_socks, which is simply achieved by checking whether the xdp_sock
pointer returned from the map lookup matches the one for which
userspace requested the driver to register its memory for the
RX-rings.

The "xdp_sock_map" is an array, where the index correspond to the
queue_index.  The bpf_redirect_map() ignore the specified index and
instead use xdp_rxq_info->queue_index in the lookup.
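
On the BPF program side it could look something like this (the
BPF_MAP_TYPE_XSKMAP map type is of course only proposed here and does
not exist; the rest uses the existing bpf_redirect_map() helper):

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsk_map = {
	.type		= BPF_MAP_TYPE_XSKMAP,	/* proposed map type */
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),		/* xdp_sock fd from userspace */
	.max_entries	= 64,			/* >= number of RX queues */
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* The index argument is ignored in this proposal; the kernel
	 * side uses xdp_rxq_info->queue_index to keep the SPSC
	 * relationship.
	 */
	return bpf_redirect_map(&xsk_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";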

Notice that a bpf-map has no pinned relationship with the device or
the loaded XDP prog.  Thus, userspace needs to bind() this map to the
device before traffic can flow, like the proposed bind() on the
xdp_sock.  This is to establish the SPSC binding.  My proposal is that
userspace inserts the xdp_sock file descriptor(s) into the map at the
queue index, and the map (which is also just a file descriptor) is
bound, maybe via bind(), to a specific device (via the ifindex).  The
kernel side will walk the map and perform the required actions on the
xdp_socks it finds in the map.
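
In userspace the flow could then be roughly (xsk_map_bind() and
BPF_MAP_TYPE_XSKMAP are made up for illustration; bpf_create_map() and
bpf_map_update_elem() are the existing tools/lib/bpf wrappers):

	int qid = 16;	/* the RX queue the traffic is steered to */
	int map_fd, xsk_fd;

	map_fd = bpf_create_map(BPF_MAP_TYPE_XSKMAP, sizeof(int),
				sizeof(int), 64, 0);
	xsk_fd = socket(AF_XDP, SOCK_RAW, 0);

	/* ... XDP_MEM_REG / XDP_RX_RING / XDP_TX_RING setsockopts ... */

	/* socket fd at its queue index establishes the SPSC pairing */
	bpf_map_update_elem(map_fd, &qid, &xsk_fd, BPF_ANY);

	/* tie the whole map to a device before traffic can flow */
	xsk_map_bind(map_fd, if_nametoindex("p3p2"));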

The TX-side is harder, as now multiple xdp_socks can have the same
queue-pair ID on the same net_device.  But Magnus proposes that this
can be solved in hardware: newer NICs have many TX-queues, the
queue-pair ID is just an externally visible number, and the
kernel-internal structure can point to a dedicated TX-queue per
xdp_sock.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (25 preceding siblings ...)
  2018-02-02 10:31 ` Jesper Dangaard Brouer
@ 2018-02-05 15:05 ` Björn Töpel
  2018-02-07 15:54   ` Willem de Bruijn
  2018-02-07 17:59 ` Tom Herbert
  2018-03-26 16:06 ` William Tu
  28 siblings, 1 reply; 50+ messages in thread
From: Björn Töpel @ 2018-02-05 15:05 UTC (permalink / raw)
  To: Bjorn Topel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Netdev
  Cc: Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Shaw, Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

2018-01-31 14:53 GMT+01:00 Björn Töpel <bjorn.topel@gmail.com>:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All these three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
>

We realized, a bit late maybe, that 24 patches is a bit of a
mouthful, so let me try to make it more palatable.

Patches 1 to 7 introduce AF_XDP socket support with copy semantics
(requiring no driver changes). Patch 8 adds XDP_REDIRECT support to
i40e, and patch 9 is the test application.

The rest of the patches enable zero-copy support, and they're
messier. So, if you don't really care about zero-copy, just have a
look at patches 1 to 7.

We'd really appreciate your thoughts on the user space APIs (including
the bpf APIs).

For the next review, we'll keep the set smaller, and introduce many of
the i40e patches as pre-patches.


Björn


> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed that uses SKBs together with the
> generic XDP support and copies out the data to user space. A fallback
> mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode though requires some extensions
> to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we run into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the xdp socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and when they have been transmitted it uses another
> callback (tx_completion) to signal completion of packets. These
> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> and thus does not clash with the XDP_REDIRECT use of
> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> (without ZC) is currently not supported by TX. Please have a look at
> the challenges section for further discussions.
>
> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> so the user needs to steer the traffic to the zero-copy enabled queue
> pair. Which queue to use, is up to the user.
>
> For an untrusted application, HW packet steering to a specific queue
> pair (the one associated with the application) is a requirement, as
> the application would otherwise be able to see other user space
> processes' packets. If the HW cannot support the required packet
> steering, XDP_DRV or XDP_SKB mode have to be used as they do not
> expose the NIC's packet buffer into user space as the packets are
> copied into user space from the NIC's packet buffer in the kernel.
>
> There is a xdpsock benchmarking/test application included. Say that
> you would like your UDP traffic from port 4242 to end up in queue 16,
> that we will enable AF_XDP on. Here, we use ethtool for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP 31.3 Mpps
> xdp_rxq_info --action XDP_TX   16.7 Mpps
>
> XDP performance with this RFC i.e. with the buffer allocator:
> XDP_DROP 21.0 Mpps
> XDP_TX   11.9 Mpps
>
> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5
>
> AF_XDP performance:
> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> rxdrop      3.3        11.6         16.9
> txpush      2.2         NA*         21.8
> l2fwd       1.7         NA*         10.4
>
> * NA since there is no XDP_DRV mode (without ZC) for TX in this RFC,
>   see challenges below.
>
> If we start by comparing XDP_SKB performance with copy mode in
> AF_PACKET V4, we can see that AF_XDP delivers 3-5 times the
> throughput, which is positive. We are also happy with the XDP_DRV
> performance that provides 11.6 Mpps for rxdrop, and should work on any
> driver implementing full XDP support. Now to the problematic part:
> XDP_DRV_ZC. The txpush (TX only) benchmark shows decent results at
> 21.8 Mpps and is better than it was for V4, even though we have spent
> no time optimizing the code in AF_XDP. (We did that in AF_PACKET V4.)
> But the RX performance is sliced by half, which is not good. The
> reason for this is, for the major part, the new buffer allocator which
> is used for RX ZC only (at this point, see todo section). If you take
> a look at the XDP baseline numbers, introducing the buffer pool
> allocator drops the performance by around 30% or 10 Mpps which is
> obviously not acceptable. We clearly need to give this code some
> overdue performance love. But the overhanging question is how much
> overhead it will produce in the end and if this will be
> acceptable. Another thing to note is that V4 provided 33.7 Mpps for
> rxdrop, but with AF_XDP we are quite unlikely to get above the
> XDP_DROP number of 31.3, since we are reusing the XDP infrastructure
> and driver code on the RX side. So in the end, the AF_XDP XDP_DRV_ZC
> numbers will likely be lower than the V4 ZC numbers.
>
> We based this patch set on net-next commit 91e6dd828425 ("ipmr: Fix
> ptrdiff_t print formatting").
>
> Challenges: areas we would really appreciate your help on and that we
> are having substantial problems with.
>
> * We would like to, if possible, use ndo_xdp_xmit and ndo_xdp_flush
>   instead of introducing another NDO in the form of
>   ndo_xdp_xmit_xsk. The first reason behind our ineptitude to be able
>   to accomplish this is that if both paths use ndo_xdp_xmit, they will
>   create a race as ndo_xdp_xmit currently does not contain any
>   locking. How to implement some type of mutual exclusion here without
>   resorting to slowing down the NDO with a lock? The second problem is
>   that the XDP_REDIRECT code implicitly assumes that core id = queue
>   id. AF_XDP, on the other hand, explicitly specifies a queue id that
>   has nothing to do with the core id (the application can run on any
>   core). How to reconcile these two views in one ndo? If these two
>   problems can be solved, then we would still need to introduce a
>   completion callback and a get_packet callback, but this seems to be
>   less challenging. This would also make it possible to run TX in the
>   XDP_DRV mode (with the default page allocator).
>
> * What should the buffer allocator look like and how to make it
>   generic enough so it can be used by all NIC vendors? Would be great
>   if you could take a look at it and come with suggestions. As you can
>   see from the change log, it took some effort to rewire the i40e code
>   to use the buff pool, and we expect the same to be true for many
>   other NICs. Ideas on how to introduce multiple allocator into XDP in
>   a less intrusive way would be highly appreciated. Another question
>   is how to create a buffer pool that gives rise to very little
>   overhead? We do not know if the current one can be optimized to have
>   an acceptable overhead as we have not started any optimization
>   effort. But we will give it a try during the next week or so to see
>   where it leads.
>
> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know this has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free?  Hopefully, the same solution can be used here
>   as in the first problem item in this section.
>
> Caveats with this RFC. In contrast to the last section, we believe we
> have solutions for these but we did not have time to fix them. We
> chose to show you all the code sooner than later, even though
> everything does not work. Sorry.
>
> * This RFC is more immature (read, has more bugs) than the AF_PACKET
>   V4 RFC. Some known mentioned here, others unknown.
>
> * We have done absolutely no optimization to this RFC. There is
>   (hopefully) some substantial low hanging fruit that we could fix
>   once we get to this, to improve XDP_DRV_ZC performance to levels
>   that we are not ashamed of and also bring the i40e driver to the
>   same performance levels it had before our changes, which is a must.
>
> * There is a race in the TX XSK clean up code in the i40e driver that
>   triggers a WARN_ON_ONCE. Clearly a bug that needs to be fixed. It
>   can be triggered by performing ifdown/ifup when the application is
>   running, or when changing the number of queues of the device
>   underneath the hood of the application. As a workaround, please
>   refrain from doing these two things without restarting the
>   application, as not all buffers will be returned in the TX
>   path. This bug can also be triggered when killing the application,
>   but has no negative effect in this case as the process will never
>   execute again.
>
> * Before this RFC, ndo_xdp_xmit triggered by an XDP_REDIRECT to a NIC
>   never modified the page count, so the redirect code could assume
>   that the page would still be valid after the NDO call. With the
>   introduction of the xsk_rcv path that is called as a result of an
>   XDP_REDIRECT to an AF_XDP socket, the page count will be decreased
>   if the page is copied out to user space, since we have no use for it
>   anymore. Our somewhat blunt solution to this is to make sure in the
>   i40e driver that the refcount is never under two. Note though, that
>   with the introduction of the buffer pool, this problem
>   disappears. This also means that XDP_DRV will not work out of the
>   box with a Niantic NIC, since it also needs this modification to
>   work. One question that we have is what should the semantics of
>   ndo_xdp_xmit be? Can we always assume that the page count will never
>   be changed by all possible netdevices that implement this NDO, or
>   should we remove this assumption to gain more device implementation
>   flexibility?
>
> To do:
>
> * Optimize performance. No optimization whatsoever was performed on
>   this RFC, in contrast to the previous one for AF_PACKET V4.
>
> * Kernel load module support.
>
> * Polling has not been implemented yet.
>
> * Optimize the user space sample application. It is simple but naive
>   at this point. The one for AF_PACKET V4 had a number of
>   optimizations that have not been introduced in the AF_XDP version.
>
> * Implement a way to pick the XDP_DRV mode even if XDP_DRV_ZC is
>   available. Would be nice to have for the sample application too.
>
> * Introduce a notifier chain for queue changes (caused by ethtool for
>   example). This would get rid of the error callback that we have at
>   this point.
>
> * Use one NAPI context for RX and another one for TX in i40e. This
>   would make it possible to run RX on one core and TX on another for
>   better performance. Today, they need to share a single core since
>   they share NAPI context.
>
> * Get rid of packet arrays (PA) and convert them to the buffer pool
>   allocator by transferring the necessary PA functionality into the
>   buffer pool. This has only been done for RX in ZC mode, while all
>   the other modes are still using packet arrays. Clearly, having two
>   structures with largely the same information is not a good thing.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to user space.
>
> * Support shared packet buffers
>
> * Support for packets spanning multiple frames
>
> Thanks: Björn and Magnus
>
> Björn Töpel (16):
>   xsk: AF_XDP sockets buildable skeleton
>   xsk: add user memory registration sockopt
>   xsk: added XDP_{R,T}X_RING sockopt and supporting structures
>   bpf: added bpf_xdpsk_redirect
>   net: wire up xsk support in the XDP_REDIRECT path
>   i40e: add support for XDP_REDIRECT
>   samples/bpf: added xdpsock program
>   xsk: add iterator functions to xsk_ring
>   i40e: introduce external allocator support
>   i40e: implemented page recycling buff_pool
>   i40e: start using recycling buff_pool
>   i40e: separated buff_pool interface from i40e implementaion
>   xsk: introduce xsk_buff_pool
>   xdp: added buff_pool support to struct xdp_buff
>   xsk: add support for zero copy Rx
>   i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx
>
> Magnus Karlsson (8):
>   xsk: add bind support and introduce Rx functionality
>   xsk: introduce Tx functionality
>   netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
>   netdevice: added ndo for transmitting a packet from an XDP socket
>   xsk: add support for zero copy Tx
>   i40e: introduced a clean_tx callback function
>   i40e: introduced Tx completion callbacks
>   i40e: Tx support for zero copy allocator
>
>  drivers/net/ethernet/intel/i40e/Makefile         |    3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h           |   24 +
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.c |  580 +++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.h |   15 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c   |    1 -
>  drivers/net/ethernet/intel/i40e/i40e_main.c      |  541 +++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c      |  906 +++++++++--------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h      |  119 ++-
>  include/linux/buff_pool.h                        |  136 +++
>  include/linux/filter.h                           |    3 +-
>  include/linux/netdevice.h                        |   25 +
>  include/linux/socket.h                           |    5 +-
>  include/net/xdp.h                                |    1 +
>  include/net/xdp_sock.h                           |   60 ++
>  include/uapi/linux/bpf.h                         |    6 +-
>  include/uapi/linux/if_xdp.h                      |   72 ++
>  net/Kconfig                                      |    1 +
>  net/Makefile                                     |    1 +
>  net/core/dev.c                                   |   28 +-
>  net/core/filter.c                                |   88 +-
>  net/core/sock.c                                  |   12 +-
>  net/xdp/Kconfig                                  |    7 +
>  net/xdp/Makefile                                 |    1 +
>  net/xdp/xsk.c                                    | 1142 ++++++++++++++++++++++
>  net/xdp/xsk.h                                    |   31 +
>  net/xdp/xsk_buff.h                               |  161 +++
>  net/xdp/xsk_buff_pool.c                          |  225 +++++
>  net/xdp/xsk_buff_pool.h                          |   17 +
>  net/xdp/xsk_packet_array.c                       |   62 ++
>  net/xdp/xsk_packet_array.h                       |  399 ++++++++
>  net/xdp/xsk_ring.c                               |   61 ++
>  net/xdp/xsk_ring.h                               |  419 ++++++++
>  net/xdp/xsk_user_queue.h                         |   24 +
>  samples/bpf/Makefile                             |    4 +
>  samples/bpf/xdpsock_kern.c                       |   11 +
>  samples/bpf/xdpsock_queue.h                      |   62 ++
>  samples/bpf/xdpsock_user.c                       |  642 ++++++++++++
>  security/selinux/hooks.c                         |    4 +-
>  security/selinux/include/classmap.h              |    4 +-
>  tools/testing/selftests/bpf/bpf_helpers.h        |    2 +
>  40 files changed, 5408 insertions(+), 497 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
>  create mode 100644 include/linux/buff_pool.h
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk.h
>  create mode 100644 net/xdp/xsk_buff.h
>  create mode 100644 net/xdp/xsk_buff_pool.c
>  create mode 100644 net/xdp/xsk_buff_pool.h
>  create mode 100644 net/xdp/xsk_packet_array.c
>  create mode 100644 net/xdp/xsk_packet_array.h
>  create mode 100644 net/xdp/xsk_ring.c
>  create mode 100644 net/xdp/xsk_ring.h
>  create mode 100644 net/xdp/xsk_user_queue.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_queue.h
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-02-05 15:05 ` Björn Töpel
@ 2018-02-07 15:54   ` Willem de Bruijn
  2018-02-07 21:28     ` Björn Töpel
  0 siblings, 1 reply; 50+ messages in thread
From: Willem de Bruijn @ 2018-02-07 15:54 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

> We realized, a bit late maybe, that 24 patches is a bit mouthful, so
> let me try to make it more palatable.

Overall, this approach looks great to me.

The patch set incorporates all the feedback from AF_PACKET V4.
At this point I don't have additional high-level interface comments.

As you point out, 24 patches and nearly 6000 changed lines is
quite a bit to ingest. Splitting it up into smaller patch sets will
help give more detailed implementation feedback.

The frame pool and device driver changes are largely independent
of AF_XDP and probably should be resolved first (esp. the
observed regression even without AF_XDP).

As you suggest, it would be great if the need for a separate
xsk_packet_array data structure could be avoided.

Since frames from the same frame pool can be forwarded between
multiple device ports, and thus AF_XDP sockets, the pool should
perhaps be a separate object independent of the sockets. This comment
hints at the awkward situation that arises if it is tied to a
descriptor pair:
> +       /* Check if umem is from this socket, if so do not make
> +        * circular references.
> +        */

Since this is in principle just a large shared memory area, could
it reuse existing BPF map logic?

More extreme, and perhaps unrealistic, is if the descriptor ring
could similarly be a BPF map and the Rx XDP program directly
writes the descriptor, instead of triggering xdp_do_xsk_redirect.
As we discussed before, this would avoid the need to specify a
descriptor format upfront.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 02/24] xsk: add user memory registration sockopt
  2018-01-31 13:53 ` [RFC PATCH 02/24] xsk: add user memory registration sockopt Björn Töpel
@ 2018-02-07 16:00   ` Willem de Bruijn
  2018-02-07 21:39     ` Björn Töpel
  0 siblings, 1 reply; 50+ messages in thread
From: Willem de Bruijn @ 2018-02-07 16:00 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

On Wed, Jan 31, 2018 at 8:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The XDP_MEM_REG socket option allows a process to register a window of
> user space memory to the kernel. This memory will later be used as
> frame data buffer.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---

> +static struct xsk_umem *xsk_mem_reg(u64 addr, u64 size, u32 frame_size,
> +                                   u32 data_headroom)
> +{
> +       unsigned long lock_limit, locked, npages;
> +       int ret = 0;
> +       struct xsk_umem *umem;
> +
> +       if (!can_do_mlock())
> +               return ERR_PTR(-EPERM);
> +
> +       umem = xsk_umem_create(addr, size, frame_size, data_headroom);
> +       if (IS_ERR(umem))
> +               return umem;
> +
> +       npages = PAGE_ALIGN(umem->nframes * umem->frame_size) >> PAGE_SHIFT;
> +
> +       down_write(&current->mm->mmap_sem);
> +
> +       locked = npages + current->mm->pinned_vm;
> +       lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +
> +       if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       if (npages == 0 || npages > UINT_MAX) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       umem->npgs = npages;
> +
> +       ret = xsk_umem_pin_pages(umem);
> +
> +out:
> +       if (ret < 0) {
> +               put_pid(umem->pid);
> +               kfree(umem);
> +       } else {
> +               current->mm->pinned_vm = locked;
> +       }
> +
> +       up_write(&current->mm->mmap_sem);

This limits pinned pages per process. You may want to limit per user
instead. See also mm_account_pinned_pages().
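
Roughly something like this, modeled on mm_account_pinned_pages()
(untested sketch; exact field names may differ):

static int xsk_account_pinned(struct user_struct *user, unsigned long npages)
{
	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long old_pg, new_pg;

	do {
		old_pg = atomic_long_read(&user->locked_vm);
		new_pg = old_pg + npages;
		if (new_pg > lock_limit && !capable(CAP_IPC_LOCK))
			return -ENOMEM;
	} while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
		 old_pg);

	return 0;
}

That way the charge follows the user rather than the process, and the
socket only needs to hold a reference on the user_struct so it can
uncharge on close.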

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (26 preceding siblings ...)
  2018-02-05 15:05 ` Björn Töpel
@ 2018-02-07 17:59 ` Tom Herbert
  2018-02-07 21:38   ` Björn Töpel
  2018-03-26 16:06 ` William Tu
  28 siblings, 1 reply; 50+ messages in thread
From: Tom Herbert @ 2018-02-07 17:59 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, Alexander Duyck, Alexander Duyck,
	john fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Jesse Brandeburg, Anjali Singhai Jain,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All these three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
>
> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed that uses SKBs together with the
> generic XDP support and copies out the data to user space. A fallback
> mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode though requires some extensions
> to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we run into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the xdp socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and when they have been transmitted it uses another
> callback (tx_completion) to signal completion of packets. These
> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> and thus does not clash with the XDP_REDIRECT use of
> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> (without ZC) is currently not supported by TX. Please have a look at
> the challenges section for further discussions.
>
> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> so the user needs to steer the traffic to the zero-copy enabled queue
> pair. Which queue to use, is up to the user.
>
> For an untrusted application, HW packet steering to a specific queue
> pair (the one associated with the application) is a requirement, as
> the application would otherwise be able to see other user space
> processes' packets. If the HW cannot support the required packet
> steering, the XDP_DRV or XDP_SKB modes have to be used as they do not
> expose the NIC's packet buffer to user space; the packets are
> copied into user space from the NIC's packet buffer in the kernel.
>
> There is an xdpsock benchmarking/test application included. Say that
> you would like your UDP traffic from port 4242 to end up in queue 16,
> that we will enable AF_XDP on. Here, we use ethtool for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP 31.3 Mpps
> xdp_rxq_info --action XDP_TX   16.7 Mpps
>
> XDP performance with this RFC i.e. with the buffer allocator:
> XDP_DROP 21.0 Mpps
> XDP_TX   11.9 Mpps
>
> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5
>
> AF_XDP performance:
> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> rxdrop      3.3        11.6         16.9
> txpush      2.2         NA*         21.8
> l2fwd       1.7         NA*         10.4
>
Hi Bjorn,

This is very impressive work, thank you for contributing it!

For these benchmarks, how are the AF_PACKET and AF_XDP numbers to be
compared? For instance, is rxdrop comparable to XDP_DROP and
"xdp_rxq_info --action XDP_DROP"? Given your explanation below, I
believe they are, but it might be better to make that clear up front.

Tom


> * NA since there is no XDP_DRV mode (without ZC) for TX in this RFC,
>   see challenges below.
>
> If we start by comparing XDP_SKB performance with copy mode in
> AF_PACKET V4, we can see that AF_XDP delivers 3-5 times the
> throughput, which is positive. We are also happy with the XDP_DRV
> performance that provides 11.6 Mpps for rxdrop, and should work on any
> driver implementing full XDP support. Now to the problematic part:
> XDP_DRV_ZC. The txpush (TX only) benchmark shows decent results at
> 21.8 Mpps and is better than it was for V4, even though we have spent
> no time optimizing the code in AF_XDP. (We did that in AF_PACKET V4.)
> But the RX performance is sliced by half, which is not good. The
> reason for this is, for the major part, the new buffer allocator which
> is used for RX ZC only (at this point, see todo section). If you take
> a look at the XDP baseline numbers, introducing the buffer pool
> allocator drops the performance by around 30% or 10 Mpps which is
> obviously not acceptable. We clearly need to give this code some
> overdue performance love. But the open question is how much
> overhead it will produce in the end and if this will be
> acceptable. Another thing to note is that V4 provided 33.7 Mpps for
> rxdrop, but with AF_XDP we are quite unlikely to get above the
> XDP_DROP number of 31.3, since we are reusing the XDP infrastructure
> and driver code on the RX side. So in the end, the AF_XDP XDP_DRV_ZC
> numbers will likely be lower than the V4 ZC numbers.
>
> We based this patch set on net-next commit 91e6dd828425 ("ipmr: Fix
> ptrdiff_t print formatting").
>
> Challenges: areas we would really appreciate your help on and that we
> are having substantial problems with.
>
> * We would like to, if possible, use ndo_xdp_xmit and ndo_xdp_flush
>   instead of introducing another NDO in the form of
>   ndo_xdp_xmit_xsk. The first reason we have not been able to
>   accomplish this is that if both paths use ndo_xdp_xmit, they will
>   create a race as ndo_xdp_xmit currently does not contain any
>   locking. How to implement some type of mutual exclusion here without
>   resorting to slowing down the NDO with a lock? The second problem is
>   that the XDP_REDIRECT code implicitly assumes that core id = queue
>   id. AF_XDP, on the other hand, explicitly specifies a queue id that
>   has nothing to do with the core id (the application can run on any
>   core). How to reconcile these two views in one ndo? If these two
>   problems can be solved, then we would still need to introduce a
>   completion callback and a get_packet callback, but this seems to be
>   less challenging. This would also make it possible to run TX in the
>   XDP_DRV mode (with the default page allocator).
>
> * What should the buffer allocator look like and how to make it
>   generic enough so it can be used by all NIC vendors? Would be great
>   if you could take a look at it and come up with suggestions. As you can
>   see from the change log, it took some effort to rewire the i40e code
>   to use the buff pool, and we expect the same to be true for many
>   other NICs. Ideas on how to introduce multiple allocators into XDP in
>   a less intrusive way would be highly appreciated. Another question
>   is how to create a buffer pool that gives rise to very little
>   overhead? We do not know if the current one can be optimized to have
>   an acceptable overhead as we have not started any optimization
>   effort. But we will give it a try during the next week or so to see
>   where it leads.
>
> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know this has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free?  Hopefully, the same solution can be used here
>   as in the first problem item in this section.
>
> Caveats with this RFC. In contrast to the last section, we believe we
> have solutions for these but we did not have time to fix them. We
> chose to show you all the code sooner rather than later, even though
> not everything works. Sorry.
>
> * This RFC is more immature (read, has more bugs) than the AF_PACKET
>   V4 RFC. Some known ones are mentioned here, others are unknown.
>
> * We have done absolutely no optimization to this RFC. There is
>   (hopefully) some substantial low hanging fruit that we could fix
>   once we get to this, to improve XDP_DRV_ZC performance to levels
>   that we are not ashamed of and also bring the i40e driver to the
>   same performance levels it had before our changes, which is a must.
>
> * There is a race in the TX XSK clean up code in the i40e driver that
>   triggers a WARN_ON_ONCE. Clearly a bug that needs to be fixed. It
>   can be triggered by performing ifdown/ifup when the application is
>   running, or when changing the number of queues of the device
>   underneath the hood of the application. As a workaround, please
>   refrain from doing these two things without restarting the
>   application, as not all buffers will be returned in the TX
>   path. This bug can also be triggered when killing the application,
>   but has no negative effect in this case as the process will never
>   execute again.
>
> * Before this RFC, ndo_xdp_xmit triggered by an XDP_REDIRECT to a NIC
>   never modified the page count, so the redirect code could assume
>   that the page would still be valid after the NDO call. With the
>   introduction of the xsk_rcv path that is called as a result of an
>   XDP_REDIRECT to an AF_XDP socket, the page count will be decreased
>   if the page is copied out to user space, since we have no use for it
>   anymore. Our somewhat blunt solution to this is to make sure in the
>   i40e driver that the refcount is never under two. Note though, that
>   with the introduction of the buffer pool, this problem
>   disappears. This also means that XDP_DRV will not work out of the
>   box with a Niantic NIC, since it also needs this modification to
>   work. One question that we have is what should the semantics of
>   ndo_xdp_xmit be? Can we always assume that the page count will never
>   be changed by all possible netdevices that implement this NDO, or
>   should we remove this assumption to gain more device implementation
>   flexibility?
>
> To do:
>
> * Optimize performance. No optimization whatsoever was performed on
>   this RFC, in contrast to the previous one for AF_PACKET V4.
>
> * Kernel loadable module support.
>
> * Polling has not been implemented yet.
>
> * Optimize the user space sample application. It is simple but naive
>   at this point. The one for AF_PACKET V4 had a number of
>   optimizations that have not been introduced in the AF_XDP version.
>
> * Implement a way to pick the XDP_DRV mode even if XDP_DRV_ZC is
>   available. Would be nice to have for the sample application too.
>
> * Introduce a notifier chain for queue changes (caused by ethtool for
>   example). This would get rid of the error callback that we have at
>   this point.
>
> * Use one NAPI context for RX and another one for TX in i40e. This
>   would make it possible to run RX on one core and TX on another for
>   better performance. Today, they need to share a single core since
>   they share NAPI context.
>
> * Get rid of packet arrays (PA) and convert them to the buffer pool
>   allocator by transferring the necessary PA functionality into the
>   buffer pool. This has only been done for RX in ZC mode, while all
>   the other modes are still using packet arrays. Clearly, having two
>   structures with largely the same information is not a good thing.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to user space.
>
> * Support shared packet buffers
>
> * Support for packets spanning multiple frames
>
> Thanks: Björn and Magnus
>
> Björn Töpel (16):
>   xsk: AF_XDP sockets buildable skeleton
>   xsk: add user memory registration sockopt
>   xsk: added XDP_{R,T}X_RING sockopt and supporting structures
>   bpf: added bpf_xdpsk_redirect
>   net: wire up xsk support in the XDP_REDIRECT path
>   i40e: add support for XDP_REDIRECT
>   samples/bpf: added xdpsock program
>   xsk: add iterator functions to xsk_ring
>   i40e: introduce external allocator support
>   i40e: implemented page recycling buff_pool
>   i40e: start using recycling buff_pool
>   i40e: separated buff_pool interface from i40e implementaion
>   xsk: introduce xsk_buff_pool
>   xdp: added buff_pool support to struct xdp_buff
>   xsk: add support for zero copy Rx
>   i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx
>
> Magnus Karlsson (8):
>   xsk: add bind support and introduce Rx functionality
>   xsk: introduce Tx functionality
>   netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
>   netdevice: added ndo for transmitting a packet from an XDP socket
>   xsk: add support for zero copy Tx
>   i40e: introduced a clean_tx callback function
>   i40e: introduced Tx completion callbacks
>   i40e: Tx support for zero copy allocator
>
>  drivers/net/ethernet/intel/i40e/Makefile         |    3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h           |   24 +
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.c |  580 +++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.h |   15 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c   |    1 -
>  drivers/net/ethernet/intel/i40e/i40e_main.c      |  541 +++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c      |  906 +++++++++--------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h      |  119 ++-
>  include/linux/buff_pool.h                        |  136 +++
>  include/linux/filter.h                           |    3 +-
>  include/linux/netdevice.h                        |   25 +
>  include/linux/socket.h                           |    5 +-
>  include/net/xdp.h                                |    1 +
>  include/net/xdp_sock.h                           |   60 ++
>  include/uapi/linux/bpf.h                         |    6 +-
>  include/uapi/linux/if_xdp.h                      |   72 ++
>  net/Kconfig                                      |    1 +
>  net/Makefile                                     |    1 +
>  net/core/dev.c                                   |   28 +-
>  net/core/filter.c                                |   88 +-
>  net/core/sock.c                                  |   12 +-
>  net/xdp/Kconfig                                  |    7 +
>  net/xdp/Makefile                                 |    1 +
>  net/xdp/xsk.c                                    | 1142 ++++++++++++++++++++++
>  net/xdp/xsk.h                                    |   31 +
>  net/xdp/xsk_buff.h                               |  161 +++
>  net/xdp/xsk_buff_pool.c                          |  225 +++++
>  net/xdp/xsk_buff_pool.h                          |   17 +
>  net/xdp/xsk_packet_array.c                       |   62 ++
>  net/xdp/xsk_packet_array.h                       |  399 ++++++++
>  net/xdp/xsk_ring.c                               |   61 ++
>  net/xdp/xsk_ring.h                               |  419 ++++++++
>  net/xdp/xsk_user_queue.h                         |   24 +
>  samples/bpf/Makefile                             |    4 +
>  samples/bpf/xdpsock_kern.c                       |   11 +
>  samples/bpf/xdpsock_queue.h                      |   62 ++
>  samples/bpf/xdpsock_user.c                       |  642 ++++++++++++
>  security/selinux/hooks.c                         |    4 +-
>  security/selinux/include/classmap.h              |    4 +-
>  tools/testing/selftests/bpf/bpf_helpers.h        |    2 +
>  40 files changed, 5408 insertions(+), 497 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
>  create mode 100644 include/linux/buff_pool.h
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk.h
>  create mode 100644 net/xdp/xsk_buff.h
>  create mode 100644 net/xdp/xsk_buff_pool.c
>  create mode 100644 net/xdp/xsk_buff_pool.h
>  create mode 100644 net/xdp/xsk_packet_array.c
>  create mode 100644 net/xdp/xsk_packet_array.h
>  create mode 100644 net/xdp/xsk_ring.c
>  create mode 100644 net/xdp/xsk_ring.h
>  create mode 100644 net/xdp/xsk_user_queue.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_queue.h
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect
  2018-02-05 13:42   ` Jesper Dangaard Brouer
@ 2018-02-07 21:11     ` Björn Töpel
  0 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-02-07 21:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Willem de Bruijn,
	Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

2018-02-05 14:42 GMT+01:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> The bpf_xdpsk_redirect call redirects the XDP context to the XDP
>> socket bound to the receiving queue (if any).
>
> As I explained in-person at FOSDEM, my suggestion is to use the
> bpf-map infrastructure for AF_XDP redirect, but people on this
> upstream mailing list also need a chance to validate my idea ;-)
>
> The important thing to keep in-mind is how we can still maintain a
> SPSC (Single producer Single Consumer) relationship between an
> RX-queue and a userspace consumer-process.
>
> This AF_XDP "FOSDEM" patchset, store the "xsk" (xdp_sock) pointer
> directly in the net_device (_rx[].netdev_rx_queue.xs) structure.  This
> limit each RX-queue to service a single xdp_sock.  It sounds good from
> a SPSC pov, but not very flexible.  With a "xdp_sock_map" we can get
> the flexibility to select among multiple xdp_sock'ets (via XDP
> pre-filter selecting a different map), and still maintain a SPSC
> relationship.  The RX-queue will just have several SPSC relationships
> with the individual xdp_sock's.
>
> This is true for the AF_XDP-copy mode, and requires less driver change.
> For the AF_XDP-zero-copy (ZC) mode drivers need significant changes
> anyhow, and in the ZC case we will have to disallow multiple
> xdp_sock's, which is simply achieved by checking that the xdp_sock
> pointer returned from the map lookup matches the one that userspace
> requested the driver to register its memory for RX-rings from.
>
> The "xdp_sock_map" is an array, where the index correspond to the
> queue_index.  The bpf_redirect_map() ignore the specified index and
> instead use xdp_rxq_info->queue_index in the lookup.
>
> Notice that a bpf-map has no pinned relationship with the device or
> the XDP prog loaded.  Thus, userspace needs to bind() this map to the
> device before traffic can flow, like the proposed bind() on the
> xdp_sock.  This is to establish the SPSC binding.  My proposal is that
> userspace inserts the xdp_sock file-descriptor(s) in the map at the
> queue-index, and the map (which is also just a file-descriptor) is
> bound, maybe via bind(), to a specific device (via the ifindex).  The
> kernel side will then walk the map and take the required actions for
> the xdp_sock's it finds in the map.
>

As we discussed at FOSDEM, I like the idea of using a map. This also
opens up for configuring the AF_XDP sockets via bpf code, like sockmap
does.

I'll have a stab at adding an "xdp_sock_map/xskmap" or similar, and
also extending the cgroup sock_ops to support AF_XDP sockets, so that
the xskmap can be configured from bpf-land.
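
Something along these lines, purely as a sketch in samples/bpf style
(BPF_MAP_TYPE_XSKMAP and its value below are hypothetical, invented for
illustration; neither this RFC nor today's kernel provides them):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define BPF_MAP_TYPE_XSKMAP	99	/* hypothetical map type */

struct bpf_map_def SEC("maps") xsks_map = {
	.type = BPF_MAP_TYPE_XSKMAP,
	.key_size = sizeof(int),	/* RX queue index */
	.value_size = sizeof(int),	/* xdp_sock fd, installed from user space */
	.max_entries = 64,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* Per the proposal, the kernel side would ignore the key and use
	 * xdp_rxq_info->queue_index for the lookup, keeping the SPSC
	 * pairing between an RX queue and its consumer.
	 */
	return bpf_redirect_map(&xsks_map, 0, 0);
}

char _license[] SEC("license") = "GPL";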


Björn

> TX-side is harder, as now multiple xdp_sock's can have the same
> queue-pair ID with the same net_device. But Magnus proposes that this
> can be solved with hardware, as newer NICs have many TX-queues, and the
> queue-pair ID is just an externally visible number, while the kernel-
> internal structure can point to a dedicated TX-queue per xdp_sock.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-02-07 15:54   ` Willem de Bruijn
@ 2018-02-07 21:28     ` Björn Töpel
  2018-02-08 23:16       ` Willem de Bruijn
  0 siblings, 1 reply; 50+ messages in thread
From: Björn Töpel @ 2018-02-07 21:28 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

2018-02-07 16:54 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
>> We realized, a bit late maybe, that 24 patches is a bit mouthful, so
>> let me try to make it more palatable.
>
> Overall, this approach looks great to me.
>

Yay! :-)

> The patch set incorporates all the feedback from AF_PACKET V4.
> At this point I don't have additional high-level interface comments.
>

I have a thought on the socket API. Now, we're registering buffer
memory *to* the kernel, but mmap()ing the Rx/Tx rings *from* the
kernel. I'm leaning towards removing the mmap call, in favor of
registering the rings to the kernel analogous to the XDP_MEM_REG socket
option. We won't guarantee physically contiguous memory for the rings,
but I think we can live with that. Thoughts?
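
For reference, the control path being discussed, as a rough user-space
sketch: register the packet buffer area, create the Rx/Tx descriptor
rings, and bind to a <device, queue id> pair.  The XSK_* constants and
the two structs below are stand-ins with assumed values and layouts;
real code must use the definitions the RFC adds to
include/uapi/linux/if_xdp.h and include/linux/socket.h.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <net/if.h>

#define XSK_AF_XDP	44	/* assumed address family value */
#define XSK_SOL_XDP	283	/* assumed socket level value */
#define XSK_MEM_REG	1	/* assumed sockopt numbers */
#define XSK_RX_RING	2
#define XSK_TX_RING	3

struct xsk_mem_req {		/* assumed layout */
	uint64_t addr;
	uint64_t len;
	uint32_t frame_size;
	uint32_t data_headroom;
};

struct xsk_sockaddr {		/* assumed layout */
	uint16_t family;
	uint32_t ifindex;
	uint32_t queue_id;
};

int xsk_setup(const char *ifname, uint32_t queue_id)
{
	struct xsk_mem_req req = { 0 };
	struct xsk_sockaddr sxdp = { 0 };
	uint32_t ndescs = 1024;
	size_t size = 1UL << 22;
	void *bufs;
	int fd;

	fd = socket(XSK_AF_XDP, SOCK_RAW, 0);
	if (fd < 0)
		return -1;

	/* The packet buffer area is owned by user space and registered. */
	bufs = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bufs == MAP_FAILED)
		return -1;
	req.addr = (uintptr_t)bufs;
	req.len = size;
	req.frame_size = 2048;
	setsockopt(fd, XSK_SOL_XDP, XSK_MEM_REG, &req, sizeof(req));

	/* Rx/Tx descriptor rings; today these are subsequently mmap()ed
	 * from the kernel, which is the call questioned above.
	 */
	setsockopt(fd, XSK_SOL_XDP, XSK_RX_RING, &ndescs, sizeof(ndescs));
	setsockopt(fd, XSK_SOL_XDP, XSK_TX_RING, &ndescs, sizeof(ndescs));

	/* Nothing flows until the bind() to <ifindex, queue id> completes. */
	sxdp.family = XSK_AF_XDP;
	sxdp.ifindex = if_nametoindex(ifname);
	sxdp.queue_id = queue_id;
	return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}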

> As you point out, 24 patches and nearly 6000 changed lines is
> quite a bit to ingest. Splitting up in smaller patch sets will help
> give more detailed implementation feedback.
>
> The frame pool and device driver changes are largely independent
> from AF_XDP and probably should be resolved first (esp. the
> observed regression even without AF_XDP).
>

Yeah, the regression is unacceptable.

Another way is starting with the patches without zero-copy first
(i.e. the copy path), and adding the driver modifications later. That
would be the first 7 patches.

> As you suggest, it would be great if the need for a separate
> xsk_packet_array data structure can be avoided.
>

Yes, we'll address that!

> Since frames from the same frame pool can be forwarded between
> multiple device ports and thus AF_XDP sockets, that should perhaps
> be a separate object independent from the sockets. This comment
> hints at the awkward situation if tied to a descriptor pair:
>
>> +       /* Check if umem is from this socket, if so do not make
>> +        * circular references.
>> +        */
>
> Since this is in principle just a large shared memory area, could
> it reuse existing BPF map logic?
>

Hmm, care to elaborate on your thinking here?

> More extreme, and perhaps unrealistic, is if the descriptor ring
> could similarly be a BPF map and the Rx XDP program directly
> writes the descriptor, instead of triggering xdp_do_xsk_redirect.
> As we discussed before, this would avoid the need to specify a
> descriptor format upfront.

Having the XDP program write back the descriptor to the user space ring is
really something that would be useful (writing virtio-net
descriptors...). I need to think a bit more about this. :-) Please
share your ideas!

Thanks for looking into the patches!


Björn

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-02-07 17:59 ` Tom Herbert
@ 2018-02-07 21:38   ` Björn Töpel
  0 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-02-07 21:38 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	john fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Jesse Brandeburg, Anjali Singhai Jain, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

2018-02-07 18:59 GMT+01:00 Tom Herbert <tom@herbertland.com>:
> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
[...]
>>
>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>> byte packets, generated by commercial packet generator HW that is
>> generating packets at full 40 Gbit/s line rate.
>>
>> XDP baseline numbers without this RFC:
>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>> xdp_rxq_info --action XDP_TX   16.7 Mpps
>>
>> XDP performance with this RFC i.e. with the buffer allocator:
>> XDP_DROP 21.0 Mpps
>> XDP_TX   11.9 Mpps
>>
>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>> Benchmark   V2     V3     V4     V4+ZC
>> rxdrop      0.67   0.73   0.74   33.7
>> txpush      0.98   0.98   0.91   19.6
>> l2fwd       0.66   0.71   0.67   15.5
>>
>> AF_XDP performance:
>> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
>> rxdrop      3.3        11.6         16.9
>> txpush      2.2         NA*         21.8
>> l2fwd       1.7         NA*         10.4
>>
> Hi Bjorn,
>
> This is very impressive work, thank you for contributing it!
>

Thank you for looking at it! :-)

> For these benchmarks, how are the AF_PACKET and AF_XDP numbers to be
> compared? For instance, is rxdrop comparable to XDP_DROP and
> "xdp_rxq_info --action XDP_DROP"? Given your explanation below, I
> believe they are, but it might be better to make that clear up front.
>

Ah, yeah, that was a bit confusing:

"xdp_rxq_info --action XDP_DROP" is doing an XDP_DROP (no buffer
touching) and should be compared to "XDP_DROP". I meant to write
"xdp_rxq_info --action XDP_DROP" instead of "XDP_DROP" for the
second case.

So, what this shows is that the buffer allocation scheme in the patch
set (buff_pool) introduces a pretty severe performance regression (21.0 vs
31.3 Mpps) on the regular XDP (and skb!) path. Not acceptable.

"rxdrop" from AF_PACKET V4 should be compared to "rxdrop" from
AF_XDP. This is dropping a packet in user space, whereas the former is
dropping a packet in XDP/kernel space.

Less confusing?


Björn

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 02/24] xsk: add user memory registration sockopt
  2018-02-07 16:00   ` Willem de Bruijn
@ 2018-02-07 21:39     ` Björn Töpel
  0 siblings, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-02-07 21:39 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

2018-02-07 17:00 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Wed, Jan 31, 2018 at 8:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The XDP_MEM_REG socket option allows a process to register a window of
>> user space memory to the kernel. This memory will later be used as
>> frame data buffer.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>
>> +static struct xsk_umem *xsk_mem_reg(u64 addr, u64 size, u32 frame_size,
>> +                                   u32 data_headroom)
>> +{
>> +       unsigned long lock_limit, locked, npages;
>> +       int ret = 0;
>> +       struct xsk_umem *umem;
>> +
>> +       if (!can_do_mlock())
>> +               return ERR_PTR(-EPERM);
>> +
>> +       umem = xsk_umem_create(addr, size, frame_size, data_headroom);
>> +       if (IS_ERR(umem))
>> +               return umem;
>> +
>> +       npages = PAGE_ALIGN(umem->nframes * umem->frame_size) >> PAGE_SHIFT;
>> +
>> +       down_write(&current->mm->mmap_sem);
>> +
>> +       locked = npages + current->mm->pinned_vm;
>> +       lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +
>> +       if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> +               ret = -ENOMEM;
>> +               goto out;
>> +       }
>> +
>> +       if (npages == 0 || npages > UINT_MAX) {
>> +               ret = -EINVAL;
>> +               goto out;
>> +       }
>> +       umem->npgs = npages;
>> +
>> +       ret = xsk_umem_pin_pages(umem);
>> +
>> +out:
>> +       if (ret < 0) {
>> +               put_pid(umem->pid);
>> +               kfree(umem);
>> +       } else {
>> +               current->mm->pinned_vm = locked;
>> +       }
>> +
>> +       up_write(&current->mm->mmap_sem);
>
> This limits per process. You may want to limit per user. See also
> mm_account_pinned_pages.

Ah, noted! Thanks for pointing that out!
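
A minimal sketch of what per-user accounting along those lines could
look like, reusing the MSG_ZEROCOPY mmpin helpers.  It assumes
mm_account_pinned_pages()/mm_unaccount_pinned_pages() can be called from
the xsk code, and where exactly this would hook into xsk_mem_reg() is
also an assumption:

#include <linux/skbuff.h>	/* struct mmpin and the accounting helpers */

struct xsk_umem_pin {
	struct mmpin mmp;	/* charges the owning user's locked_vm */
};

static int xsk_account_umem(struct xsk_umem_pin *pin, u64 size)
{
	pin->mmp.user = NULL;
	pin->mmp.num_pg = 0;
	/* Checks RLIMIT_MEMLOCK against the user, not the process mm. */
	return mm_account_pinned_pages(&pin->mmp, size);
}

static void xsk_unaccount_umem(struct xsk_umem_pin *pin)
{
	mm_unaccount_pinned_pages(&pin->mmp);
}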

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-02-07 21:28     ` Björn Töpel
@ 2018-02-08 23:16       ` Willem de Bruijn
  0 siblings, 0 replies; 50+ messages in thread
From: Willem de Bruijn @ 2018-02-08 23:16 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

On Wed, Feb 7, 2018 at 4:28 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2018-02-07 16:54 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
>>> We realized, a bit late maybe, that 24 patches is a bit mouthful, so
>>> let me try to make it more palatable.
>>
>> Overall, this approach looks great to me.
>>
>
> Yay! :-)
>
>> The patch set incorporates all the feedback from AF_PACKET V4.
>> At this point I don't have additional high-level interface comments.
>>
>
> I have a thought on the socket API. Now, we're registering buffer
> memory *to* the kernel, but mmap()ing the Rx/Tx rings *from* the
> kernel. I'm leaning towards removing the mmap call, in favor of
> registering the rings to the kernel analogous to the XDP_MEM_REG socket
> option. We won't guarantee physically contiguous memory for the rings,
> but I think we can live with that. Thoughts?
>
>> As you point out, 24 patches and nearly 6000 changed lines is
>> quite a bit to ingest. Splitting up in smaller patch sets will help
>> give more detailed implementation feedback.
>>
>> The frame pool and device driver changes are largely independent
>> from AF_XDP and probably should be resolved first (esp. the
>> observed regression even without AF_XDP).
>>
>
> Yeah, the regression is unacceptable.
>
> Another way is starting with the patches without zero-copy first
> (i.e. the copy path), and adding the driver modifications later. That
> would be the first 7 patches.
>
>> As you suggest, it would be great if the need for a separate
>> xsk_packet_array data structure can be avoided.
>>
>
> Yes, we'll address that!
>
>> Since frames from the same frame pool can be forwarded between
>> multiple device ports and thus AF_XDP sockets, that should perhaps
>> be a separate object independent from the sockets. This comment
>> hints at the awkward situation if tied to a descriptor pair:
>>
>>> +       /* Check if umem is from this socket, if so do not make
>>> +        * circular references.
>>> +        */
>>
>> Since this is in principle just a large shared memory area, could
>> it reuse existing BPF map logic?
>>
>
> Hmm, care to elaborate on your thinking here?

On second thought, that is not workable. I was thinking of reusing
existing mmap support for maps, but that is limited to the perf ring
buffer.

>> More extreme, and perhaps unrealistic, is if the descriptor ring
>> could similarly be a BPF map and the Rx XDP program directly
>> writes the descriptor, instead of triggering xdp_do_xsk_redirect.
>> As we discussed before, this would avoid the need to specify a
>> descriptor format upfront.
>
> Having the XDP program write back the descriptor to the user space ring is
> really something that would be useful (writing virtio-net
> descriptors...).

Yes, that's a great use case. This ties in with Jason Wang's
presentation on XDP with tap and virtio, too.

https://www.netdevconf.org/2.2/slides/wang-vmperformance-talk.pdf

> I need to think a bit more about this. :-) Please
> share your ideas!
>
> Thanks for looking into the patches!
>
>
> Björn

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
                   ` (27 preceding siblings ...)
  2018-02-07 17:59 ` Tom Herbert
@ 2018-03-26 16:06 ` William Tu
  2018-03-26 16:38   ` Jesper Dangaard Brouer
  28 siblings, 1 reply; 50+ messages in thread
From: William Tu @ 2018-03-26 16:06 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
[...]
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP 31.3 Mpps
> xdp_rxq_info --action XDP_TX   16.7 Mpps
>
> XDP performance with this RFC i.e. with the buffer allocator:
> XDP_DROP 21.0 Mpps
> XDP_TX   11.9 Mpps
>
> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5
>
> AF_XDP performance:
> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> rxdrop      3.3        11.6         16.9
> txpush      2.2         NA*         21.8
> l2fwd       1.7         NA*         10.4
>

Hi,
I also did an evaluation of AF_XDP; however, the performance isn't as
good as above.
I'd like to share the result and see if there are some tuning suggestions.

System:
16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode

AF_XDP performance:
Benchmark   XDP_SKB
rxdrop      1.27 Mpps
txpush      0.99 Mpps
l2fwd        0.85 Mpps

NIC configuration:
the command
"ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
doesn't work on my ixgbe driver, so I use ntuple:

ethtool -K enp10s0f0 ntuple on
ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
then
echo 1 > /proc/sys/net/core/bpf_jit_enable
./xdpsock -i enp10s0f0 -r -S --queue=1

I also took a look at the perf results:
For rxdrop:
86.56%  xdpsock xdpsock           [.] main
  9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
  4.23%  xdpsock  xdpsock         [.] xq_enq

For l2fwd:
20.81%  xdpsock xdpsock             [.] main
 10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
  8.46%  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
  6.72%  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
  5.89%  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
  5.74%  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
  4.62%  xdpsock  [kernel.vmlinux]    [k] netif_skb_features
  3.96%  xdpsock  [kernel.vmlinux]    [k] ___slab_alloc
  3.18%  xdpsock  [kernel.vmlinux]    [k] nmi

I observed that the i40e's XDP_SKB result is much better than my ixgbe's result.
I wonder, in XDP_SKB mode, does the driver make a performance difference?
Or is my cpu (E5-2440 v2 @ 1.90GHz) too old?

Thanks
William

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 16:06 ` William Tu
@ 2018-03-26 16:38   ` Jesper Dangaard Brouer
  2018-03-26 21:58     ` William Tu
  2018-03-26 22:54     ` Tushar Dave
  0 siblings, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-26 16:38 UTC (permalink / raw)
  To: William Tu
  Cc: Björn Töpel, magnus.karlsson, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, brouer


On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@gmail.com> wrote:

> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> [...]
> 
> Hi,
> I also did an evaluation of AF_XDP; however, the performance isn't as
> good as above.
> I'd like to share the result and see if there are some tuning suggestions.
> 
> System:
> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode

Hmmm, why is X540-AT2 not able to use XDP natively?

> AF_XDP performance:
> Benchmark   XDP_SKB
> rxdrop      1.27 Mpps
> txpush      0.99 Mpps
> l2fwd        0.85 Mpps

Definitely too low...

What is the performance if you drop packets via iptables?

Command:
 $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP

> NIC configuration:
> the command
> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
> doesn't work on my ixgbe driver, so I use ntuple:
> 
> ethtool -K enp10s0f0 ntuple on
> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
> then
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> ./xdpsock -i enp10s0f0 -r -S --queue=1
> 
> I also took a look at the perf results:
> For rxdrop:
> 86.56%  xdpsock xdpsock           [.] main
>   9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>   4.23%  xdpsock  xdpsock         [.] xq_enq

It looks very strange that you see non-maskable interrupts (NMI) being
this high...

 
> For l2fwd:
>  20.81%  xdpsock xdpsock             [.] main
>  10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range

Oh, clflush_cache_range is being called!
Does your system use an IOMMU?

>   8.46%  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>   6.72%  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>   5.89%  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>   5.74%  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>   4.62%  xdpsock  [kernel.vmlinux]    [k] netif_skb_features
>   3.96%  xdpsock  [kernel.vmlinux]    [k] ___slab_alloc
>   3.18%  xdpsock  [kernel.vmlinux]    [k] nmi

Again high count for NMI ?!?

Maybe you just forgot to tell perf that you want it to decode the
bpf_prog correctly?

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols

Enable via:
 $ sysctl net/core/bpf_jit_kallsyms=1

And use perf report (while BPF is STILL LOADED):

 $ perf report --kallsyms=/proc/kallsyms

E.g. for emailing this you can use this command:

 $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
 

> I observed that the i40e's XDP_SKB result is much better than my ixgbe's result.
> I wonder, in XDP_SKB mode, does the driver make a performance difference?
> Or is my cpu (E5-2440 v2 @ 1.90GHz) too old?

I suspect some setup issue on your system.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 16:38   ` Jesper Dangaard Brouer
@ 2018-03-26 21:58     ` William Tu
  2018-03-27  6:09       ` Björn Töpel
  2018-03-27  9:37       ` Jesper Dangaard Brouer
  2018-03-26 22:54     ` Tushar Dave
  1 sibling, 2 replies; 50+ messages in thread
From: William Tu @ 2018-03-26 21:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, magnus.karlsson, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

Hi Jesper,

Thanks a lot for your prompt reply.

>> Hi,
>> I also did an evaluation of AF_XDP, however the performance isn't as
>> good as above.
>> I'd like to share the result and see if there are some tuning suggestions.
>>
>> System:
>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>
> Hmmm, why is X540-AT2 not able to use XDP natively?

Because I'm only able to use the ixgbe driver for this NIC,
and the AF_XDP patch set only has i40e support?

>
>> AF_XDP performance:
>> Benchmark   XDP_SKB
>> rxdrop      1.27 Mpps
>> txpush      0.99 Mpps
>> l2fwd        0.85 Mpps
>
> Definitely too low...
>
I did another run; the rxdrop result seems better.
Benchmark   XDP_SKB
rxdrop      2.3 Mpps
txpush     1.05 Mpps
l2fwd        0.90 Mpps

> What is the performance if you drop packets via iptables?
>
> Command:
>  $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>
I did
# iptables -t raw -I PREROUTING -p udp -i enp10s0f0 -j DROP
# iptables -nvL -t raw; sleep 10; iptables -nvL -t raw

and I got 2.9 Mpps.

>> NIC configuration:
>> the command
>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>> doesn't work on my ixgbe driver, so I use ntuple:
>>
>> ethtool -K enp10s0f0 ntuple on
>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>> then
>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>
>> I also take a look at perf result:
>> For rxdrop:
>> 86.56%  xdpsock xdpsock           [.] main
>>   9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>   4.23%  xdpsock  xdpsock         [.] xq_enq
>
> It looks very strange that you see non-maskable interrupts (NMI) being
> this high...
>
Yes, that's weird. Looking at the perf annotate output for nmi,
it shows 100% of the time spent on a nop instruction.

>
>> For l2fwd:
>>  20.81%  xdpsock xdpsock             [.] main
>>  10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
>
> Oh, clflush_cache_range is being called!

I thought clflush_cache_range is high because we have many smp_rmb/smp_wmb
calls in the xdpsock queue/ring management userspace code.
(perf shows that 75% of this 10.64% is spent on the mfence instruction.)

> Does your system use an IOMMU?
>
Yes, with CONFIG_INTEL_IOMMU=y,
and I saw some related functions being called (e.g. intel_alloc_iova).

>>   8.46%  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>>   6.72%  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>>   5.89%  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>>   5.74%  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>>   4.62%  xdpsock  [kernel.vmlinux]    [k] netif_skb_features
>>   3.96%  xdpsock  [kernel.vmlinux]    [k] ___slab_alloc
>>   3.18%  xdpsock  [kernel.vmlinux]    [k] nmi
>
> Again high count for NMI ?!?
>
> Maybe you just forgot to tell perf that you want it to decode the
> bpf_prog correctly?
>
> https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>
> Enable via:
>  $ sysctl net/core/bpf_jit_kallsyms=1
>
> And use perf report (while BPF is STILL LOADED):
>
>  $ perf report --kallsyms=/proc/kallsyms
>
> E.g. for emailing this you can use this command:
>
>  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>

Thanks, I followed the steps; the result for l2fwd is:
# Total Lost Samples: 119
#
# Samples: 2K of event 'cycles:ppp'
# Event count (approx.): 25675705627
#
# Overhead  CPU  Command  Shared Object       Symbol
# ........  ...  .......  ..................  ..................................
#
    10.48%  013  xdpsock  xdpsock             [.] main
     9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
     8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
     8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
     7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
     4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
     4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
     4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
     3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
     2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
     2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
     2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
     2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
     2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
     1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
     1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
     1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
     1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
     1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
     1.21%  013  xdpsock  xdpsock             [.] xq_enq
     1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova

And l2fwd under "perf stat" looks OK to me. There are few context
switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.

Performance counter stats for 'CPU(s) 6':
      10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
                24      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                 0      page-faults               #    0.000 K/sec
    22,361,333,647      cycles                    #    2.236 GHz
    13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
    26,251,003,067      instructions              #    1.17  insn per cycle
                                                  #    0.51  stalled cycles per insn
     4,938,921,868      branches                  #  493.853 M/sec
         7,591,739      branch-misses             #    0.15% of all branches

      10.000835769 seconds time elapsed

Will continue investigate...
Thanks
William

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 16:38   ` Jesper Dangaard Brouer
  2018-03-26 21:58     ` William Tu
@ 2018-03-26 22:54     ` Tushar Dave
  2018-03-26 23:03       ` Alexander Duyck
  1 sibling, 1 reply; 50+ messages in thread
From: Tushar Dave @ 2018-03-26 22:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, William Tu
  Cc: Björn Töpel, magnus.karlsson, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang



On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@gmail.com> wrote:
> 
>> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> This RFC introduces a new address family called AF_XDP that is
>>> optimized for high performance packet processing and zero-copy
>>> semantics. Throughput improvements can be up to 20x compared to V2 and
>>> V3 for the micro benchmarks included. Would be great to get your
>>> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
>>> from November last year. The feedback from that RFC submission and the
>>> presentation at NetdevConf in Seoul was to create a new address family
>>> instead of building on top of AF_PACKET. AF_XDP is this new address
>>> family.
>>>
>>> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
>>> level is that TX and RX descriptors are separated from packet
>>> buffers. An RX or TX descriptor points to a data buffer in a packet
>>> buffer area. RX and TX can share the same packet buffer so that a
>>> packet does not have to be copied between RX and TX. Moreover, if a
>>> packet needs to be kept for a while due to a possible retransmit, then
>>> the descriptor that points to that packet buffer can be changed to
>>> point to another buffer and reused right away. This again avoids
>>> copying data.
>>>
>>> The RX and TX descriptor rings are registered with the setsockopts
>>> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
>>> area is allocated by user space and registered with the kernel using
>>> the new XDP_MEM_REG setsockopt. All these three areas are shared
>>> between user space and kernel space. The socket is then bound with a
>>> bind() call to a device and a specific queue id on that device, and it
>>> is not until bind is completed that traffic starts to flow.
>>>
>>> An XDP program can be loaded to direct part of the traffic on that
>>> device and queue id to user space through a new redirect action in an
>>> XDP program called bpf_xdpsk_redirect that redirects a packet up to
>>> the socket in user space. All the other XDP actions work just as
>>> before. Note that the current RFC requires the user to load an XDP
>>> program to get any traffic to user space (for example all traffic to
>>> user space with the one-liner program "return
>>> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
>>> this requirement and sends all traffic from a queue to user space if
>>> an AF_XDP socket is bound to it.
>>>
>>> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
>>> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
>>> is no specific mode called XDP_DRV_ZC). If the driver does not have
>>> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
>>> program, XDP_SKB mode is employed that uses SKBs together with the
>>> generic XDP support and copies out the data to user space. A fallback
>>> mode that works for any network device. On the other hand, if the
>>> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
>>> ndo_xdp_flush), these NDOs, without any modifications, will be used by
>>> the AF_XDP code to provide better performance, but there is still a
>>> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
>>> driver support with the zero-copy user space allocator that provides
>>> even better performance. In this mode, the networking HW (or SW driver
>>> if it is a virtual driver like veth) DMAs/puts packets straight into
>>> the packet buffer that is shared between user space and kernel
>>> space. The RX and TX descriptor queues of the networking HW are NOT
>>> shared to user space. Only the kernel can read and write these and it
>>> is the kernel driver's responsibility to translate these HW specific
>>> descriptors to the HW agnostic ones in the virtual descriptor rings
>>> that user space sees. This way, a malicious user space program cannot
>>> mess with the networking HW. This mode though requires some extensions
>>> to XDP.
>>>
>>> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
>>> buffer pool concept so that the same XDP driver code can be used for
>>> buffers allocated using the page allocator (XDP_DRV), the user-space
>>> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
>>> allocator/cache/recycling mechanism. The ndo_bpf call has also been
>>> extended with two commands for registering and unregistering an XSK
>>> socket and is in the RX case mainly used to communicate some
>>> information about the user-space buffer pool to the driver.
>>>
>>> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
>>> but we run into problems with this (further discussion in the
>>> challenges section) and had to introduce a new NDO called
>>> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
>>> and an explicit queue id that packets should be sent out on. In
>>> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
>>> sent from the xdp socket (associated with the dev and queue
>>> combination that was provided with the NDO call) using a callback
>>> (get_tx_packet), and when they have been transmitted it uses another
>>> callback (tx_completion) to signal completion of packets. These
>>> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
>>> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
>>> and thus does not clash with the XDP_REDIRECT use of
>>> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
>>> (without ZC) is currently not supported by TX. Please have a look at
>>> the challenges section for further discussions.
>>>
>>> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
>>> so the user needs to steer the traffic to the zero-copy enabled queue
>>> pair. Which queue to use, is up to the user.
>>>
>>> For an untrusted application, HW packet steering to a specific queue
>>> pair (the one associated with the application) is a requirement, as
>>> the application would otherwise be able to see other user space
>>> processes' packets. If the HW cannot support the required packet
>>> steering, XDP_DRV or XDP_SKB mode have to be used as they do not
>>> expose the NIC's packet buffer into user space as the packets are
>>> copied into user space from the NIC's packet buffer in the kernel.
>>>
>>> There is a xdpsock benchmarking/test application included. Say that
>>> you would like your UDP traffic from port 4242 to end up in queue 16,
>>> that we will enable AF_XDP on. Here, we use ethtool for this:
>>>
>>>        ethtool -N p3p2 rx-flow-hash udp4 fn
>>>        ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>>            action 16
>>>
>>> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>>>
>>>        samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>>>
>>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>>> can be displayed with "-h", as usual.
>>>
>>> We have run some benchmarks on a dual socket system with two Broadwell
>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>> cores which gives a total of 28, but only two cores are used in these
>>> experiments. One for TR/RX and one for the user space application. The
>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>> Intel I40E 40Gbit/s using the i40e driver.
>>>
>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>> byte packets, generated by commercial packet generator HW that is
>>> generating packets at full 40 Gbit/s line rate.
>>>
>>> XDP baseline numbers without this RFC:
>>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>>> xdp_rxq_info --action XDP_TX   16.7 Mpps
>>>
>>> XDP performance with this RFC i.e. with the buffer allocator:
>>> XDP_DROP 21.0 Mpps
>>> XDP_TX   11.9 Mpps
>>>
>>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>>> Benchmark   V2     V3     V4     V4+ZC
>>> rxdrop      0.67   0.73   0.74   33.7
>>> txpush      0.98   0.98   0.91   19.6
>>> l2fwd       0.66   0.71   0.67   15.5
>>>
>>> AF_XDP performance:
>>> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
>>> rxdrop      3.3        11.6         16.9
>>> txpush      2.2         NA*         21.8
>>> l2fwd       1.7         NA*         10.4
>>>   
>>
>> Hi,
>> I also did an evaluation of AF_XDP, however the performance isn't as
>> good as above.
>> I'd like to share the result and see if there are some tuning suggestions.
>>
>> System:
>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
> 
> Hmmm, why is X540-AT2 not able to use XDP natively?
> 
>> AF_XDP performance:
>> Benchmark   XDP_SKB
>> rxdrop      1.27 Mpps
>> txpush      0.99 Mpps
>> l2fwd        0.85 Mpps
> 
> Definitely too low...
> 
> What is the performance if you drop packets via iptables?
> 
> Command:
>   $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
> 
>> NIC configuration:
>> the command
>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>> doesn't work on my ixgbe driver, so I use ntuple:
>>
>> ethtool -K enp10s0f0 ntuple on
>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>> then
>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>
>> I also take a look at perf result:
>> For rxdrop:
>> 86.56%  xdpsock xdpsock           [.] main
>>    9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>    4.23%  xdpsock  xdpsock         [.] xq_enq
> 
> It looks very strange that you see non-maskable interrupt's (NMI) being
> this high...
> 
>   
>> For l2fwd:
>>   20.81%  xdpsock xdpsock             [.] main
>>   10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
> 
> Oh, clflush_cache_range is being called!
> Do your system use an IOMMU ?

What's the implication here? Should the IOMMU be disabled?
I'm asking because I do see a huge difference while running pktgen tests
for my performance benchmarks, with and without intel_iommu.


-Tushar

> 
>>    8.46%  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>>    6.72%  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>>    5.89%  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>>    5.74%  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>>    4.62%  xdpsock  [kernel.vmlinux]    [k] netif_skb_features
>>    3.96%  xdpsock  [kernel.vmlinux]    [k] ___slab_alloc
>>    3.18%  xdpsock  [kernel.vmlinux]    [k] nmi
> 
> Again high count for NMI ?!?
> 
> Maybe you just forgot to tell perf that you want it to decode the
> bpf_prog correctly?
> 
> https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
> 
> Enable via:
>   $ sysctl net/core/bpf_jit_kallsyms=1
> 
> And use perf report (while BPF is STILL LOADED):
> 
>   $ perf report --kallsyms=/proc/kallsyms
> 
> E.g. for emailing this you can use this command:
> 
>   $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>   
> 
>> I observed that the i40e's XDP_SKB result is much better than my ixgbe's result.
>> I wonder in XDP_SKB mode, does the driver make performance difference?
>> Or my cpu (E5-2440 v2 @ 1.90GHz) is too old?
> 
> I suspect some setup issue on your system.
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 22:54     ` Tushar Dave
@ 2018-03-26 23:03       ` Alexander Duyck
  2018-03-26 23:20         ` Tushar Dave
  2018-03-27  6:30         ` Björn Töpel
  0 siblings, 2 replies; 50+ messages in thread
From: Alexander Duyck @ 2018-03-26 23:03 UTC (permalink / raw)
  To: Tushar Dave
  Cc: Jesper Dangaard Brouer, William Tu, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
>
>
> On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:
>>
>>
>> On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@gmail.com> wrote:
>>
>>> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com>
>>> wrote:
>>>>
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> This RFC introduces a new address family called AF_XDP that is
>>>> optimized for high performance packet processing and zero-copy
>>>> semantics. Throughput improvements can be up to 20x compared to V2 and
>>>> V3 for the micro benchmarks included. Would be great to get your
>>>> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
>>>> from November last year. The feedback from that RFC submission and the
>>>> presentation at NetdevConf in Seoul was to create a new address family
>>>> instead of building on top of AF_PACKET. AF_XDP is this new address
>>>> family.
>>>>
>>>> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
>>>> level is that TX and RX descriptors are separated from packet
>>>> buffers. An RX or TX descriptor points to a data buffer in a packet
>>>> buffer area. RX and TX can share the same packet buffer so that a
>>>> packet does not have to be copied between RX and TX. Moreover, if a
>>>> packet needs to be kept for a while due to a possible retransmit, then
>>>> the descriptor that points to that packet buffer can be changed to
>>>> point to another buffer and reused right away. This again avoids
>>>> copying data.
>>>>
>>>> The RX and TX descriptor rings are registered with the setsockopts
>>>> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
>>>> area is allocated by user space and registered with the kernel using
>>>> the new XDP_MEM_REG setsockopt. All these three areas are shared
>>>> between user space and kernel space. The socket is then bound with a
>>>> bind() call to a device and a specific queue id on that device, and it
>>>> is not until bind is completed that traffic starts to flow.
>>>>
>>>> An XDP program can be loaded to direct part of the traffic on that
>>>> device and queue id to user space through a new redirect action in an
>>>> XDP program called bpf_xdpsk_redirect that redirects a packet up to
>>>> the socket in user space. All the other XDP actions work just as
>>>> before. Note that the current RFC requires the user to load an XDP
>>>> program to get any traffic to user space (for example all traffic to
>>>> user space with the one-liner program "return
>>>> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
>>>> this requirement and sends all traffic from a queue to user space if
>>>> an AF_XDP socket is bound to it.
>>>>
>>>> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
>>>> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
>>>> is no specific mode called XDP_DRV_ZC). If the driver does not have
>>>> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
>>>> program, XDP_SKB mode is employed that uses SKBs together with the
>>>> generic XDP support and copies out the data to user space. A fallback
>>>> mode that works for any network device. On the other hand, if the
>>>> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
>>>> ndo_xdp_flush), these NDOs, without any modifications, will be used by
>>>> the AF_XDP code to provide better performance, but there is still a
>>>> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
>>>> driver support with the zero-copy user space allocator that provides
>>>> even better performance. In this mode, the networking HW (or SW driver
>>>> if it is a virtual driver like veth) DMAs/puts packets straight into
>>>> the packet buffer that is shared between user space and kernel
>>>> space. The RX and TX descriptor queues of the networking HW are NOT
>>>> shared to user space. Only the kernel can read and write these and it
>>>> is the kernel driver's responsibility to translate these HW specific
>>>> descriptors to the HW agnostic ones in the virtual descriptor rings
>>>> that user space sees. This way, a malicious user space program cannot
>>>> mess with the networking HW. This mode though requires some extensions
>>>> to XDP.
>>>>
>>>> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
>>>> buffer pool concept so that the same XDP driver code can be used for
>>>> buffers allocated using the page allocator (XDP_DRV), the user-space
>>>> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
>>>> allocator/cache/recycling mechanism. The ndo_bpf call has also been
>>>> extended with two commands for registering and unregistering an XSK
>>>> socket and is in the RX case mainly used to communicate some
>>>> information about the user-space buffer pool to the driver.
>>>>
>>>> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
>>>> but we run into problems with this (further discussion in the
>>>> challenges section) and had to introduce a new NDO called
>>>> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
>>>> and an explicit queue id that packets should be sent out on. In
>>>> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
>>>> sent from the xdp socket (associated with the dev and queue
>>>> combination that was provided with the NDO call) using a callback
>>>> (get_tx_packet), and when they have been transmitted it uses another
>>>> callback (tx_completion) to signal completion of packets. These
>>>> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
>>>> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
>>>> and thus does not clash with the XDP_REDIRECT use of
>>>> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
>>>> (without ZC) is currently not supported by TX. Please have a look at
>>>> the challenges section for further discussions.
>>>>
>>>> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
>>>> so the user needs to steer the traffic to the zero-copy enabled queue
>>>> pair. Which queue to use, is up to the user.
>>>>
>>>> For an untrusted application, HW packet steering to a specific queue
>>>> pair (the one associated with the application) is a requirement, as
>>>> the application would otherwise be able to see other user space
>>>> processes' packets. If the HW cannot support the required packet
>>>> steering, XDP_DRV or XDP_SKB mode have to be used as they do not
>>>> expose the NIC's packet buffer into user space as the packets are
>>>> copied into user space from the NIC's packet buffer in the kernel.
>>>>
>>>> There is a xdpsock benchmarking/test application included. Say that
>>>> you would like your UDP traffic from port 4242 to end up in queue 16,
>>>> that we will enable AF_XDP on. Here, we use ethtool for this:
>>>>
>>>>        ethtool -N p3p2 rx-flow-hash udp4 fn
>>>>        ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>>>            action 16
>>>>
>>>> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>>>>
>>>>        samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>>>>
>>>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>>>> can be displayed with "-h", as usual.
>>>>
>>>> We have run some benchmarks on a dual socket system with two Broadwell
>>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>>> cores which gives a total of 28, but only two cores are used in these
>>>> experiments. One for TR/RX and one for the user space application. The
>>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>>> Intel I40E 40Gbit/s using the i40e driver.
>>>>
>>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>>> byte packets, generated by commercial packet generator HW that is
>>>> generating packets at full 40 Gbit/s line rate.
>>>>
>>>> XDP baseline numbers without this RFC:
>>>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>>>> xdp_rxq_info --action XDP_TX   16.7 Mpps
>>>>
>>>> XDP performance with this RFC i.e. with the buffer allocator:
>>>> XDP_DROP 21.0 Mpps
>>>> XDP_TX   11.9 Mpps
>>>>
>>>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>>>> Benchmark   V2     V3     V4     V4+ZC
>>>> rxdrop      0.67   0.73   0.74   33.7
>>>> txpush      0.98   0.98   0.91   19.6
>>>> l2fwd       0.66   0.71   0.67   15.5
>>>>
>>>> AF_XDP performance:
>>>> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
>>>> rxdrop      3.3        11.6         16.9
>>>> txpush      2.2         NA*         21.8
>>>> l2fwd       1.7         NA*         10.4
>>>>
>>>
>>>
>>> Hi,
>>> I also did an evaluation of AF_XDP, however the performance isn't as
>>> good as above.
>>> I'd like to share the result and see if there are some tuning
>>> suggestions.
>>>
>>> System:
>>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>>
>>
>> Hmmm, why is X540-AT2 not able to use XDP natively?
>>
>>> AF_XDP performance:
>>> Benchmark   XDP_SKB
>>> rxdrop      1.27 Mpps
>>> txpush      0.99 Mpps
>>> l2fwd        0.85 Mpps
>>
>>
>> Definitely too low...
>>
>> What is the performance if you drop packets via iptables?
>>
>> Command:
>>   $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>>
>>> NIC configuration:
>>> the command
>>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>>> doesn't work on my ixgbe driver, so I use ntuple:
>>>
>>> ethtool -K enp10s0f0 ntuple on
>>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>>> then
>>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>>
>>> I also take a look at perf result:
>>> For rxdrop:
>>> 86.56%  xdpsock xdpsock           [.] main
>>>    9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>>    4.23%  xdpsock  xdpsock         [.] xq_enq
>>
>>
>> It looks very strange that you see non-maskable interrupt's (NMI) being
>> this high...
>>
>>
>>>
>>> For l2fwd:
>>>   20.81%  xdpsock xdpsock             [.] main
>>>   10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
>>
>>
>> Oh, clflush_cache_range is being called!
>> Do your system use an IOMMU ?
>
>
> Whats the implication here. Should IOMMU be disabled?
> I'm asking because I do see a huge difference while running pktgen test for
> my performance benchmarks, with and without intel_iommu.
>
>
> -Tushar

For the Intel parts the IOMMU can be expensive primarily for Tx, since
it should have minimal impact on Rx as long as the Rx pages are
pinned/recycled. I am assuming the same is true here for AF_XDP; Bjorn
can correct me if I am wrong.

Basically the IOMMU can make creating/destroying a DMA mapping really
expensive. The easiest way to work around it in the case of the Intel
IOMMU is to boot with "iommu=pt", which will create an identity mapping
for the host. The downside, though, is that the entire system is then
accessible to the device unless a new mapping is created for it by
assigning it to a new IOMMU domain.
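
To illustrate, a rough sketch with hypothetical driver-style code (not
the actual ixgbe/i40e paths): with the IOMMU on, every map/unmap pair
below has to update the IOMMU page tables, which is the per-packet cost
on Tx. Rx side-steps it by mapping pages once and recycling them.

  #include <linux/dma-mapping.h>

  /* Sketch of a per-packet Tx mapping; dma_map_single()/dma_unmap_single()
   * are where the IOMMU page-table work happens.
   */
  static int example_xmit(struct device *dev, void *data, size_t len)
  {
          dma_addr_t dma = dma_map_single(dev, data, len, DMA_TO_DEVICE);

          if (dma_mapping_error(dev, dma))
                  return -ENOMEM;
          /* ... post Tx descriptor, device DMAs from 'dma' ... */
          dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
          return 0;
  }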

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 23:03       ` Alexander Duyck
@ 2018-03-26 23:20         ` Tushar Dave
  2018-03-28  0:49           ` William Tu
  2018-03-27  6:30         ` Björn Töpel
  1 sibling, 1 reply; 50+ messages in thread
From: Tushar Dave @ 2018-03-26 23:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, William Tu, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang



On 03/26/2018 04:03 PM, Alexander Duyck wrote:
> On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
>>
>>
>> On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@gmail.com> wrote:
>>>
>>>> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com>
>>>> wrote:
>>>>>
>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>
>>>>> This RFC introduces a new address family called AF_XDP that is
>>>>> optimized for high performance packet processing and zero-copy
>>>>> semantics. Throughput improvements can be up to 20x compared to V2 and
>>>>> V3 for the micro benchmarks included. Would be great to get your
>>>>> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
>>>>> from November last year. The feedback from that RFC submission and the
>>>>> presentation at NetdevConf in Seoul was to create a new address family
>>>>> instead of building on top of AF_PACKET. AF_XDP is this new address
>>>>> family.
>>>>>
>>>>> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
>>>>> level is that TX and RX descriptors are separated from packet
>>>>> buffers. An RX or TX descriptor points to a data buffer in a packet
>>>>> buffer area. RX and TX can share the same packet buffer so that a
>>>>> packet does not have to be copied between RX and TX. Moreover, if a
>>>>> packet needs to be kept for a while due to a possible retransmit, then
>>>>> the descriptor that points to that packet buffer can be changed to
>>>>> point to another buffer and reused right away. This again avoids
>>>>> copying data.
>>>>>
>>>>> The RX and TX descriptor rings are registered with the setsockopts
>>>>> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
>>>>> area is allocated by user space and registered with the kernel using
>>>>> the new XDP_MEM_REG setsockopt. All these three areas are shared
>>>>> between user space and kernel space. The socket is then bound with a
>>>>> bind() call to a device and a specific queue id on that device, and it
>>>>> is not until bind is completed that traffic starts to flow.
>>>>>
>>>>> An XDP program can be loaded to direct part of the traffic on that
>>>>> device and queue id to user space through a new redirect action in an
>>>>> XDP program called bpf_xdpsk_redirect that redirects a packet up to
>>>>> the socket in user space. All the other XDP actions work just as
>>>>> before. Note that the current RFC requires the user to load an XDP
>>>>> program to get any traffic to user space (for example all traffic to
>>>>> user space with the one-liner program "return
>>>>> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
>>>>> this requirement and sends all traffic from a queue to user space if
>>>>> an AF_XDP socket is bound to it.
>>>>>
>>>>> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
>>>>> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
>>>>> is no specific mode called XDP_DRV_ZC). If the driver does not have
>>>>> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
>>>>> program, XDP_SKB mode is employed that uses SKBs together with the
>>>>> generic XDP support and copies out the data to user space. A fallback
>>>>> mode that works for any network device. On the other hand, if the
>>>>> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
>>>>> ndo_xdp_flush), these NDOs, without any modifications, will be used by
>>>>> the AF_XDP code to provide better performance, but there is still a
>>>>> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
>>>>> driver support with the zero-copy user space allocator that provides
>>>>> even better performance. In this mode, the networking HW (or SW driver
>>>>> if it is a virtual driver like veth) DMAs/puts packets straight into
>>>>> the packet buffer that is shared between user space and kernel
>>>>> space. The RX and TX descriptor queues of the networking HW are NOT
>>>>> shared to user space. Only the kernel can read and write these and it
>>>>> is the kernel driver's responsibility to translate these HW specific
>>>>> descriptors to the HW agnostic ones in the virtual descriptor rings
>>>>> that user space sees. This way, a malicious user space program cannot
>>>>> mess with the networking HW. This mode though requires some extensions
>>>>> to XDP.
>>>>>
>>>>> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
>>>>> buffer pool concept so that the same XDP driver code can be used for
>>>>> buffers allocated using the page allocator (XDP_DRV), the user-space
>>>>> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
>>>>> allocator/cache/recycling mechanism. The ndo_bpf call has also been
>>>>> extended with two commands for registering and unregistering an XSK
>>>>> socket and is in the RX case mainly used to communicate some
>>>>> information about the user-space buffer pool to the driver.
>>>>>
>>>>> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
>>>>> but we run into problems with this (further discussion in the
>>>>> challenges section) and had to introduce a new NDO called
>>>>> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
>>>>> and an explicit queue id that packets should be sent out on. In
>>>>> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
>>>>> sent from the xdp socket (associated with the dev and queue
>>>>> combination that was provided with the NDO call) using a callback
>>>>> (get_tx_packet), and when they have been transmitted it uses another
>>>>> callback (tx_completion) to signal completion of packets. These
>>>>> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
>>>>> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
>>>>> and thus does not clash with the XDP_REDIRECT use of
>>>>> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
>>>>> (without ZC) is currently not supported by TX. Please have a look at
>>>>> the challenges section for further discussions.
>>>>>
>>>>> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
>>>>> so the user needs to steer the traffic to the zero-copy enabled queue
>>>>> pair. Which queue to use, is up to the user.
>>>>>
>>>>> For an untrusted application, HW packet steering to a specific queue
>>>>> pair (the one associated with the application) is a requirement, as
>>>>> the application would otherwise be able to see other user space
>>>>> processes' packets. If the HW cannot support the required packet
>>>>> steering, XDP_DRV or XDP_SKB mode have to be used as they do not
>>>>> expose the NIC's packet buffer into user space as the packets are
>>>>> copied into user space from the NIC's packet buffer in the kernel.
>>>>>
>>>>> There is a xdpsock benchmarking/test application included. Say that
>>>>> you would like your UDP traffic from port 4242 to end up in queue 16,
>>>>> that we will enable AF_XDP on. Here, we use ethtool for this:
>>>>>
>>>>>         ethtool -N p3p2 rx-flow-hash udp4 fn
>>>>>         ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>>>>             action 16
>>>>>
>>>>> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>>>>>
>>>>>         samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>>>>>
>>>>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>>>>> can be displayed with "-h", as usual.
>>>>>
>>>>> We have run some benchmarks on a dual socket system with two Broadwell
>>>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>>>> cores which gives a total of 28, but only two cores are used in these
>>>>> experiments. One for TR/RX and one for the user space application. The
>>>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>>>> Intel I40E 40Gbit/s using the i40e driver.
>>>>>
>>>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>>>> byte packets, generated by commercial packet generator HW that is
>>>>> generating packets at full 40 Gbit/s line rate.
>>>>>
>>>>> XDP baseline numbers without this RFC:
>>>>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>>>>> xdp_rxq_info --action XDP_TX   16.7 Mpps
>>>>>
>>>>> XDP performance with this RFC i.e. with the buffer allocator:
>>>>> XDP_DROP 21.0 Mpps
>>>>> XDP_TX   11.9 Mpps
>>>>>
>>>>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>>>>> Benchmark   V2     V3     V4     V4+ZC
>>>>> rxdrop      0.67   0.73   0.74   33.7
>>>>> txpush      0.98   0.98   0.91   19.6
>>>>> l2fwd       0.66   0.71   0.67   15.5
>>>>>
>>>>> AF_XDP performance:
>>>>> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
>>>>> rxdrop      3.3        11.6         16.9
>>>>> txpush      2.2         NA*         21.8
>>>>> l2fwd       1.7         NA*         10.4
>>>>>
>>>>
>>>>
>>>> Hi,
>>>> I also did an evaluation of AF_XDP, however the performance isn't as
>>>> good as above.
>>>> I'd like to share the result and see if there are some tuning
>>>> suggestions.
>>>>
>>>> System:
>>>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>>>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>>>
>>>
>>> Hmmm, why is X540-AT2 not able to use XDP natively?
>>>
>>>> AF_XDP performance:
>>>> Benchmark   XDP_SKB
>>>> rxdrop      1.27 Mpps
>>>> txpush      0.99 Mpps
>>>> l2fwd        0.85 Mpps
>>>
>>>
>>> Definitely too low...
>>>
>>> What is the performance if you drop packets via iptables?
>>>
>>> Command:
>>>    $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>>>
>>>> NIC configuration:
>>>> the command
>>>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>>>> doesn't work on my ixgbe driver, so I use ntuple:
>>>>
>>>> ethtool -K enp10s0f0 ntuple on
>>>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>>>> then
>>>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>>>
>>>> I also take a look at perf result:
>>>> For rxdrop:
>>>> 86.56%  xdpsock xdpsock           [.] main
>>>>     9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>>>     4.23%  xdpsock  xdpsock         [.] xq_enq
>>>
>>>
>>> It looks very strange that you see non-maskable interrupt's (NMI) being
>>> this high...
>>>
>>>
>>>>
>>>> For l2fwd:
>>>>    20.81%  xdpsock xdpsock             [.] main
>>>>    10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
>>>
>>>
>>> Oh, clflush_cache_range is being called!
>>> Do your system use an IOMMU ?
>>
>>
>> Whats the implication here. Should IOMMU be disabled?
>> I'm asking because I do see a huge difference while running pktgen test for
>> my performance benchmarks, with and without intel_iommu.
>>
>>
>> -Tushar
> 
> For the Intel parts the IOMMU can be expensive primarily for Tx, since
> it should have minimal impact if the Rx pages are pinned/recycled. I
> am assuming the same is true here for AF_XDP, Bjorn can correct me if
> I am wrong.

Indeed. The Intel IOMMU has the least effect on RX because of the
premap/recycle scheme, but TX DMA map and unmap is really expensive!

> 
> Basically the IOMMU can make creating/destroying a DMA mapping really
> expensive. The easiest way to work around it in the case of the Intel
> IOMMU is to boot with "iommu=pt" which will create an identity mapping
> for the host. The downside is though that you then have the entire
> system accessible to the device unless a new mapping is created for it
> by assigning it to a new IOMMU domain.

Yeah, that's what I would say: if you really want to use the Intel IOMMU
and don't want to take the performance hit, use 'iommu=pt'.

Good to have confirmation from you Alex. Thanks.

BTW, I don't want to distract this thread with an IOMMU discussion;
however, even using 'pt' doesn't give you the same performance numbers
that you get with the Intel IOMMU disabled!

-Tushar

> 
> Thanks.
> 
> - Alex
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 21:58     ` William Tu
@ 2018-03-27  6:09       ` Björn Töpel
  2018-03-27  9:37       ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-03-27  6:09 UTC (permalink / raw)
  To: William Tu
  Cc: Jesper Dangaard Brouer, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

 2018-03-26 23:58 GMT+02:00 William Tu <u9012063@gmail.com>:
> Hi Jesper,
>
> Thanks a lot for your prompt reply.
>
>>> Hi,
>>> I also did an evaluation of AF_XDP, however the performance isn't as
>>> good as above.
>>> I'd like to share the result and see if there are some tuning suggestions.
>>>
>>> System:
>>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>>
>> Hmmm, why is X540-AT2 not able to use XDP natively?
>
> Because I'm only able to use ixgbe driver for this NIC,
> and AF_XDP patch only has i40e support?
>

It's only i40e that supports zero copy. As for native XDP support, only
XDP_REDIRECT support is required, and ixgbe does support XDP_REDIRECT
-- unfortunately, ixgbe still needs a patch to work properly, which is
in net-next: ed93a3987128 ("ixgbe: tweak page counting for
XDP_REDIRECT").

>>
>>> AF_XDP performance:
>>> Benchmark   XDP_SKB
>>> rxdrop      1.27 Mpps
>>> txpush      0.99 Mpps
>>> l2fwd        0.85 Mpps
>>
>> Definitely too low...
>>
> I did another run, the rxdrop seems better.
> Benchmark   XDP_SKB
> rxdrop      2.3 Mpps
> txpush     1.05 Mpps
> l2fwd        0.90 Mpps
>
>> What is the performance if you drop packets via iptables?
>>
>> Command:
>>  $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>>
> I did
> # iptables -t raw -I PREROUTING -p udp -i enp10s0f0 -j DROP
> # iptables -nvL -t raw; sleep 10; iptables -nvL -t raw
>
> and I got 2.9Mpps.
>
>>> NIC configuration:
>>> the command
>>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>>> doesn't work on my ixgbe driver, so I use ntuple:
>>>
>>> ethtool -K enp10s0f0 ntuple on
>>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>>> then
>>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>>
>>> I also take a look at perf result:
>>> For rxdrop:
>>> 86.56%  xdpsock xdpsock           [.] main
>>>   9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>>   4.23%  xdpsock  xdpsock         [.] xq_enq
>>
>> It looks very strange that you see non-maskable interrupt's (NMI) being
>> this high...
>>
> yes, that's weird. Looking at the perf annotate of nmi,
> it shows 100% spent on nop instruction.
>
>>
>>> For l2fwd:
>>>  20.81%  xdpsock xdpsock             [.] main
>>>  10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
>>
>> Oh, clflush_cache_range is being called!
>
> I though clflush_cache_range is high because we have many smp_rmb, smp_wmb
> in the xdpsock queue/ring management userspace code.
> (perf shows that 75% of this 10.64% spent on mfence instruction.)
>
>> Do your system use an IOMMU ?
>>
> Yes.
> With CONFIG_INTEL_IOMMU=y
> and I saw some related functions called (ex: intel_alloc_iova).
>
>>>   8.46%  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>>>   6.72%  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>>>   5.89%  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>>>   5.74%  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>>>   4.62%  xdpsock  [kernel.vmlinux]    [k] netif_skb_features
>>>   3.96%  xdpsock  [kernel.vmlinux]    [k] ___slab_alloc
>>>   3.18%  xdpsock  [kernel.vmlinux]    [k] nmi
>>
>> Again high count for NMI ?!?
>>
>> Maybe you just forgot to tell perf that you want it to decode the
>> bpf_prog correctly?
>>
>> https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>>
>> Enable via:
>>  $ sysctl net/core/bpf_jit_kallsyms=1
>>
>> And use perf report (while BPF is STILL LOADED):
>>
>>  $ perf report --kallsyms=/proc/kallsyms
>>
>> E.g. for emailing this you can use this command:
>>
>>  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>>
>
> Thanks, I followed the steps, the result of l2fwd
> # Total Lost Samples: 119
> #
> # Samples: 2K of event 'cycles:ppp'
> # Event count (approx.): 25675705627
> #
> # Overhead  CPU  Command  Shared Object       Symbol
> # ........  ...  .......  ..................  ..................................
> #
>     10.48%  013  xdpsock  xdpsock             [.] main
>      9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
>      8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
>      8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>      7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>      4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
>      4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
>      4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
>      3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
>      2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
>      2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
>      2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
>      2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>      2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>      1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
>      1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
>      1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
>      1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
>      1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
>      1.21%  013  xdpsock  xdpsock             [.] xq_enq
>      1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova
>
> And l2fwd under "perf stat" looks OK to me. There is little context
> switches, cpu
> is fully utilized, 1.17 insn per cycle seems ok.
>
> Performance counter stats for 'CPU(s) 6':
>       10000.787420      cpu-clock (msec)          #    1.000 CPUs
> utilized
>                 24      context-switches          #    0.002 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                  0      page-faults               #    0.000 K/sec
>     22,361,333,647      cycles                    #    2.236 GHz
>     13,458,442,838      stalled-cycles-frontend   #   60.19% frontend
> cycles idle
>     26,251,003,067      instructions              #    1.17  insn per
> cycle
>                                                   #    0.51  stalled
> cycles per insn
>      4,938,921,868      branches                  #  493.853 M/sec
>          7,591,739      branch-misses             #    0.15% of all
> branches
>       10.000835769 seconds time elapsed
>
> Will continue investigate...
> Thanks
> William

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 23:03       ` Alexander Duyck
  2018-03-26 23:20         ` Tushar Dave
@ 2018-03-27  6:30         ` Björn Töpel
  1 sibling, 0 replies; 50+ messages in thread
From: Björn Töpel @ 2018-03-27  6:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Tushar Dave, Jesper Dangaard Brouer, William Tu, Karlsson,
	Magnus, Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z

2018-03-27 1:03 GMT+02:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
[...]
>>
>> Whats the implication here. Should IOMMU be disabled?
>> I'm asking because I do see a huge difference while running pktgen test for
>> my performance benchmarks, with and without intel_iommu.
>>
>>
>> -Tushar
>
> For the Intel parts the IOMMU can be expensive primarily for Tx, since
> it should have minimal impact if the Rx pages are pinned/recycled. I
> am assuming the same is true here for AF_XDP, Bjorn can correct me if
> I am wrong.
>

For the non-zc case the DMA mapping is done in the Tx fast path, so
there, as Alex says, you'll definitely see a performance penalty. For
Rx the page-recycle mechanism (in the Intel drivers) usually avoids
doing any DMA mappings in the fast path.

As for AF_XDP zero-copy mode, we do the DMA mapping up front (avoiding
the single-use mappings) to avoid that performance hit. Keep in mind,
though, that the IOTLB is still in play, and it usually performs worse
under pressure than the non-IOMMU case.
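
Conceptually, the up-front mapping is something like the following
(hypothetical helper, not the actual patch code): map the whole umem
once at registration time, so the fast paths only do address
arithmetic.

  #include <linux/dma-mapping.h>

  /* Sketch: map every page of the registered user-space buffer area
   * once, instead of per packet in the Tx/Rx fast paths.
   */
  static int example_map_umem(struct device *dev, struct page **pages,
                              dma_addr_t *dma, int npages)
  {
          int i;

          for (i = 0; i < npages; i++) {
                  dma[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                        DMA_BIDIRECTIONAL);
                  if (dma_mapping_error(dev, dma[i]))
                          return -ENOMEM;
          }
          return 0;
  }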

> Basically the IOMMU can make creating/destroying a DMA mapping really
> expensive. The easiest way to work around it in the case of the Intel
> IOMMU is to boot with "iommu=pt" which will create an identity mapping
> for the host. The downside is though that you then have the entire
> system accessible to the device unless a new mapping is created for it
> by assigning it to a new IOMMU domain.
>



> Thanks.
>
> - Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 21:58     ` William Tu
  2018-03-27  6:09       ` Björn Töpel
@ 2018-03-27  9:37       ` Jesper Dangaard Brouer
  2018-03-28  0:06         ` William Tu
  1 sibling, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-27  9:37 UTC (permalink / raw)
  To: William Tu
  Cc: Björn Töpel, magnus.karlsson, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, brouer

On Mon, 26 Mar 2018 14:58:02 -0700
William Tu <u9012063@gmail.com> wrote:

> > Again high count for NMI ?!?
> >
> > Maybe you just forgot to tell perf that you want it to decode the
> > bpf_prog correctly?
> >
> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
> >
> > Enable via:
> >  $ sysctl net/core/bpf_jit_kallsyms=1
> >
> > And use perf report (while BPF is STILL LOADED):
> >
> >  $ perf report --kallsyms=/proc/kallsyms
> >
> > E.g. for emailing this you can use this command:
> >
> >  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
> >  
> 
> Thanks, I followed the steps, the result of l2fwd
> # Total Lost Samples: 119
> #
> # Samples: 2K of event 'cycles:ppp'
> # Event count (approx.): 25675705627
> #
> # Overhead  CPU  Command  Shared Object       Symbol
> # ........  ...  .......  ..................  ..................................
> #
>     10.48%  013  xdpsock  xdpsock             [.] main
>      9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
>      8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
>      8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>      7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>      4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
>      4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
>      4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
>      3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
>      2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
>      2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
>      2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
>      2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>      2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>      1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
>      1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
>      1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
>      1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
>      1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
>      1.21%  013  xdpsock  xdpsock             [.] xq_enq
>      1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova
> 

You did use net/core/bpf_jit_kallsyms=1 and the correct perf commands for
decoding bpf_prog, so the perf top #3 'nmi' is likely a real NMI call... which looks wrong.


> And l2fwd under "perf stat" looks OK to me. There is little context
> switches, cpu is fully utilized, 1.17 insn per cycle seems ok.
> 
> Performance counter stats for 'CPU(s) 6':
>   10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
>             24      context-switches          #    0.002 K/sec
>              0      cpu-migrations            #    0.000 K/sec
>              0      page-faults               #    0.000 K/sec
> 22,361,333,647      cycles                    #    2.236 GHz
> 13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
> 26,251,003,067      instructions              #    1.17  insn per cycle
>                                               #    0.51  stalled cycles per insn
>  4,938,921,868      branches                  #  493.853 M/sec
>      7,591,739      branch-misses             #    0.15% of all branches
>   10.000835769 seconds time elapsed

This perf stat also indicates something is wrong.

The 1.17 insn per cycle is NOT okay; it is too low (compared to what I
usually see, e.g. 2.36 insn per cycle).

It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
cycles idle'.  This means your CPU has an issue/bottleneck fetching
instructions. This is explained by Andi Kleen here [1].

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-27  9:37       ` Jesper Dangaard Brouer
@ 2018-03-28  0:06         ` William Tu
  2018-03-28  8:01           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: William Tu @ 2018-03-28  0:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, magnus.karlsson, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 26 Mar 2018 14:58:02 -0700
> William Tu <u9012063@gmail.com> wrote:
>
>> > Again high count for NMI ?!?
>> >
>> > Maybe you just forgot to tell perf that you want it to decode the
>> > bpf_prog correctly?
>> >
>> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>> >
>> > Enable via:
>> >  $ sysctl net/core/bpf_jit_kallsyms=1
>> >
>> > And use perf report (while BPF is STILL LOADED):
>> >
>> >  $ perf report --kallsyms=/proc/kallsyms
>> >
>> > E.g. for emailing this you can use this command:
>> >
>> >  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>> >
>>
>> Thanks, I followed the steps, the result of l2fwd
>> # Total Lost Samples: 119
>> #
>> # Samples: 2K of event 'cycles:ppp'
>> # Event count (approx.): 25675705627
>> #
>> # Overhead  CPU  Command  Shared Object       Symbol
>> # ........  ...  .......  ..................  ..................................
>> #
>>     10.48%  013  xdpsock  xdpsock             [.] main
>>      9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
>>      8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
>>      8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>>      7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>>      4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
>>      4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
>>      4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
>>      3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
>>      2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
>>      2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
>>      2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
>>      2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>>      2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>>      1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
>>      1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
>>      1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
>>      1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
>>      1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
>>      1.21%  013  xdpsock  xdpsock             [.] xq_enq
>>      1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova
>>
>
> You did use net/core/bpf_jit_kallsyms=1 and correct perf commands decoding of
> bpf_prog, so the perf top#3 'nmi' is likely a real NMI call... which looks wrong.
>
Thanks, you're right. Let me dig more on this NMI behavior.
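
One place to start (a sketch; which source is responsible is an open
question, not something measured yet): /proc/interrupts breaks the NMIs
down per CPU, and comparing the NMI and PMI rows hints at whether they
are mostly PMU (perf/watchdog) driven:

  $ grep -E 'NMI|PMI' /proc/interrupts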

>
>> And l2fwd under "perf stat" looks OK to me. There is little context
>> switches, cpu is fully utilized, 1.17 insn per cycle seems ok.
>>
>> Performance counter stats for 'CPU(s) 6':
>>   10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
>>             24      context-switches          #    0.002 K/sec
>>              0      cpu-migrations            #    0.000 K/sec
>>              0      page-faults               #    0.000 K/sec
>> 22,361,333,647      cycles                    #    2.236 GHz
>> 13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
>> 26,251,003,067      instructions              #    1.17  insn per cycle
>>                                               #    0.51  stalled cycles per insn
>>  4,938,921,868      branches                  #  493.853 M/sec
>>      7,591,739      branch-misses             #    0.15% of all branches
>>   10.000835769 seconds time elapsed
>
> This perf stat also indicates something is wrong.
>
> The 1.17 insn per cycle is NOT okay; it is too low compared to what I
> usually see (e.g. 2.36 insn per cycle).
>
> It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> cycles idle'.   This means your CPU has an issue/bottleneck fetching
> instructions, as explained by Andi Kleen here [1].
>
> [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
>
Thanks for the link!
It's definitely weird that my frontend-cycle (fetch and decode) stalls
are so high.
I assume the xdpsock code is small and should fit entirely into the
icache.
However, doing another perf stat on xdpsock l2fwd shows

 13,720,109,581      stalled-cycles-frontend   # 60.01% frontend cycles idle       (23.82%)
<not supported>      stalled-cycles-backend
      7,994,837      branch-misses             # 0.16% of all branches             (23.80%)
    996,874,424      bus-cycles                # 99.679 M/sec                      (23.80%)
 18,942,220,445      ref-cycles                # 1894.067 M/sec                    (28.56%)
    100,983,226      LLC-loads                 # 10.097 M/sec                      (23.80%)
      4,897,089      LLC-load-misses           # 4.85% of all LL-cache hits        (23.80%)
     66,659,889      LLC-stores                # 6.665 M/sec                        (9.52%)
          8,373      LLC-store-misses          # 0.837 K/sec                        (9.52%)
    158,178,410      LLC-prefetches            # 15.817 M/sec                       (9.52%)
      3,011,180      LLC-prefetch-misses       # 0.301 M/sec                        (9.52%)
  8,190,383,109      dTLB-loads                # 818.971 M/sec                      (9.52%)
     20,432,204      dTLB-load-misses          # 0.25% of all dTLB cache hits       (9.52%)
  3,729,504,674      dTLB-stores               # 372.920 M/sec                      (9.52%)
        992,231      dTLB-store-misses         # 0.099 M/sec                        (9.52%)
<not supported>      dTLB-prefetches
<not supported>      dTLB-prefetch-misses
         11,619      iTLB-loads                # 0.001 M/sec                        (9.52%)
      1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits  (14.28%)

I have super high iTLB-load-misses.  This is probably the cause of the
high frontend stalls.
Do you know of any way to improve the iTLB hit rate?
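
To sanity-check the "small enough for the icache" assumption above,
something like this should do (the binary path is an assumption; a
typical L1 icache is 32KB):

  $ size samples/bpf/xdpsock     # compare the text column against 32KB of L1i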

Thanks
William

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-26 23:20         ` Tushar Dave
@ 2018-03-28  0:49           ` William Tu
  0 siblings, 0 replies; 50+ messages in thread
From: William Tu @ 2018-03-28  0:49 UTC (permalink / raw)
  To: Tushar Dave
  Cc: Alexander Duyck, Jesper Dangaard Brouer, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

> Indeed. Intel iommu has least effect on RX because of premap/recycle.
> But TX dma map and unmap is really expensive!
>
>>
>> Basically the IOMMU can make creating/destroying a DMA mapping really
>> expensive. The easiest way to work around it in the case of the Intel
>> IOMMU is to boot with "iommu=pt" which will create an identity mapping
>> for the host. The downside is though that you then have the entire
>> system accessible to the device unless a new mapping is created for it
>> by assigning it to a new IOMMU domain.
>
>
> Yeah thats what I would say, If you really want to use intel iommu and
> don't want to hit by performance , use 'iommu=pt'.
>
> Good to have confirmation from you Alex. Thanks.
>

Thanks for the suggestion! Updated performance numbers:

without iommu=pt (posted before)
Benchmark   XDP_SKB
rxdrop      2.3  Mpps
txpush      1.05 Mpps
l2fwd       0.90 Mpps

with iommu=pt (new)
Benchmark   XDP_SKB
rxdrop      2.24 Mpps
txpush      1.54 Mpps
l2fwd       1.23 Mpps

TX indeed shows a better rate, while RX stays roughly the same.
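
(In case someone wants to reproduce this: the switch is a kernel
command line change. The GRUB file and regeneration command below are
assumptions about this test box's distro, not something verified in
this thread.)

  # /etc/default/grub  (append to whatever options are already there)
  GRUB_CMDLINE_LINUX="iommu=pt"
  $ update-grub && reboot
  $ cat /proc/cmdline            # confirm iommu=pt is active after reboot
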
William

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-28  0:06         ` William Tu
@ 2018-03-28  8:01           ` Jesper Dangaard Brouer
  2018-03-28 15:05             ` William Tu
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-28  8:01 UTC (permalink / raw)
  To: William Tu
  Cc: Björn Töpel, magnus.karlsson, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	willemdebruijn.kernel, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, brouer, dendibakh

On Tue, 27 Mar 2018 17:06:50 -0700
William Tu <u9012063@gmail.com> wrote:

> On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Mon, 26 Mar 2018 14:58:02 -0700
> > William Tu <u9012063@gmail.com> wrote:
> >  
> >> > Again high count for NMI ?!?
> >> >
> >> > Maybe you just forgot to tell perf that you want it to decode the
> >> > bpf_prog correctly?
> >> >
> >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
> >> >
> >> > Enable via:
> >> >  $ sysctl net/core/bpf_jit_kallsyms=1
> >> >
> >> > And use perf report (while BPF is STILL LOADED):
> >> >
> >> >  $ perf report --kallsyms=/proc/kallsyms
> >> >
> >> > E.g. for emailing this you can use this command:
> >> >
> >> >  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
> >> >  
> >>
> >> Thanks, I followed the steps, the result of l2fwd
> >> # Total Lost Samples: 119
> >> #
> >> # Samples: 2K of event 'cycles:ppp'
> >> # Event count (approx.): 25675705627
> >> #
> >> # Overhead  CPU  Command  Shared Object       Symbol
> >> # ........  ...  .......  ..................  ..................................
> >> #
> >>     10.48%  013  xdpsock  xdpsock             [.] main
> >>      9.77%  013  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range
> >>      8.45%  013  xdpsock  [kernel.vmlinux]    [k] nmi
> >>      8.07%  013  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
> >>      7.81%  013  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
> >>      4.95%  013  xdpsock  [kernel.vmlinux]    [k] ixgbe_xmit_frame_ring
> >>      4.66%  013  xdpsock  [kernel.vmlinux]    [k] skb_store_bits
> >>      4.39%  013  xdpsock  [kernel.vmlinux]    [k] syscall_return_via_sysret
> >>      3.93%  013  xdpsock  [kernel.vmlinux]    [k] pfn_to_dma_pte
> >>      2.62%  013  xdpsock  [kernel.vmlinux]    [k] __intel_map_single
> >>      2.53%  013  xdpsock  [kernel.vmlinux]    [k] __alloc_skb
> >>      2.36%  013  xdpsock  [kernel.vmlinux]    [k] iommu_no_mapping
> >>      2.21%  013  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
> >>      2.07%  013  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
> >>      1.98%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_node_track_caller
> >>      1.94%  013  xdpsock  [kernel.vmlinux]    [k] ksize
> >>      1.84%  013  xdpsock  [kernel.vmlinux]    [k] validate_xmit_skb_list
> >>      1.62%  013  xdpsock  [kernel.vmlinux]    [k] kmem_cache_alloc_node
> >>      1.48%  013  xdpsock  [kernel.vmlinux]    [k] __kmalloc_reserve.isra.37
> >>      1.21%  013  xdpsock  xdpsock             [.] xq_enq
> >>      1.08%  013  xdpsock  [kernel.vmlinux]    [k] intel_alloc_iova
> >>  
> >
> > You did use net/core/bpf_jit_kallsyms=1 and correct perf commands decoding of
> > bpf_prog, so the perf top#3 'nmi' is likely a real NMI call... which looks wrong.
> >  
> Thanks, you're right. Let me dig more on this NMI behavior.
> 
> >  
> >> And l2fwd under "perf stat" looks OK to me. There is little context
> >> switches, cpu is fully utilized, 1.17 insn per cycle seems ok.
> >>
> >> Performance counter stats for 'CPU(s) 6':
> >>   10000.787420      cpu-clock (msec)          #    1.000 CPUs utilized
> >>             24      context-switches          #    0.002 K/sec
> >>              0      cpu-migrations            #    0.000 K/sec
> >>              0      page-faults               #    0.000 K/sec
> >> 22,361,333,647      cycles                    #    2.236 GHz
> >> 13,458,442,838      stalled-cycles-frontend   #   60.19% frontend cycles idle
> >> 26,251,003,067      instructions              #    1.17  insn per cycle
> >>                                               #    0.51  stalled cycles per insn
> >>  4,938,921,868      branches                  #  493.853 M/sec
> >>      7,591,739      branch-misses             #    0.15% of all branches
> >>   10.000835769 seconds time elapsed  
> >
> > This perf stat also indicates something is wrong.
> >
> > The 1.17 insn per cycle is NOT okay; it is too low compared to what I
> > usually see (e.g. 2.36 insn per cycle).
> >
> > It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> > cycles idle'.   This means your CPU has an issue/bottleneck fetching
> > instructions, as explained by Andi Kleen here [1].
> >
> > [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
> >  
> thanks for the link!
>
> It's definitely weird that my frontend cycle (fetch and decode)
> stalled is so high.
>
> I assume this xdpsock code is small and should all fit into the icache.
> However, doing another perf stat on xdpsock l2fwd shows
> 
>  13,720,109,581      stalled-cycles-frontend   # 60.01% frontend cycles idle       (23.82%)
> <not supported>      stalled-cycles-backend
>       7,994,837      branch-misses             # 0.16% of all branches             (23.80%)
>     996,874,424      bus-cycles                # 99.679 M/sec                      (23.80%)
>  18,942,220,445      ref-cycles                # 1894.067 M/sec                    (28.56%)
>     100,983,226      LLC-loads                 # 10.097 M/sec                      (23.80%)
>       4,897,089      LLC-load-misses           # 4.85% of all LL-cache hits        (23.80%)
>      66,659,889      LLC-stores                # 6.665 M/sec                        (9.52%)
>           8,373      LLC-store-misses          # 0.837 K/sec                        (9.52%)
>     158,178,410      LLC-prefetches            # 15.817 M/sec                       (9.52%)
>       3,011,180      LLC-prefetch-misses       # 0.301 M/sec                        (9.52%)
>   8,190,383,109      dTLB-loads                # 818.971 M/sec                      (9.52%)
>      20,432,204      dTLB-load-misses          # 0.25% of all dTLB cache hits       (9.52%)
>   3,729,504,674      dTLB-stores               # 372.920 M/sec                      (9.52%)
>         992,231      dTLB-store-misses         # 0.099 M/sec                        (9.52%)
> <not supported>      dTLB-prefetches
> <not supported>      dTLB-prefetch-misses
>          11,619      iTLB-loads                # 0.001 M/sec                        (9.52%)
>       1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits  (14.28%)

What was the sample period for this perf stat?

> I have super high iTLB-load-misses. This is probably the cause of high
> frontend stalled.

It looks very strange that your iTLB-loads count is only 11,619, while
the iTLB-load-misses count is much, much higher at 1,874,756.
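
One thing worth ruling out (just a sketch, nothing measured in this
thread): with that long event list the counters get multiplexed (the
trailing "(9.52%)"/"(14.28%)" figures are the fraction of time each
counter was actually scheduled), so the odd iTLB ratio could partly be
a multiplexing artifact.  Counting only the two iTLB events avoids it:

  $ perf stat -C 6 -e iTLB-loads,iTLB-load-misses sleep 10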

> Do you know any way to improve iTLB hit rate?

The xdpsock code should be small enough to fit in the iCache, but it
might be laid out in memory in an unfortunate way.  You could play with
rearranging the C-code (look at the objdump alignments).
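
A minimal sketch of that kind of inspection (the binary path is an
assumption; main and xq_enq are simply the hottest user-space symbols
in your perf report above):

  $ objdump -d samples/bpf/xdpsock | grep -E '<(main|xq_enq)>:'
  # Start addresses ending in 00/40/80/c0 sit on a 64-byte boundary;
  # anything else means the hot code may straddle extra cache lines.
  # gcc's -falign-functions=64 is one knob to experiment with when
  # rearranging the source alone doesn't help.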

If you want to know the details about code alignment issues, and how to
troubleshoot them, you should read this VERY excellent blog post by
Denis Bakhvalov:
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/24] Introducing AF_XDP support
  2018-03-28  8:01           ` Jesper Dangaard Brouer
@ 2018-03-28 15:05             ` William Tu
  0 siblings, 0 replies; 50+ messages in thread
From: William Tu @ 2018-03-28 15:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, dendibakh

Hi Jesper,
Thanks for the comments.

>> I assume this xdpsock code is small and should all fit into the icache.
>> However, doing another perf stat on xdpsock l2fwd shows
>>
>>  13,720,109,581      stalled-cycles-frontend   # 60.01% frontend cycles idle       (23.82%)
>> <not supported>      stalled-cycles-backend
>>       7,994,837      branch-misses             # 0.16% of all branches             (23.80%)
>>     996,874,424      bus-cycles                # 99.679 M/sec                      (23.80%)
>>  18,942,220,445      ref-cycles                # 1894.067 M/sec                    (28.56%)
>>     100,983,226      LLC-loads                 # 10.097 M/sec                      (23.80%)
>>       4,897,089      LLC-load-misses           # 4.85% of all LL-cache hits        (23.80%)
>>      66,659,889      LLC-stores                # 6.665 M/sec                        (9.52%)
>>           8,373      LLC-store-misses          # 0.837 K/sec                        (9.52%)
>>     158,178,410      LLC-prefetches            # 15.817 M/sec                       (9.52%)
>>       3,011,180      LLC-prefetch-misses       # 0.301 M/sec                        (9.52%)
>>   8,190,383,109      dTLB-loads                # 818.971 M/sec                      (9.52%)
>>      20,432,204      dTLB-load-misses          # 0.25% of all dTLB cache hits       (9.52%)
>>   3,729,504,674      dTLB-stores               # 372.920 M/sec                      (9.52%)
>>         992,231      dTLB-store-misses         # 0.099 M/sec                        (9.52%)
>> <not supported>      dTLB-prefetches
>> <not supported>      dTLB-prefetch-misses
>>          11,619      iTLB-loads                # 0.001 M/sec                        (9.52%)
>>       1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits  (14.28%)
>
> What was the sample period for this perf stat?
>
10 seconds.
root@ovs-smartnic:~/net-next/tools/perf# ./perf stat -C 6 sleep 10

>> I have super high iTLB-load-misses. This is probably the cause of high
>> frontend stalled.
>
> It looks very strange that your iTLB-loads are 11,619, while the
> iTLB-load-misses are much much higher 1,874,756.
>
Does it mean the CPU tries to load the code, fails, then loads again
and fails again, so the iTLB-load-misses count ends up much larger than
the iTLB-loads count?
Maybe it's related to the high NMI rate, where the NMI handler clears
my iTLB?
Let me try to remove the NMI interference first.
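
A sketch of that first step (assuming the NMIs come from the
hard-lockup watchdog rather than from perf's own sampling, which still
needs to be confirmed):

  $ sysctl -w kernel.nmi_watchdog=0  # stop the watchdog's periodic PMU NMIs
  $ grep NMI /proc/interrupts        # re-check the per-CPU NMI count afterwards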

>> Do you know any way to improve iTLB hit rate?
>
> The xdpsock code should be small enough to fit in the iCache, but it
> might be layout in memory in an unfortunate way.  You could play with
> rearranging the C-code (look at the objdump alignments).
>
> If you want to know the details about code alignment issue, and how to
> troubleshoot them, you should read this VERY excellent blog post by
> Denis Bakhvalov:
> https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues

Thanks for the link.
William

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2018-03-28 15:06 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 02/24] xsk: add user memory registration sockopt Björn Töpel
2018-02-07 16:00   ` Willem de Bruijn
2018-02-07 21:39     ` Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 03/24] xsk: added XDP_{R,T}X_RING sockopt and supporting structures Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 04/24] xsk: add bind support and introduce Rx functionality Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect Björn Töpel
2018-02-05 13:42   ` Jesper Dangaard Brouer
2018-02-07 21:11     ` Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 06/24] net: wire up xsk support in the XDP_REDIRECT path Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 07/24] xsk: introduce Tx functionality Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 08/24] i40e: add support for XDP_REDIRECT Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 09/24] samples/bpf: added xdpsock program Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 10/24] netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 11/24] netdevice: added ndo for transmitting a packet from an XDP socket Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 12/24] xsk: add iterator functions to xsk_ring Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 13/24] i40e: introduce external allocator support Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 14/24] i40e: implemented page recycling buff_pool Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 15/24] i40e: start using " Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 16/24] i40e: separated buff_pool interface from i40e implementaion Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 17/24] xsk: introduce xsk_buff_pool Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 18/24] xdp: added buff_pool support to struct xdp_buff Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 19/24] xsk: add support for zero copy Rx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 20/24] xsk: add support for zero copy Tx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 21/24] i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 22/24] i40e: introduced a clean_tx callback function Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 23/24] i40e: introduced Tx completion callbacks Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 24/24] i40e: Tx support for zero copy allocator Björn Töpel
2018-02-01 16:42 ` [RFC PATCH 00/24] Introducing AF_XDP support Jesper Dangaard Brouer
2018-02-02 10:31 ` Jesper Dangaard Brouer
2018-02-05 15:05 ` Björn Töpel
2018-02-07 15:54   ` Willem de Bruijn
2018-02-07 21:28     ` Björn Töpel
2018-02-08 23:16       ` Willem de Bruijn
2018-02-07 17:59 ` Tom Herbert
2018-02-07 21:38   ` Björn Töpel
2018-03-26 16:06 ` William Tu
2018-03-26 16:38   ` Jesper Dangaard Brouer
2018-03-26 21:58     ` William Tu
2018-03-27  6:09       ` Björn Töpel
2018-03-27  9:37       ` Jesper Dangaard Brouer
2018-03-28  0:06         ` William Tu
2018-03-28  8:01           ` Jesper Dangaard Brouer
2018-03-28 15:05             ` William Tu
2018-03-26 22:54     ` Tushar Dave
2018-03-26 23:03       ` Alexander Duyck
2018-03-26 23:20         ` Tushar Dave
2018-03-28  0:49           ` William Tu
2018-03-27  6:30         ` Björn Töpel
