* [RFC PATCH v2 00/14] Introducing AF_XDP support
@ 2018-03-27 16:59 Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 01/14] net: initial AF_XDP skeleton Björn Töpel
                   ` (15 more replies)
  0 siblings, 16 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

This RFC introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this v2 version, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This RFC only supports copy-mode for
the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
using the XDP_DRV path. Zero-copy support requires XDP and driver
changes that Jesper Dangaard Brouer is working on. Some of his work is
already on the mailing list for review. We will publish our zero-copy
support for RX and TX on top of his patch sets at a later point in
time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.
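
For illustration only (not part of the patches), creating a socket and
sizing its RX queue could look roughly like the sketch below; the queue
size is an arbitrary example and error handling is omitted:

      /* AF_XDP (44) and SOL_XDP (283) are added in patch 1; define
       * them locally if your libc headers do not carry them yet.
       */
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      int fd = socket(AF_XDP, SOCK_RAW, 0);

      int entries = 128; /* must be a power of two */
      setsockopt(fd, SOL_XDP, XDP_RX_QUEUE, &entries, sizeof(entries));
      /* XDP_TX_QUEUE is set up the same way for the TX path. */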

This new dedicated packet buffer area is called a UMEM. It consists of
a number of equally sized frames, and each frame has a unique frame
id. A descriptor in one of the queues references a frame by
referencing its frame id. The user space allocates memory for this
UMEM using whatever means it feels is most appropriate (malloc, mmap,
huge pages, etc). This memory area is then registered with the kernel
using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
the FILL queue and the COMPLETION queue. The fill queue is used by the
application to send down frame ids for the kernel to fill in with RX
packet data. References to these frames will then appear in the RX
queue of the XSK once they have been received. The completion queue,
on the other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
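
As a rough user-space sketch (fd is the AF_XDP socket from the sketch
above; buffer, frame and queue sizes are arbitrary example values and
error handling is omitted):

      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      #define NUM_FRAMES 1024
      #define FRAME_SIZE 2048 /* power of two, >= 2048, <= PAGE_SIZE */

      void *bufs; /* must be page aligned for XDP_UMEM_REG */
      posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE);

      struct xdp_umem_reg mr = {
              .addr = (__u64)(unsigned long)bufs,
              .len = NUM_FRAMES * FRAME_SIZE,
              .frame_size = FRAME_SIZE,
              .frame_headroom = 0,
      };
      setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

      int entries = NUM_FRAMES; /* power of two */
      setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_QUEUE, &entries, sizeof(entries));
      /* The COMPLETION queue is created analogously for the TX path. */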

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this RFC, all
packet data is copied out to user-space.
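
Continuing the sketch from above, the bind step could look like this
(interface name and queue id match the ethtool example further down;
error handling omitted):

      #include <net/if.h>

      struct sockaddr_xdp sxdp = {
              .sxdp_family = AF_XDP,
              .sxdp_ifindex = if_nametoindex("p3p2"),
              .sxdp_queue_id = 16,
      };
      /* The RX (and/or TX) queue must already be configured. */
      bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));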

A new feature in this RFC is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
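
A sketch of the shared-UMEM case; umem_fd and new_fd are hypothetical
names for the already set up socket and the newly created one:

      /* new_fd has its own RX/TX queues but no UMEM registered; it
       * inherits the UMEM of umem_fd, which must be bound to the same
       * ifindex/queue id.
       */
      struct sockaddr_xdp sxdp = {
              .sxdp_family = AF_XDP,
              .sxdp_ifindex = if_nametoindex("p3p2"),
              .sxdp_queue_id = 16,
              .sxdp_flags = XDP_SHARED_UMEM,
              .sxdp_shared_umem_fd = umem_fd,
      };
      bind(new_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));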

How are packets then distributed between these two XSKs? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
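
A minimal XDP program along these lines, loosely modeled on the
included xdpsock_kern.c sample (map and section names are illustrative;
this one unconditionally redirects every packet to index 0 of the map):

      #include <linux/bpf.h>
      #include "bpf_helpers.h"

      struct bpf_map_def SEC("maps") xsks_map = {
              .type = BPF_MAP_TYPE_XSKMAP,
              .key_size = sizeof(int),
              .value_size = sizeof(int),
              .max_entries = 4,
      };

      SEC("xdp_sock")
      int xdp_sock_prog(struct xdp_md *ctx)
      {
              /* Redirect to the XSK at index 0; if no socket is there,
               * or it is bound to another dev/queue, the packet is
               * dropped.
               */
              return bpf_redirect_map(&xsks_map, 0, 0);
      }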

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
together with the generic XDP support and copies the data out to user
space; this fallback mode works for any network device. On the other
hand, if the driver has support for XDP, it will be used by the AF_XDP
code to provide better performance, but there is still a copy of the
data into user space.

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, which we will enable AF_XDP on. Here, we use ethtool
for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by commercial packet generator HW
running at full 40 Gbit/s line rate.

AF_XDP performance 64 byte packets:
Benchmark   XDP_SKB   XDP_DRV
rxdrop       3.0        9.3  
l2fwd        1.7        NA*  

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.2        3.1  
l2fwd        1.1        NA*  

* NA since we have no support for TX using the XDP_DRV infrastructure
  in this RFC. This is for a future patch set since it involves
  changes to the XDP NDOs. Some of this is being worked on by Jesper
  Dangaard Brouer in his latest patch set on the netdev mailing list.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      30,961,380  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3,289,493   0

Changes in V2:

* Dropped zero-copy support
* Unidirectional queues for better performance and to allow for
  non-symmetric consumer and producer use cases
* Shared packet buffers
* XSKMAP for directing packets with XDP
* Poll() support
* Rebind support
* Changed accounting to be per user as suggested by Willem de Bruijn
* Exterminated packet arrays from the code base
* Some bug fixes and code cleanup

The structure of the patch set is as follows:

Patches 1-2: Basic socket and umem plumbing 
Patches 3-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Sample application

We based this patch set on net-next commit: 5306653850b4 ("sctp:
remove unnecessary asoc in sctp_has_association")

For i40e support, this RFC depends on commit: d9314c474d4f ("i40e: add
support for XDP_REDIRECT")

For ixgbe support, this RFC depends on commit: ed93a3987128 
("ixgbe: tweak page counting for XDP_REDIRECT")

The two Intel NIC patches above are in net-next as of today.

Questions:

* How to deal with cache alignment for uapi when different
  architectures can have different cache line sizes? We have just
  aligned it to 64 bytes for now, which works for many popular
  architectures, but not all. Please advise.

To do:

* Optimize performance

* Kernel selftest

Post-series plan:

* Loadable kernel module support for AF_XDP would be nice. Unclear how to
  achieve this though since our XDP code depends on net/core.

* Support for AF_XDP sockets without an XDP program loaded. In this
  case all the traffic on a queue should go up to the user space
  socket.

* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
  XDP_PASS" for a tcpdump-like functionality.

* And of course getting to zero-copy support in small increments. 

Thanks: Björn and Magnus

Björn Töpel (7):
  net: initial AF_XDP skeleton
  xsk: add user memory registration support sockopt
  xsk: add Rx queue setup and mmap support
  xsk: add Rx receive functions and poll support
  bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  xsk: wire up XDP_DRV side of AF_XDP
  xsk: wire up XDP_SKB side of AF_XDP

Magnus Karlsson (7):
  xsk: add umem fill queue support and mmap
  xsk: add support for bind for Rx
  xsk: add umem completion queue support and mmap
  xsk: add Tx queue setup and mmap support
  xsk: support for Tx
  xsk: statistics support
  samples/bpf: sample application for AF_XDP sockets

 include/linux/bpf.h                 |  26 +
 include/linux/bpf_types.h           |   3 +
 include/linux/filter.h              |   2 +-
 include/linux/socket.h              |   5 +-
 include/net/xdp_sock.h              |  40 ++
 include/uapi/linux/bpf.h            |   1 +
 include/uapi/linux/if_xdp.h         |  86 ++++
 kernel/bpf/Makefile                 |   3 +
 kernel/bpf/verifier.c               |   8 +-
 kernel/bpf/xskmap.c                 | 281 +++++++++++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/dev.c                      |  28 +-
 net/core/filter.c                   |  40 +-
 net/core/sock.c                     |  12 +-
 net/xdp/Kconfig                     |   7 +
 net/xdp/Makefile                    |   2 +
 net/xdp/xdp_umem.c                  | 259 ++++++++++
 net/xdp/xdp_umem.h                  |  64 +++
 net/xdp/xdp_umem_props.h            |  23 +
 net/xdp/xsk.c                       | 710 +++++++++++++++++++++++++++
 net/xdp/xsk_queue.c                 |  59 +++
 net/xdp/xsk_queue.h                 | 297 ++++++++++++
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  10 +
 samples/bpf/xdpsock_kern.c          |  55 +++
 samples/bpf/xdpsock_user.c          | 923 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 29 files changed, 2929 insertions(+), 29 deletions(-)
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 kernel/bpf/xskmap.c
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

-- 
2.14.1


* [RFC PATCH v2 01/14] net: initial AF_XDP skeleton
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 02/14] xsk: add user memory registration support sockopt Björn Töpel
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/socket.h              |  5 ++++-
 net/Kconfig                         |  1 +
 net/core/sock.c                     | 12 ++++++++----
 net/xdp/Kconfig                     |  7 +++++++
 security/selinux/hooks.c            |  4 +++-
 security/selinux/include/classmap.h |  4 +++-
 6 files changed, 26 insertions(+), 7 deletions(-)
 create mode 100644 net/xdp/Kconfig

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 60e01482a9c4..141e527b8615 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -207,8 +207,9 @@ struct ucred {
 				 * PF_SMC protocol family that
 				 * reuses AF_INET address family
 				 */
+#define AF_XDP		44	/* XDP sockets			*/
 
-#define AF_MAX		44	/* For now.. */
+#define AF_MAX		45	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -257,6 +258,7 @@ struct ucred {
 #define PF_KCM		AF_KCM
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
+#define PF_XDP		AF_XDP
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -338,6 +340,7 @@ struct ucred {
 #define SOL_NFC		280
 #define SOL_KCM		281
 #define SOL_TLS		282
+#define SOL_XDP		283
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..308f1a0d265a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -59,6 +59,7 @@ source "net/tls/Kconfig"
 source "net/xfrm/Kconfig"
 source "net/iucv/Kconfig"
 source "net/smc/Kconfig"
+source "net/xdp/Kconfig"
 
 config INET
 	bool "TCP/IP networking"
diff --git a/net/core/sock.c b/net/core/sock.c
index 8cee2920a47f..89e692f62c86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -226,7 +226,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX];
   x "AF_RXRPC" ,	x "AF_ISDN"     ,	x "AF_PHONET"   , \
   x "AF_IEEE802154",	x "AF_CAIF"	,	x "AF_ALG"      , \
   x "AF_NFC"   ,	x "AF_VSOCK"    ,	x "AF_KCM"      , \
-  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_MAX"
+  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_XDP"	, \
+  x "AF_MAX"
 
 static const char *const af_family_key_strings[AF_MAX+1] = {
 	_sock_locks("sk_lock-")
@@ -262,7 +263,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
   "rlock-AF_RXRPC" , "rlock-AF_ISDN"     , "rlock-AF_PHONET"   ,
   "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG"      ,
   "rlock-AF_NFC"   , "rlock-AF_VSOCK"    , "rlock-AF_KCM"      ,
-  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_MAX"
+  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_XDP"      ,
+  "rlock-AF_MAX"
 };
 static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_UNSPEC", "wlock-AF_UNIX"     , "wlock-AF_INET"     ,
@@ -279,7 +281,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_RXRPC" , "wlock-AF_ISDN"     , "wlock-AF_PHONET"   ,
   "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG"      ,
   "wlock-AF_NFC"   , "wlock-AF_VSOCK"    , "wlock-AF_KCM"      ,
-  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_MAX"
+  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_XDP"      ,
+  "wlock-AF_MAX"
 };
 static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_UNSPEC", "elock-AF_UNIX"     , "elock-AF_INET"     ,
@@ -296,7 +299,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_RXRPC" , "elock-AF_ISDN"     , "elock-AF_PHONET"   ,
   "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG"      ,
   "elock-AF_NFC"   , "elock-AF_VSOCK"    , "elock-AF_KCM"      ,
-  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_MAX"
+  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_XDP"      ,
+  "elock-AF_MAX"
 };
 
 /*
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
new file mode 100644
index 000000000000..90e4a7152854
--- /dev/null
+++ b/net/xdp/Kconfig
@@ -0,0 +1,7 @@
+config XDP_SOCKETS
+	bool "XDP sockets"
+	depends on BPF_SYSCALL
+	default n
+	help
+	  XDP sockets allows a channel between XDP programs and
+	  userspace applications.
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index b4d7b6242a40..5082dd6e613f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1436,7 +1436,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 			return SECCLASS_QIPCRTR_SOCKET;
 		case PF_SMC:
 			return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+		case PF_XDP:
+			return SECCLASS_XDP_SOCKET;
+#if PF_MAX > 45
 #error New address family defined, please update this function.
 #endif
 		}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index acdee7795297..e2044cd358bb 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -240,9 +240,11 @@ struct security_class_mapping secclass_map[] = {
 	  { "manage_subnet", NULL } },
 	{ "bpf",
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
+	{ "xdp_socket",
+	  { COMMON_SOCK_PERMS, NULL } },
 	{ NULL }
   };
 
-#if PF_MAX > 44
+#if PF_MAX > 45
 #error New address family defined, please update secclass_map.
 #endif
-- 
2.14.1


* [RFC PATCH v2 02/14] xsk: add user memory registration support sockopt
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 01/14] net: initial AF_XDP skeleton Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap Björn Töpel
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

In this commit the base structure of the AF_XDP address family is set
up. Further, we introduce the ability to register a window of user
memory with the kernel via the XDP_UMEM_REG setsockopt. The memory
window is viewed by an AF_XDP socket as a set of equally large
frames. After a user memory registration all frames are "owned" by the
user application, and not the kernel.

Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/if_xdp.h |  33 ++++++
 net/Makefile                |   1 +
 net/xdp/Makefile            |   2 +
 net/xdp/xdp_umem.c          | 240 ++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h          |  41 ++++++++
 net/xdp/xdp_umem_props.h    |  23 +++++
 net/xdp/xsk.c               | 223 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 563 insertions(+)
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
new file mode 100644
index 000000000000..69ccb3b0a3f2
--- /dev/null
+++ b/include/uapi/linux/if_xdp.h
@@ -0,0 +1,33 @@
+/*
+ * if_xdp: XDP socket user-space interface
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#ifndef _LINUX_IF_XDP_H
+#define _LINUX_IF_XDP_H
+
+#include <linux/types.h>
+
+/* XDP socket options */
+#define XDP_UMEM_REG		  3
+
+struct xdp_umem_reg {
+	__u64 addr; /* Start of packet data area */
+	__u64 len; /* Length of packet data area */
+	__u32 frame_size; /* Frame size */
+	__u32 frame_headroom; /* Frame head room */
+};
+
+#endif /* _LINUX_IF_XDP_H */
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..77aaddedbd29 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -85,3 +85,4 @@ obj-y				+= l3mdev/
 endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
+obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
new file mode 100644
index 000000000000..a5d736640a0f
--- /dev/null
+++ b/net/xdp/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
new file mode 100644
index 000000000000..8f768a7887da
--- /dev/null
+++ b/net/xdp/xdp_umem.c
@@ -0,0 +1,240 @@
+/*
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/mm.h>
+
+#include "xdp_umem.h"
+
+#define XDP_UMEM_MIN_FRAME_SIZE 2048
+
+int xdp_umem_create(struct xdp_umem **umem)
+{
+	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
+
+	if (!(*umem))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void xdp_umem_unpin_pages(struct xdp_umem *umem)
+{
+	unsigned int i;
+
+	if (umem->pgs) {
+		for (i = 0; i < umem->npgs; i++) {
+			struct page *page = umem->pgs[i];
+
+			set_page_dirty_lock(page);
+			put_page(page);
+		}
+
+		kfree(umem->pgs);
+		umem->pgs = NULL;
+	}
+}
+
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+{
+	if (umem->user) {
+		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
+		free_uid(umem->user);
+	}
+}
+
+static void xdp_umem_release(struct xdp_umem *umem)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned long diff;
+
+	if (umem->pgs) {
+		xdp_umem_unpin_pages(umem);
+
+		task = get_pid_task(umem->pid, PIDTYPE_PID);
+		put_pid(umem->pid);
+		if (!task)
+			goto out;
+		mm = get_task_mm(task);
+		put_task_struct(task);
+		if (!mm)
+			goto out;
+
+		diff = umem->size >> PAGE_SHIFT;
+
+		down_write(&mm->mmap_sem);
+		mm->pinned_vm -= diff;
+		up_write(&mm->mmap_sem);
+		mmput(mm);
+		umem->pgs = NULL;
+	}
+
+	xdp_umem_unaccount_pages(umem);
+out:
+	kfree(umem);
+}
+
+void xdp_put_umem(struct xdp_umem *umem)
+{
+	if (!umem)
+		return;
+
+	if (atomic_dec_and_test(&umem->users))
+		xdp_umem_release(umem);
+}
+
+static int xdp_umem_pin_pages(struct xdp_umem *umem)
+{
+	unsigned int gup_flags = FOLL_WRITE;
+	long npgs;
+	int err;
+
+	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
+	if (!umem->pgs)
+		return -ENOMEM;
+
+	npgs = get_user_pages(umem->address, umem->npgs,
+			      gup_flags, &umem->pgs[0], NULL);
+	if (npgs != umem->npgs) {
+		if (npgs >= 0) {
+			umem->npgs = npgs;
+			err = -ENOMEM;
+			goto out_pin;
+		}
+		err = npgs;
+		goto out_pgs;
+	}
+	return 0;
+
+out_pin:
+	xdp_umem_unpin_pages(umem);
+out_pgs:
+	kfree(umem->pgs);
+	umem->pgs = NULL;
+	return err;
+}
+
+static int xdp_umem_account_pages(struct xdp_umem *umem)
+{
+	unsigned long lock_limit, new_npgs, old_npgs;
+
+	if (capable(CAP_IPC_LOCK))
+		return 0;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	umem->user = get_uid(current_user());
+
+	do {
+		old_npgs = atomic_long_read(&umem->user->locked_vm);
+		new_npgs = old_npgs + umem->npgs;
+		if (new_npgs > lock_limit) {
+			free_uid(umem->user);
+			umem->user = NULL;
+			return -ENOBUFS;
+		}
+	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
+				     new_npgs) != old_npgs);
+	return 0;
+}
+
+static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
+{
+	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u64 addr = mr->addr, size = mr->len;
+	unsigned int nframes;
+	int size_chk, err;
+
+	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+		/* Strictly speaking we could support this, if:
+		 * - huge pages, or*
+		 * - using an IOMMU, or
+		 * - making sure the memory area is consecutive
+		 * but for now, we simply say "computer says no".
+		 */
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(frame_size))
+		return -EINVAL;
+
+	if (!PAGE_ALIGNED(addr)) {
+		/* Memory area has to be page size aligned. For
+		 * simplicity, this might change.
+		 */
+		return -EINVAL;
+	}
+
+	if ((addr + size) < addr)
+		return -EINVAL;
+
+	nframes = size / frame_size;
+	if (nframes == 0 || nframes > UINT_MAX)
+		return -EINVAL;
+
+	frame_headroom = ALIGN(frame_headroom, 64);
+
+	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	if (size_chk < 0)
+		return -EINVAL;
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+	umem->size = (size_t)size;
+	umem->address = (unsigned long)addr;
+	umem->props.frame_size = frame_size;
+	umem->props.nframes = nframes;
+	umem->frame_headroom = frame_headroom;
+	umem->npgs = size / PAGE_SIZE;
+	umem->pgs = NULL;
+	umem->user = NULL;
+
+	umem->frame_size_log2 = ilog2(frame_size);
+	umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
+	atomic_set(&umem->users, 1);
+
+	err = xdp_umem_account_pages(umem);
+	if (err)
+		goto out;
+
+	err = xdp_umem_pin_pages(umem);
+	if (err)
+		goto out;
+	return 0;
+
+out:
+	put_pid(umem->pid);
+	return err;
+}
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
+{
+	int err;
+
+	if (!umem)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+
+	err = __xdp_umem_reg(umem, mr);
+
+	up_write(&current->mm->mmap_sem);
+	return err;
+}
+
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
new file mode 100644
index 000000000000..d9bbbb880088
--- /dev/null
+++ b/net/xdp/xdp_umem.h
@@ -0,0 +1,41 @@
+/*
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_H_
+#define XDP_UMEM_H_
+
+#include <linux/mm.h>
+#include <linux/if_xdp.h>
+
+#include "xdp_umem_props.h"
+
+struct xdp_umem {
+	struct pid *pid;
+	struct page **pgs;
+	unsigned long address;
+	size_t size;
+	struct xdp_umem_props props;
+	u32 npgs;
+	u32 frame_headroom;
+	u32 nfpplog2;
+	u32 frame_size_log2;
+	atomic_t users;
+	struct user_struct *user;
+};
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
+void xdp_put_umem(struct xdp_umem *umem);
+int xdp_umem_create(struct xdp_umem **umem);
+
+#endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
new file mode 100644
index 000000000000..a2dc91d34262
--- /dev/null
+++ b/net/xdp/xdp_umem_props.h
@@ -0,0 +1,23 @@
+/*
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_PROPS_H_
+#define XDP_UMEM_PROPS_H_
+
+struct xdp_umem_props {
+	u32 frame_size;
+	u32 nframes;
+};
+
+#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
new file mode 100644
index 000000000000..cc0f26e17baf
--- /dev/null
+++ b/net/xdp/xsk.c
@@ -0,0 +1,223 @@
+/*
+ * XDP sockets
+ *
+ * AF_XDP sockets allows a channel between XDP programs and userspace
+ * applications.
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
+
+#include <linux/if_xdp.h>
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/socket.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+
+#include "xdp_umem.h"
+
+struct xdp_sock {
+	/* struct sock must be the first member of struct xdp_sock */
+	struct sock sk;
+	/* Protects multiple processes in the control path */
+	struct mutex mutex;
+	struct xdp_umem *umem;
+};
+
+static struct xdp_sock *xdp_sk(struct sock *sk)
+{
+	return (struct xdp_sock *)sk;
+}
+
+static int xsk_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct net *net;
+
+	if (!sk)
+		return 0;
+
+	net = sock_net(sk);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, sk->sk_prot, -1);
+	local_bh_enable();
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+
+	sk_refcnt_debug_release(sk);
+	sock_put(sk);
+
+	return 0;
+}
+
+static int xsk_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int err;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case XDP_UMEM_REG:
+	{
+		struct xdp_umem_reg mr;
+		struct xdp_umem *umem;
+
+		if (xs->umem)
+			return -EBUSY;
+
+		if (copy_from_user(&mr, optval, sizeof(mr)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		err = xdp_umem_create(&umem);
+
+		err = xdp_umem_reg(umem, &mr);
+		if (err) {
+			kfree(umem);
+			mutex_unlock(&xs->mutex);
+			return err;
+		}
+
+		/* Make sure umem is ready before it can be seen by others */
+		smp_wmb();
+
+		xs->umem = umem;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -ENOPROTOOPT;
+}
+
+static struct proto xsk_proto = {
+	.name =		"XDP",
+	.owner =	THIS_MODULE,
+	.obj_size =	sizeof(struct xdp_sock),
+};
+
+static const struct proto_ops xsk_proto_ops = {
+	.family =	PF_XDP,
+	.owner =	THIS_MODULE,
+	.release =	xsk_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		sock_no_poll,
+	.ioctl =	sock_no_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	xsk_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	sock_no_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static void xsk_destruct(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		return;
+
+	xdp_put_umem(xs->umem);
+
+	sk_refcnt_debug_dec(sk);
+}
+
+static int xsk_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	struct sock *sk;
+	struct xdp_sock *xs;
+
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
+		return -EPERM;
+	if (sock->type != SOCK_RAW)
+		return -ESOCKTNOSUPPORT;
+
+	if (protocol)
+		return -EPROTONOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+
+	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
+	if (!sk)
+		return -ENOBUFS;
+
+	sock->ops = &xsk_proto_ops;
+
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_XDP;
+
+	sk->sk_destruct = xsk_destruct;
+	sk_refcnt_debug_inc(sk);
+
+	xs = xdp_sk(sk);
+	mutex_init(&xs->mutex);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, &xsk_proto, 1);
+	local_bh_enable();
+
+	return 0;
+}
+
+static const struct net_proto_family xsk_family_ops = {
+	.family = PF_XDP,
+	.create = xsk_create,
+	.owner	= THIS_MODULE,
+};
+
+static int __init xsk_init(void)
+{
+	int err;
+
+	err = proto_register(&xsk_proto, 0 /* no slab */);
+	if (err)
+		goto out;
+
+	err = sock_register(&xsk_family_ops);
+	if (err)
+		goto out_proto;
+
+	return 0;
+
+out_proto:
+	proto_unregister(&xsk_proto);
+out:
+	return err;
+}
+
+fs_initcall(xsk_init);
-- 
2.14.1


* [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 01/14] net: initial AF_XDP skeleton Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 02/14] xsk: add user memory registration support sockopt Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-04-12  2:15   ` Michael S. Tsirkin
  2018-03-27 16:59 ` [RFC PATCH v2 04/14] xsk: add Rx queue setup and mmap support Björn Töpel
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
ask the kernel to allocate a queue (ring buffer) and also mmap it
(XDP_UMEM_PGOFF_FILL_QUEUE) into the process.

The queue is used to explicitly pass ownership of umem frames from the
user process to the kernel. These frames will in a later patch be
filled in with Rx packet data by the kernel.
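
For illustration, using the new setsockopt and mmap offset from user
space might look roughly like this (fd is an AF_XDP socket, the queue
size is an example value, error handling is omitted):

      #include <sys/mman.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      int entries = 1024; /* power of two */
      setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_QUEUE, &entries, sizeof(entries));

      struct xdp_umem_queue *fq;
      fq = mmap(NULL, sizeof(*fq) + entries * sizeof(__u32),
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
                XDP_UMEM_PGOFF_FILL_QUEUE);
      /* fq->ptrs and fq->desc[] are now shared with the kernel; user
       * space produces umem frame ids into desc[].
       */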

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 15 +++++++++++
 net/xdp/Makefile            |  2 +-
 net/xdp/xdp_umem.c          |  5 ++++
 net/xdp/xdp_umem.h          |  2 ++
 net/xdp/xsk.c               | 65 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         | 54 +++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 49 ++++++++++++++++++++++++++++++++++
 7 files changed, 190 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 69ccb3b0a3f2..0de1bbf2c5c7 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -22,6 +22,7 @@
 
 /* XDP socket options */
 #define XDP_UMEM_REG		  3
+#define XDP_UMEM_FILL_QUEUE	  4
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -30,4 +31,18 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+/* Pgoff for mmaping the rings */
+#define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
+
+struct xdp_queue {
+	__u32 head_idx __attribute__((aligned(64)));
+	__u32 tail_idx __attribute__((aligned(64)));
+};
+
+/* Used for the fill and completion queues for buffers */
+struct xdp_umem_queue {
+	struct xdp_queue ptrs;
+	__u32 desc[0] __attribute__((aligned(64)));
+};
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index a5d736640a0f..074fb2b2d51c 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1,2 +1,2 @@
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
 
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 8f768a7887da..f2c0768ca63e 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -66,6 +66,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct mm_struct *mm;
 	unsigned long diff;
 
+	if (umem->fq) {
+		xskq_destroy(umem->fq);
+		umem->fq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index d9bbbb880088..b3538dd2118c 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -18,9 +18,11 @@
 #include <linux/mm.h>
 #include <linux/if_xdp.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem_props.h"
 
 struct xdp_umem {
+	struct xsk_queue *fq;
 	struct pid *pid;
 	struct page **pgs;
 	unsigned long address;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cc0f26e17baf..6ff1d1f3322f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -32,6 +32,7 @@
 #include <linux/netdevice.h>
 #include <net/sock.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem.h"
 
 struct xdp_sock {
@@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+{
+	struct xsk_queue *q;
+
+	if (entries == 0 || *queue || !is_power_of_2(entries))
+		return -EINVAL;
+
+	q = xskq_create(entries);
+	if (!q)
+		return -ENOMEM;
+
+	*queue = q;
+	return 0;
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return 0;
 	}
+	case XDP_UMEM_FILL_QUEUE:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (!xs->umem)
+			return -EINVAL;
+
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->umem->fq;
+		err = xsk_init_queue(entries, q);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	default:
 		break;
 	}
@@ -116,6 +149,36 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_mmap(struct file *file, struct socket *sock,
+		    struct vm_area_struct *vma)
+{
+	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct xdp_sock *xs = xdp_sk(sock->sk);
+	struct xsk_queue *q;
+	unsigned long pfn;
+	struct page *qpg;
+	int err;
+
+	if (!xs->umem)
+		return -EINVAL;
+
+	if (offset == XDP_UMEM_PGOFF_FILL_QUEUE)
+		q = xs->umem->fq;
+	else
+		return -EINVAL;
+
+	qpg = virt_to_head_page(q->ring);
+	if (size > (PAGE_SIZE << compound_order(qpg)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
+	err = remap_pfn_range(vma, vma->vm_start, pfn,
+			      size, vma->vm_page_prot);
+
+	return err;
+}
+
 static struct proto xsk_proto = {
 	.name =		"XDP",
 	.owner =	THIS_MODULE,
@@ -139,7 +202,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	sock_no_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
-	.mmap =		sock_no_mmap,
+	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
 };
 
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
new file mode 100644
index 000000000000..fd4bb06aa112
--- /dev/null
+++ b/net/xdp/xsk_queue.c
@@ -0,0 +1,54 @@
+/*
+ * XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/slab.h>
+
+#include "xsk_queue.h"
+
+struct xsk_queue *xskq_create(u32 nentries)
+{
+	struct xsk_queue *q;
+	gfp_t gfp_flags;
+	size_t size;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return NULL;
+
+	q->nentries = nentries;
+	q->ring_mask = nentries - 1;
+
+	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+		    __GFP_COMP  | __GFP_NORETRY;
+	size = xskq_umem_get_ring_size(q);
+	q->validation = XSK_VALIDATION_RX;
+
+	q->ring = (struct xdp_queue *)__get_free_pages(gfp_flags,
+						       get_order(size));
+	if (!q->ring) {
+		kfree(q);
+		return NULL;
+	}
+
+	return q;
+}
+
+void xskq_destroy(struct xsk_queue *q)
+{
+	if (!q)
+		return;
+
+	page_frag_free(q->ring);
+	kfree(q);
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
new file mode 100644
index 000000000000..fe845a20c153
--- /dev/null
+++ b/net/xdp/xsk_queue.h
@@ -0,0 +1,49 @@
+/*
+ * XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_QUEUE_H
+#define _LINUX_XDP_QUEUE_H
+
+#include <linux/types.h>
+#include <linux/if_xdp.h>
+
+#include "xdp_umem_props.h"
+
+enum xsk_validation {
+	XSK_VALIDATION_RX,	  /* Only address to packet buffer validated */
+};
+
+struct xsk_queue {
+	enum xsk_validation validation;
+	struct xdp_umem_props *umem_props;
+	u32 ring_mask;
+	u32 nentries;
+	u32 iter_head_idx;
+	u32 cached_head;
+	u32 cached_tail;
+	struct xdp_queue *ring;
+	u64 invalid_descs;
+};
+
+/* Functions operating on UMEM queues only */
+
+static inline u32 xskq_umem_get_ring_size(struct xsk_queue *q)
+{
+	return sizeof(struct xdp_umem_queue) + q->nentries * sizeof(u32);
+}
+
+struct xsk_queue *xskq_create(u32 nentries);
+void xskq_destroy(struct xsk_queue *q_ops);
+
+#endif /* _LINUX_XDP_QUEUE_H */
-- 
2.14.1


* [RFC PATCH v2 04/14] xsk: add Rx queue setup and mmap support
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (2 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 05/14] xsk: add support for bind for Rx Björn Töpel
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
a queue where the kernel can pass completed Rx frames to the user
process.

The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
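
For illustration, mapping the Rx queue from user space might look
roughly like this (fd is an AF_XDP socket whose Rx queue was sized via
XDP_RX_QUEUE; error handling omitted):

      #include <sys/mman.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      int entries = 128; /* the value passed to XDP_RX_QUEUE */
      struct xdp_rxtx_queue *rx;

      rx = mmap(NULL, sizeof(*rx) + entries * sizeof(struct xdp_desc),
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
                XDP_PGOFF_RX_QUEUE);
      /* Each completed frame shows up as a struct xdp_desc whose idx
       * refers to a frame in the registered umem.
       */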

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/if_xdp.h | 16 ++++++++++++++++
 net/xdp/xsk.c               | 42 +++++++++++++++++++++++++++++++++---------
 net/xdp/xsk_queue.c         | 11 ++++++++---
 net/xdp/xsk_queue.h         | 11 ++++++++++-
 4 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 0de1bbf2c5c7..118456064e01 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 
 /* XDP socket options */
+#define XDP_RX_QUEUE		  1
 #define XDP_UMEM_REG		  3
 #define XDP_UMEM_FILL_QUEUE	  4
 
@@ -32,13 +33,28 @@ struct xdp_umem_reg {
 };
 
 /* Pgoff for mmaping the rings */
+#define XDP_PGOFF_RX_QUEUE			  0
 #define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
 
+struct xdp_desc {
+	__u32 idx;
+	__u32 len;
+	__u16 offset;
+	__u8 flags;
+	__u8 padding[5];
+};
+
 struct xdp_queue {
 	__u32 head_idx __attribute__((aligned(64)));
 	__u32 tail_idx __attribute__((aligned(64)));
 };
 
+/* Used for the RX and TX queues for packets */
+struct xdp_rxtx_queue {
+	struct xdp_queue ptrs;
+	struct xdp_desc desc[0] __attribute__((aligned(64)));
+};
+
 /* Used for the fill and completion queues for buffers */
 struct xdp_umem_queue {
 	struct xdp_queue ptrs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 6ff1d1f3322f..f82f750cadc2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -38,6 +38,8 @@
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
+	struct xsk_queue *rx;
+	struct net_device *dev;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	struct xdp_umem *umem;
@@ -48,14 +50,15 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
-static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
+			  bool umem_queue)
 {
 	struct xsk_queue *q;
 
 	if (entries == 0 || *queue || !is_power_of_2(entries))
 		return -EINVAL;
 
-	q = xskq_create(entries);
+	q = xskq_create(entries, umem_queue);
 	if (!q)
 		return -ENOMEM;
 
@@ -97,6 +100,22 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return -ENOPROTOOPT;
 
 	switch (optname) {
+	case XDP_RX_QUEUE:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (optlen < sizeof(entries))
+			return -EINVAL;
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->rx;
+		err = xsk_init_queue(entries, q, false);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	case XDP_UMEM_REG:
 	{
 		struct xdp_umem_reg mr;
@@ -138,7 +157,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 		mutex_lock(&xs->mutex);
 		q = &xs->umem->fq;
-		err = xsk_init_queue(entries, q);
+		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
@@ -160,13 +179,17 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 	struct page *qpg;
 	int err;
 
-	if (!xs->umem)
-		return -EINVAL;
+	if (offset == XDP_PGOFF_RX_QUEUE) {
+		q = xs->rx;
+	} else {
+		if (!xs->umem)
+			return -EINVAL;
 
-	if (offset == XDP_UMEM_PGOFF_FILL_QUEUE)
-		q = xs->umem->fq;
-	else
-		return -EINVAL;
+		if (offset == XDP_UMEM_PGOFF_FILL_QUEUE)
+			q = xs->umem->fq;
+		else
+			return -EINVAL;
+	}
 
 	qpg = virt_to_head_page(q->ring);
 	if (size > (PAGE_SIZE << compound_order(qpg)))
@@ -213,6 +236,7 @@ static void xsk_destruct(struct sock *sk)
 	if (!sock_flag(sk, SOCK_DEAD))
 		return;
 
+	xskq_destroy(xs->rx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index fd4bb06aa112..3ce6ef350850 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -16,7 +16,7 @@
 
 #include "xsk_queue.h"
 
-struct xsk_queue *xskq_create(u32 nentries)
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
 {
 	struct xsk_queue *q;
 	gfp_t gfp_flags;
@@ -31,8 +31,13 @@ struct xsk_queue *xskq_create(u32 nentries)
 
 	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
 		    __GFP_COMP  | __GFP_NORETRY;
-	size = xskq_umem_get_ring_size(q);
-	q->validation = XSK_VALIDATION_RX;
+	if (umem_queue) {
+		size = xskq_umem_get_ring_size(q);
+		q->validation = XSK_VALIDATION_RX;
+	} else {
+		size = xskq_rxtx_get_ring_size(q);
+		q->validation = XSK_VALIDATION_TX;
+	}
 
 	q->ring = (struct xdp_queue *)__get_free_pages(gfp_flags,
 						       get_order(size));
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index fe845a20c153..d79b613a9e0a 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -22,6 +22,7 @@
 
 enum xsk_validation {
 	XSK_VALIDATION_RX,	  /* Only address to packet buffer validated */
+	XSK_VALIDATION_TX	  /* Full descriptor is validated */
 };
 
 struct xsk_queue {
@@ -36,6 +37,14 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+/* Functions operating on RXTX queues only */
+
+static inline u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
+{
+	return (sizeof(struct xdp_queue) +
+		q->nentries * sizeof(struct xdp_desc));
+}
+
 /* Functions operating on UMEM queues only */
 
 static inline u32 xskq_umem_get_ring_size(struct xsk_queue *q)
@@ -43,7 +52,7 @@ static inline u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 	return sizeof(struct xdp_umem_queue) + q->nentries * sizeof(u32);
 }
 
-struct xsk_queue *xskq_create(u32 nentries);
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q_ops);
 
 #endif /* _LINUX_XDP_QUEUE_H */
-- 
2.14.1


* [RFC PATCH v2 05/14] xsk: add support for bind for Rx
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (3 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 04/14] xsk: add Rx queue setup and mmap support Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 06/14] xsk: add Rx receive functions and poll support Björn Töpel
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the bind syscall is added. Binding an AF_XDP socket means
associating the socket with a umem, a netdev and a queue index. This
can be done in two ways.

The first way is creating a "socket from scratch": create the umem using
the XDP_UMEM_REG setsockopt and an associated fill queue with
XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
setsockopt. Call bind passing ifindex and queue index ("channel" in
ethtool speak).

The second way to bind a socket is to skip the umem/netdev/queue index
setup and instead pass another, already set up, AF_XDP socket. The new
socket will then have the same umem/netdev/queue index
as the parent so it will share the same umem. You must also set the
flags field in the socket address to XDP_SHARED_UMEM.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  11 ++++
 net/xdp/xdp_umem.c          |   9 ++++
 net/xdp/xdp_umem.h          |   2 +
 net/xdp/xsk.c               | 125 +++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 118456064e01..e3593df7323c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -20,6 +20,17 @@
 
 #include <linux/types.h>
 
+/* Options for the sxdp_flags field */
+#define XDP_SHARED_UMEM 1
+
+struct sockaddr_xdp {
+	__u16 sxdp_family;
+	__u32 sxdp_ifindex;
+	__u32 sxdp_queue_id;
+	__u32 sxdp_shared_umem_fd;
+	__u16 sxdp_flags;
+};
+
 /* XDP socket options */
 #define XDP_RX_QUEUE		  1
 #define XDP_UMEM_REG		  3
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index f2c0768ca63e..96395beb980e 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -97,6 +97,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	kfree(umem);
 }
 
+void xdp_get_umem(struct xdp_umem *umem)
+{
+	atomic_inc(&umem->users);
+}
+
 void xdp_put_umem(struct xdp_umem *umem)
 {
 	if (!umem)
@@ -243,3 +248,7 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	return err;
 }
 
+bool xdp_umem_validate_queues(struct xdp_umem *umem)
+{
+	return umem->fq;
+}
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index b3538dd2118c..ad041b911b38 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -36,7 +36,9 @@ struct xdp_umem {
 	struct user_struct *user;
 };
 
+bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
+void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
 int xdp_umem_create(struct xdp_umem **umem);
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f82f750cadc2..d99a1b830f94 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -43,6 +43,8 @@ struct xdp_sock {
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	struct xdp_umem *umem;
+	u32 ifindex;
+	u16 queue_id;
 };
 
 static struct xdp_sock *xdp_sk(struct sock *sk)
@@ -66,9 +68,18 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 	return 0;
 }
 
+static void __xsk_release(struct xdp_sock *xs)
+{
+	/* Wait for driver to stop using the xdp socket. */
+	synchronize_net();
+
+	dev_put(xs->dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
 	struct net *net;
 
 	if (!sk)
@@ -80,6 +91,11 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	if (xs->dev) {
+		__xsk_release(xs);
+		xs->dev = NULL;
+	}
+
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -89,6 +105,113 @@ static int xsk_release(struct socket *sock)
 	return 0;
 }
 
+static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
+{
+	struct socket *sock;
+
+	*err = -ENOTSOCK;
+	sock = sockfd_lookup(fd, err);
+	if (!sock)
+		return NULL;
+
+	if (sock->sk->sk_family != PF_XDP) {
+		*err = -ENOPROTOOPT;
+		sockfd_put(sock);
+		return NULL;
+	}
+
+	*err = 0;
+	return sock;
+}
+
+static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
+	struct sock *sk = sock->sk;
+	struct net_device *dev, *dev_curr;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct xdp_umem *old_umem = NULL;
+	int err = 0;
+
+	if (addr_len < sizeof(struct sockaddr_xdp))
+		return -EINVAL;
+	if (sxdp->sxdp_family != AF_XDP)
+		return -EINVAL;
+
+	mutex_lock(&xs->mutex);
+	dev_curr = xs->dev;
+	dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex);
+	if (!dev) {
+		err = -ENODEV;
+		goto out_release;
+	}
+
+	if (!xs->rx) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_queue_id >= dev->num_rx_queues) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_flags & XDP_SHARED_UMEM) {
+		struct xdp_sock *umem_xs;
+		struct socket *sock;
+
+		if (xs->umem) {
+			/* We have already our own. */
+			err = -EINVAL;
+			goto out_unlock;
+		}
+
+		sock = xsk_lookup_xsk_from_fd(sxdp->sxdp_shared_umem_fd, &err);
+		if (!sock)
+			goto out_unlock;
+
+		umem_xs = xdp_sk(sock->sk);
+		if (!umem_xs->umem) {
+			/* No umem to inherit. */
+			err = -EBADF;
+			sockfd_put(sock);
+			goto out_unlock;
+		} else if (umem_xs->dev != dev ||
+			   umem_xs->queue_id != sxdp->sxdp_queue_id) {
+			err = -EINVAL;
+			sockfd_put(sock);
+			goto out_unlock;
+		}
+
+		xdp_get_umem(umem_xs->umem);
+		old_umem = xs->umem;
+		xs->umem = umem_xs->umem;
+		sockfd_put(sock);
+	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Rebind? */
+	if (dev_curr && (dev_curr != dev ||
+			 xs->queue_id != sxdp->sxdp_queue_id)) {
+		__xsk_release(xs);
+		if (old_umem)
+			xdp_put_umem(old_umem);
+	}
+
+	xs->dev = dev;
+	xs->ifindex = sxdp->sxdp_ifindex;
+	xs->queue_id = sxdp->sxdp_queue_id;
+
+out_unlock:
+	if (err)
+		dev_put(dev);
+out_release:
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
 static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, unsigned int optlen)
 {
@@ -212,7 +335,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.family =	PF_XDP,
 	.owner =	THIS_MODULE,
 	.release =	xsk_release,
-	.bind =		sock_no_bind,
+	.bind =		xsk_bind,
 	.connect =	sock_no_connect,
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
-- 
2.14.1


* [RFC PATCH v2 06/14] xsk: add Rx receive functions and poll support
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (4 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 05/14] xsk: add support for bind for Rx Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 07/14] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

Here, the actual receive functions of AF_XDP are implemented. They
will be called from the XDP layers in a later commit.

There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).

Support for the poll syscall is also implemented.
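
For illustration only (a rough sketch, not part of this patch), a
user-space receive loop can now simply block in poll() on the socket.
"xsk_fd" is assumed to be a bound AF_XDP socket with its rings already
set up, and process_rx_ring() is a hypothetical helper that consumes
descriptors from the mmapped RX ring:

#include <poll.h>

static void rx_loop(int xsk_fd)
{
	struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

	for (;;) {
		if (poll(&pfd, 1, -1) <= 0)
			continue;
		if (pfd.revents & POLLIN)
			process_rx_ring(xsk_fd); /* hypothetical helper */
	}
}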

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xdp_umem.h  |  18 +++++
 net/xdp/xsk.c       |  81 ++++++++++++++++++++-
 net/xdp/xsk_queue.h | 206 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 304 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index ad041b911b38..5e7105b7760b 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -36,6 +36,24 @@ struct xdp_umem {
 	struct user_struct *user;
 };
 
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
+{
+	u64 pg, off;
+	char *data;
+
+	pg = idx >> umem->nfpplog2;
+	off = (idx - (pg << umem->nfpplog2)) << umem->frame_size_log2;
+
+	data = page_address(umem->pgs[pg]);
+	return data + off;
+}
+
+static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
+						    u32 idx)
+{
+	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+}
+
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index d99a1b830f94..a60b1fcfb2b3 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,10 +35,14 @@
 #include "xsk_queue.h"
 #include "xdp_umem.h"
 
+#define RX_BATCH_SIZE 16
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
 	struct xsk_queue *rx;
+	struct xskq_iter rx_it;
+	u64 rx_dropped;
 	struct net_device *dev;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
@@ -52,6 +56,74 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static inline int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	u32 len = xdp->data_end - xdp->data;
+	void *buffer;
+	int err = 0;
+	u32 id;
+
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	if (!xskq_next_frame_deq(xs->umem->fq, &xs->rx_it, RX_BATCH_SIZE))
+		return -ENOSPC;
+
+	id = xdp_umem_get_id(xs->umem->fq, &xs->rx_it);
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, id);
+	memcpy(buffer, xdp->data, len);
+	err = xskq_rxtx_enq_frame(xs->rx, id, len, xs->umem->frame_headroom);
+	if (err)
+		xskq_deq_return_frame(&xs->rx_it);
+
+	return err;
+}
+
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (!err)
+		page_frag_free(xdp->data);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+void xsk_flush(struct xdp_sock *xs)
+{
+	xskq_enq_flush(xs->rx);
+	xs->sk.sk_data_ready(&xs->sk);
+}
+
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (!err)
+		xsk_flush(xs);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+static unsigned int xsk_poll(struct file *file, struct socket *sock,
+			     struct poll_table_struct *wait)
+{
+	unsigned int mask = datagram_poll(file, sock, wait);
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (xs->rx && !xskq_empty(xs->rx))
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
 static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 			  bool umem_queue)
 {
@@ -190,6 +262,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
 		err = -EINVAL;
 		goto out_unlock;
+	} else {
+		/* This xsk has its own umem. */
+		xskq_set_umem(xs->umem->fq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -204,6 +279,10 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->ifindex = sxdp->sxdp_ifindex;
 	xs->queue_id = sxdp->sxdp_queue_id;
 
+	xskq_init_iter(&xs->rx_it);
+
+	xskq_set_umem(xs->rx, &xs->umem->props);
+
 out_unlock:
 	if (err)
 		dev_put(dev);
@@ -340,7 +419,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	sock_no_getname,
-	.poll =		sock_no_poll,
+	.poll =		xsk_poll,
 	.ioctl =	sock_no_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index d79b613a9e0a..af6e651f1207 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -37,6 +37,187 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+struct xskq_iter {
+	u32 head;
+	u32 tail;
+	struct xdp_desc desc_copy;
+};
+
+/* Common functions operating for both RXTX and umem queues */
+
+static inline bool xskq_is_valid_rx_entry(struct xsk_queue *q,
+					  u32 idx)
+{
+	if (unlikely(idx >= q->umem_props->nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+	return true;
+}
+
+static inline bool xskq_is_valid_tx_entry(struct xsk_queue *q,
+					  struct xdp_desc *d)
+{
+	u32 buff_len;
+
+	if (unlikely(d->idx >= q->umem_props->nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	buff_len = q->umem_props->frame_size;
+	if (unlikely(d->len > buff_len || d->len == 0 ||
+		     d->offset > buff_len || d->offset + d->len > buff_len)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	return true;
+}
+
+static inline u32 xskq_nb_free(struct xsk_queue *q, u32 head_idx, u32 dcnt)
+{
+	u32 free_entries = q->nentries - (head_idx - q->cached_tail);
+
+	if (free_entries >= dcnt)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_tail = READ_ONCE(q->ring->tail_idx);
+	return q->nentries - (head_idx - q->cached_tail);
+}
+
+static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
+{
+	u32 entries = q->cached_head - q->cached_tail;
+
+	if (entries == 0)
+		/* Refresh the local head pointer */
+		q->cached_head = READ_ONCE(q->ring->head_idx);
+
+	entries = q->cached_head - q->cached_tail;
+	return (entries > dcnt) ? dcnt : entries;
+}
+
+static inline bool xskq_empty(struct xsk_queue *q)
+{
+	if (xskq_nb_free(q, q->cached_head, 1) == q->nentries)
+		return true;
+	return false;
+}
+
+static inline bool xskq_full(struct xsk_queue *q)
+{
+	if (xskq_nb_avail(q, q->nentries) == q->nentries)
+		return true;
+	return false;
+}
+
+static inline void xskq_init_iter(struct xskq_iter *it)
+{
+	it->head = 0;
+	it->tail = 0;
+}
+
+static inline void xskq_set_umem(struct xsk_queue *q,
+				 struct xdp_umem_props *umem_props)
+{
+	q->umem_props = umem_props;
+}
+
+static inline bool xskq_iter_end(struct xskq_iter *it)
+{
+	return it->tail == it->head;
+}
+
+static inline void xskq_iter_validate(struct xsk_queue *q,
+				      struct xskq_iter *it)
+{
+	while (!xskq_iter_end(it)) {
+		unsigned int idx = it->tail & q->ring_mask;
+
+		if (q->validation == XSK_VALIDATION_TX) {
+			struct xdp_rxtx_queue *ring =
+				(struct xdp_rxtx_queue *)q->ring;
+
+			it->desc_copy.idx = ring->desc[idx].idx;
+			it->desc_copy.len = ring->desc[idx].len;
+			it->desc_copy.offset = ring->desc[idx].offset;
+
+			if (xskq_is_valid_tx_entry(q, &it->desc_copy))
+				break;
+		} else {
+			/* XSK_VALIDATION_RX */
+			struct xdp_umem_queue *ring =
+				(struct xdp_umem_queue *)q->ring;
+
+			if (xskq_is_valid_rx_entry(q, ring->desc[idx]))
+				break;
+		}
+
+		it->tail++;
+	}
+}
+
+static inline void xskq_deq_iter(struct xsk_queue *q,
+				 struct xskq_iter *it, int cnt)
+{
+	it->tail = q->cached_tail;
+	it->head = q->cached_tail + xskq_nb_avail(q, cnt);
+
+	/* Order tail and data */
+	smp_rmb();
+
+	xskq_iter_validate(q, it);
+}
+
+static inline void xskq_deq_iter_next(struct xsk_queue *q,
+				      struct xskq_iter *it)
+{
+	it->tail++;
+	xskq_iter_validate(q, it);
+}
+
+static inline void xskq_deq_iter_done(struct xsk_queue *q,
+				      struct xskq_iter *it)
+{
+	q->cached_tail = it->tail;
+	WRITE_ONCE(q->ring->tail_idx, it->tail);
+}
+
+static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
+{
+	return q ? q->invalid_descs : 0;
+}
+
+static inline bool xskq_next_frame_deq(struct xsk_queue *q,
+				       struct xskq_iter *it,
+				       u32 batch_size)
+{
+	if (xskq_iter_end(it)) {
+		xskq_deq_iter_done(q, it);
+		xskq_deq_iter(q, it, batch_size);
+		return !xskq_iter_end(it);
+	}
+
+	xskq_deq_iter_next(q, it);
+	return !xskq_iter_end(it);
+}
+
+static inline void xskq_deq_return_frame(struct xskq_iter *it)
+{
+	it->tail--;
+}
+
+static inline void xskq_enq_flush(struct xsk_queue *q)
+{
+	/* Order flags and data */
+	smp_wmb();
+
+	WRITE_ONCE(q->ring->head_idx, q->iter_head_idx);
+	q->cached_head = q->iter_head_idx;
+}
+
 /* Functions operating on RXTX queues only */
 
 static inline u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
@@ -45,6 +226,23 @@ static inline u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
 		q->nentries * sizeof(struct xdp_desc));
 }
 
+static inline int xskq_rxtx_enq_frame(struct xsk_queue *q,
+				      u32 id, u32 len, u16 offset)
+{
+	struct xdp_rxtx_queue *ring = (struct xdp_rxtx_queue *)q->ring;
+	unsigned int idx;
+
+	if (xskq_nb_free(q, q->iter_head_idx, 1) == 0)
+		return -ENOSPC;
+
+	idx = (q->iter_head_idx++) & q->ring_mask;
+	ring->desc[idx].idx = id;
+	ring->desc[idx].len = len;
+	ring->desc[idx].offset = offset;
+
+	return 0;
+}
+
 /* Functions operating on UMEM queues only */
 
 static inline u32 xskq_umem_get_ring_size(struct xsk_queue *q)
@@ -52,6 +250,14 @@ static inline u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 	return sizeof(struct xdp_umem_queue) + q->nentries * sizeof(u32);
 }
 
+static inline u32 xdp_umem_get_id(struct xsk_queue *q,
+				  struct xskq_iter *it)
+{
+	struct xdp_umem_queue *ring = (struct xdp_umem_queue *)q->ring;
+
+	return ring->desc[it->tail & q->ring_mask];
+}
+
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q_ops);
 
-- 
2.14.1


* [RFC PATCH v2 07/14] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (5 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 06/14] xsk: add Rx receive functions and poll support Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 08/14] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

The xskmap is yet another BPF map type, very much inspired by
dev/cpu/sockmap, that holds AF_XDP sockets. A user application adds
AF_XDP sockets to the map, and by using the bpf_redirect_map helper,
an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to a certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.
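
For illustration only (a rough sketch of the intended usage, shown in
full in the sample application of patch 14): the XDP program picks a
socket out of the xskmap with bpf_redirect_map(), and user space puts
the AF_XDP socket fd into the map. "xsks_map_fd" and "xsk_fd" are
assumed to already exist.

/* XDP program side: redirect everything to the socket at index 0. */
struct bpf_map_def SEC("maps") xsks_map = {
	.type		= BPF_MAP_TYPE_XSKMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= 4,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	return bpf_redirect_map(&xsks_map, 0, 0);
}

/* User-space side: store the AF_XDP socket fd at index 0 of the map. */
int key = 0;

bpf_map_update_elem(xsks_map_fd, &key, &xsk_fd, 0);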

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/bpf.h       |  26 +++++
 include/linux/bpf_types.h |   3 +
 include/net/xdp_sock.h    |  34 ++++++
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/Makefile       |   3 +
 kernel/bpf/verifier.c     |   8 +-
 kernel/bpf/xskmap.c       | 281 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 354 insertions(+), 2 deletions(-)
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 kernel/bpf/xskmap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 819229c80eca..0fe9d080adc3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -657,6 +657,32 @@ static inline int sock_map_prog(struct bpf_map *map,
 }
 #endif
 
+#if defined(CONFIG_XDP_SOCKETS)
+struct xdp_sock;
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key);
+int __xsk_map_redirect(struct bpf_map *map, u32 index,
+		       struct xdp_buff *xdp, struct xdp_sock *xs);
+void __xsk_map_flush(struct bpf_map *map);
+#else
+struct xdp_sock;
+static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
+						     u32 key)
+{
+	return NULL;
+}
+
+static inline int __xsk_map_redirect(struct bpf_map *map, u32 index,
+				     struct xdp_buff *xdp,
+				     struct xdp_sock *xs)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void __xsk_map_flush(struct bpf_map *map)
+{
+}
+#endif
+
 /* verifier prototypes for helper functions called from eBPF programs */
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5e2e8a49fb21..b525862c98ab 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,4 +47,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
+#if defined(CONFIG_XDP_SOCKETS)
+BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
+#endif
 #endif
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
new file mode 100644
index 000000000000..80d119a685b2
--- /dev/null
+++ b/include/net/xdp_sock.h
@@ -0,0 +1,34 @@
+/*
+ * AF_XDP internal functions
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_SOCK_H
+#define _LINUX_XDP_SOCK_H
+
+struct xdp_sock;
+struct xdp_buff;
+#ifdef CONFIG_XDP_SOCKETS
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+void xsk_flush(struct xdp_sock *xs);
+#else
+static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline void xsk_flush(struct xdp_sock *xs)
+{
+}
+#endif /* CONFIG_XDP_SOCKETS */
+
+#endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 18b7c510c511..c8e1d2977712 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -114,6 +114,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_XSKMAP,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index a713fd23ec88..3c59d9bcae14 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -7,6 +7,9 @@ obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
+ifeq ($(CONFIG_XDP_SOCKETS),y)
+obj-$(CONFIG_BPF_SYSCALL) += xskmap.o
+endif
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 ifeq ($(CONFIG_STREAM_PARSER),y)
 ifeq ($(CONFIG_INET),y)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e9f7c20691c1..46d525539a3b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2059,8 +2059,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
-	/* Restrict bpf side of cpumap, open when use-cases appear */
+	/* Restrict bpf side of cpumap and xskmap, open when use-cases
+	 * appear.
+	 */
 	case BPF_MAP_TYPE_CPUMAP:
+	case BPF_MAP_TYPE_XSKMAP:
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
@@ -2107,7 +2110,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_redirect_map:
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
-		    map->map_type != BPF_MAP_TYPE_CPUMAP)
+		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
+		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
 	case BPF_FUNC_sk_redirect_map:
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
new file mode 100644
index 000000000000..77553f24daa2
--- /dev/null
+++ b/kernel/bpf/xskmap.c
@@ -0,0 +1,281 @@
+/*
+ * XSKMAP used for AF_XDP sockets
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/capability.h>
+#include <net/xdp_sock.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <net/sock.h>
+
+struct xsk_map_entry {
+	struct xdp_sock *xs;
+	struct rcu_head rcu;
+};
+
+struct xsk_map {
+	struct bpf_map map;
+	struct xsk_map_entry **xsk_map;
+	unsigned long __percpu *flush_needed;
+};
+
+static u64 xsk_map_bitmap_size(const union bpf_attr *attr)
+{
+	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
+}
+
+static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
+{
+	struct xsk_map *m;
+	int err = -EINVAL;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size != 4 ||
+	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
+		return ERR_PTR(-EINVAL);
+
+	m = kzalloc(sizeof(*m), GFP_USER);
+	if (!m)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&m->map, attr);
+
+	cost = (u64)m->map.max_entries * sizeof(struct xsk_map_entry *);
+	cost += xsk_map_bitmap_size(attr) * num_possible_cpus();
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_m;
+
+	m->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* Notice returns -EPERM on if map size is larger than memlock limit */
+	err = bpf_map_precharge_memlock(m->map.pages);
+	if (err)
+		goto free_m;
+
+	m->flush_needed = __alloc_percpu(xsk_map_bitmap_size(attr),
+					    __alignof__(unsigned long));
+	if (!m->flush_needed)
+		goto free_m;
+
+	m->xsk_map = bpf_map_area_alloc(m->map.max_entries *
+					   sizeof(struct xsk_map_entry *),
+					   m->map.numa_node);
+	if (!m->xsk_map)
+		goto free_percpu;
+	return &m->map;
+
+free_percpu:
+	free_percpu(m->flush_needed);
+free_m:
+	kfree(m);
+	return ERR_PTR(err);
+}
+
+static void xsk_map_free(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	int i, cpu;
+
+	/* At this point bpf_prog->aux->refcnt == 0 and this
+	 * map->refcnt == 0, so the programs (can be more than one
+	 * that used this map) were disconnected from events. Wait for
+	 * outstanding critical sections in these programs to
+	 * complete. The rcu critical section only guarantees no
+	 * further reads against xsk_map. It does __not__ ensure
+	 * pending flush operations (if any) are complete.
+	 */
+
+	synchronize_rcu();
+
+	/* To ensure all pending flush operations have completed wait
+	 * for flush bitmap to indicate all flush_needed bits to be
+	 * zero on _all_ cpus.  Because the above synchronize_rcu()
+	 * ensures the map is disconnected from the program we can
+	 * assume no new bits will be set.
+	 */
+	for_each_online_cpu(cpu) {
+		unsigned long *bitmap = per_cpu_ptr(m->flush_needed, cpu);
+
+		while (!bitmap_empty(bitmap, map->max_entries))
+			cond_resched();
+	}
+
+	for (i = 0; i < map->max_entries; i++) {
+		struct xsk_map_entry *entry;
+
+		entry = m->xsk_map[i];
+		if (!entry)
+			continue;
+
+		sock_put((struct sock *)entry->xs);
+		kfree(entry);
+	}
+
+	free_percpu(m->flush_needed);
+	bpf_map_area_free(m->xsk_map);
+	kfree(m);
+}
+
+static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = next_key;
+
+	if (index >= m->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == m->map.max_entries - 1)
+		return -ENOENT;
+	*next = index + 1;
+	return 0;
+}
+
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xsk_map_entry *entry;
+
+	if (key >= map->max_entries)
+		return NULL;
+
+	entry = READ_ONCE(m->xsk_map[key]);
+	return entry ? entry->xs : NULL;
+}
+
+int __xsk_map_redirect(struct bpf_map *map, u32 index,
+		       struct xdp_buff *xdp, struct xdp_sock *xs)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	unsigned long *bitmap = this_cpu_ptr(m->flush_needed);
+	int err;
+
+	err = xsk_rcv(xs, xdp);
+	if (err)
+		return err;
+
+	__set_bit(index, bitmap);
+	return 0;
+}
+
+void __xsk_map_flush(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	unsigned long *bitmap = this_cpu_ptr(m->flush_needed);
+	u32 bit;
+
+	for_each_set_bit(bit, bitmap, map->max_entries) {
+		struct xsk_map_entry *entry = READ_ONCE(m->xsk_map[bit]);
+
+		/* This is possible if the entry is removed by user
+		 * space between xdp redirect and flush op.
+		 */
+		if (unlikely(!entry))
+			continue;
+
+		__clear_bit(bit, bitmap);
+		xsk_flush(entry->xs);
+	}
+}
+
+static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
+static void __xsk_map_entry_free(struct rcu_head *rcu)
+{
+	struct xsk_map_entry *entry;
+
+	entry = container_of(rcu, struct xsk_map_entry, rcu);
+	xsk_flush(entry->xs);
+	sock_put((struct sock *)entry->xs);
+	kfree(entry);
+}
+
+static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
+			       u64 map_flags)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xsk_map_entry *entry, *old_entry;
+	u32 i = *(u32 *)key, fd = *(u32 *)value;
+	struct socket *sock;
+	int err;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+	if (unlikely(i >= m->map.max_entries))
+		return -E2BIG;
+	if (unlikely(map_flags == BPF_NOEXIST))
+		return -EEXIST;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return err;
+
+	if (sock->sk->sk_family != PF_XDP) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	entry = kmalloc_node(sizeof(*entry), GFP_ATOMIC | __GFP_NOWARN,
+			     map->numa_node);
+	if (!entry) {
+		sockfd_put(sock);
+		return -ENOMEM;
+	}
+
+	sock_hold(sock->sk);
+	entry->xs = (struct xdp_sock *)sock->sk;
+
+	old_entry = xchg(&m->xsk_map[i], entry);
+	if (old_entry)
+		call_rcu(&old_entry->rcu, __xsk_map_entry_free);
+
+	sockfd_put(sock);
+	return 0;
+}
+
+static int xsk_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xsk_map_entry *old_entry;
+	int k = *(u32 *)key;
+
+	if (k >= map->max_entries)
+		return -EINVAL;
+
+	old_entry = xchg(&m->xsk_map[k], NULL);
+	if (old_entry)
+		call_rcu(&old_entry->rcu, __xsk_map_entry_free);
+
+	return 0;
+}
+
+const struct bpf_map_ops xsk_map_ops = {
+	.map_alloc = xsk_map_alloc,
+	.map_free = xsk_map_free,
+	.map_get_next_key = xsk_map_get_next_key,
+	.map_lookup_elem = xsk_map_lookup_elem,
+	.map_update_elem = xsk_map_update_elem,
+	.map_delete_elem = xsk_map_delete_elem,
+};
+
+
-- 
2.14.1


* [RFC PATCH v2 08/14] xsk: wire up XDP_DRV side of AF_XDP
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (6 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 07/14] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 09/14] xsk: wire up XDP_SKB " Björn Töpel
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to the XDP_DRV layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/core/filter.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 00c711c5f1a2..4b09c0a02814 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2768,7 +2768,8 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 {
 	int err;
 
-	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP: {
 		struct net_device *dev = fwd;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
@@ -2778,14 +2779,25 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);
-
-	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP) {
+		break;
+	}
+	case BPF_MAP_TYPE_CPUMAP: {
 		struct bpf_cpu_map_entry *rcpu = fwd;
 
 		err = cpu_map_enqueue(rcpu, xdp, dev_rx);
 		if (err)
 			return err;
 		__cpu_map_insert_ctx(map, index);
+		break;
+	}
+	case BPF_MAP_TYPE_XSKMAP: {
+		struct xdp_sock *xs = fwd;
+
+		err = __xsk_map_redirect(map, index, xdp, xs);
+		return err;
+	}
+	default:
+		break;
 	}
 	return 0;
 }
@@ -2804,6 +2816,9 @@ void xdp_do_flush_map(void)
 		case BPF_MAP_TYPE_CPUMAP:
 			__cpu_map_flush(map);
 			break;
+		case BPF_MAP_TYPE_XSKMAP:
+			__xsk_map_flush(map);
+			break;
 		default:
 			break;
 		}
@@ -2818,6 +2833,8 @@ static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 		return __dev_map_lookup_elem(map, index);
 	case BPF_MAP_TYPE_CPUMAP:
 		return __cpu_map_lookup_elem(map, index);
+	case BPF_MAP_TYPE_XSKMAP:
+		return __xsk_map_lookup_elem(map, index);
 	default:
 		return NULL;
 	}
-- 
2.14.1


* [RFC PATCH v2 09/14] xsk: wire up XDP_SKB side of AF_XDP
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (7 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 08/14] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 10/14] xsk: add umem completion queue support and mmap Björn Töpel
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to the XDP_SKB layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h |  2 +-
 include/net/xdp_sock.h |  6 ++++++
 net/core/dev.c         | 28 +++++++++++++++-------------
 net/core/filter.c      | 17 ++++++++++++++---
 4 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 109d05ccea9a..6fa5e53d5fba 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -763,7 +763,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
  * This does not appear to be a real limitation for existing software.
  */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *prog);
+			    struct xdp_buff *xdp, struct bpf_prog *prog);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 80d119a685b2..7b3b29ba9ff0 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -18,9 +18,15 @@
 struct xdp_sock;
 struct xdp_buff;
 #ifdef CONFIG_XDP_SOCKETS
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 #else
+static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
 static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	return -ENOTSUPP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 97a96df4b6da..3632d959af1b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3985,11 +3985,11 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 }
 
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
 				     struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	u32 metalen, act = XDP_DROP;
-	struct xdp_buff xdp;
 	void *orig_data;
 	int hlen, off;
 	u32 mac_len;
@@ -4025,18 +4025,18 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
 	hlen = skb_headlen(skb) + mac_len;
-	xdp.data = skb->data - mac_len;
-	xdp.data_meta = xdp.data;
-	xdp.data_end = xdp.data + hlen;
-	xdp.data_hard_start = skb->data - skb_headroom(skb);
-	orig_data = xdp.data;
+	xdp->data = skb->data - mac_len;
+	xdp->data_meta = xdp->data;
+	xdp->data_end = xdp->data + hlen;
+	xdp->data_hard_start = skb->data - skb_headroom(skb);
+	orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
-	xdp.rxq = &rxqueue->xdp_rxq;
+	xdp->rxq = &rxqueue->xdp_rxq;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	off = xdp.data - orig_data;
+	off = xdp->data - orig_data;
 	if (off > 0)
 		__skb_pull(skb, off);
 	else if (off < 0)
@@ -4049,7 +4049,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, mac_len);
 		break;
 	case XDP_PASS:
-		metalen = xdp.data - xdp.data_meta;
+		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);
 		break;
@@ -4099,17 +4099,19 @@ static struct static_key generic_xdp_needed __read_mostly;
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 {
 	if (xdp_prog) {
-		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		struct xdp_buff xdp;
+		u32 act;
 		int err;
 
+		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
 				err = xdp_do_generic_redirect(skb->dev, skb,
-							      xdp_prog);
+							      &xdp, xdp_prog);
 				if (err)
 					goto out_redir;
-			/* fallthru to submit skb */
+				break;
 			case XDP_TX:
 				generic_xdp_tx(skb, xdp_prog);
 				break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4b09c0a02814..fc71b53542d2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
 #include <net/busy_poll.h>
 #include <net/tcp.h>
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2932,13 +2933,14 @@ static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
+				       struct xdp_buff *xdp,
 				       struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	unsigned long map_owner = ri->map_owner;
 	struct bpf_map *map = ri->map;
-	struct net_device *fwd = NULL;
 	u32 index = ri->ifindex;
+	void *fwd = NULL;
 	int err = 0;
 
 	ri->ifindex = 0;
@@ -2960,6 +2962,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 		if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
 			goto err;
 		skb->dev = fwd;
+		generic_xdp_tx(skb, xdp_prog);
+	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
+		struct xdp_sock *xs = fwd;
+
+		err = xsk_generic_rcv(xs, xdp);
+		if (err)
+			goto err;
+		consume_skb(skb);
 	} else {
 		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
 		err = -EBADRQC;
@@ -2974,7 +2984,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 }
 
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *xdp_prog)
+			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	u32 index = ri->ifindex;
@@ -2982,7 +2992,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	int err = 0;
 
 	if (ri->map)
-		return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
+		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog);
 
 	ri->ifindex = 0;
 	fwd = dev_get_by_index_rcu(dev_net(dev), index);
@@ -2996,6 +3006,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 	_trace_xdp_redirect(dev, xdp_prog, index);
+	generic_xdp_tx(skb, xdp_prog);
 	return 0;
 err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
-- 
2.14.1


* [RFC PATCH v2 10/14] xsk: add umem completion queue support and mmap
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (8 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 09/14] xsk: wire up XDP_SKB " Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 11/14] xsk: add Tx queue setup and mmap support Björn Töpel
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
process can ask the kernel to allocate a queue (ring buffer) and also
mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.

The queue is used to explicitly pass ownership of umem frames from the
kernel to the user process. The Tx path will use it to tell user space
that a certain frame has been transmitted, so that user space can
reuse the frame for something else, if it wishes.
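
For illustration only (a rough sketch, not part of this patch), the
user-space calls could look like the snippet below. "sfd" is an AF_XDP
socket with a registered umem, and "ring_size" stands in for the size
of the ring header plus descriptor array as laid out in
<linux/if_xdp.h>:

int entries = 1024;	/* example value */
void *cq;

setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_QUEUE,
	   &entries, sizeof(entries));
cq = mmap(NULL, ring_size, PROT_READ | PROT_WRITE,
	  MAP_SHARED | MAP_POPULATE, sfd,
	  XDP_UMEM_PGOFF_COMPLETION_QUEUE);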

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xdp_umem.c          | 7 ++++++-
 net/xdp/xdp_umem.h          | 1 +
 net/xdp/xsk.c               | 6 +++++-
 4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e3593df7323c..aec2eee4e751 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -35,6 +35,7 @@ struct sockaddr_xdp {
 #define XDP_RX_QUEUE		  1
 #define XDP_UMEM_REG		  3
 #define XDP_UMEM_FILL_QUEUE	  4
+#define XDP_UMEM_COMPLETION_QUEUE 5
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -46,6 +47,7 @@ struct xdp_umem_reg {
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_QUEUE			  0
 #define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
+#define XDP_UMEM_PGOFF_COMPLETION_QUEUE 0x180000000
 
 struct xdp_desc {
 	__u32 idx;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 96395beb980e..8912022d22ee 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -71,6 +71,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->fq = NULL;
 	}
 
+	if (umem->cq) {
+		xskq_destroy(umem->cq);
+		umem->cq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
@@ -250,5 +255,5 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem)
 {
-	return umem->fq;
+	return (umem->fq && umem->cq);
 }
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 5e7105b7760b..3eda63797516 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -23,6 +23,7 @@
 
 struct xdp_umem {
 	struct xsk_queue *fq;
+	struct xsk_queue *cq;
 	struct pid *pid;
 	struct page **pgs;
 	unsigned long address;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a60b1fcfb2b3..2071365c39b1 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -347,6 +347,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return 0;
 	}
 	case XDP_UMEM_FILL_QUEUE:
+	case XDP_UMEM_COMPLETION_QUEUE:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -358,7 +359,8 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->umem->fq;
+		q = (optname == XDP_UMEM_FILL_QUEUE) ? &xs->umem->fq :
+			&xs->umem->cq;
 		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -389,6 +391,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 		if (offset == XDP_UMEM_PGOFF_FILL_QUEUE)
 			q = xs->umem->fq;
+		else if (offset == XDP_UMEM_PGOFF_COMPLETION_QUEUE)
+			q = xs->umem->cq;
 		else
 			return -EINVAL;
 	}
-- 
2.14.1


* [RFC PATCH v2 11/14] xsk: add Tx queue setup and mmap support
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (9 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 10/14] xsk: add umem completion queue support and mmap Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 12/14] xsk: support for Tx Björn Töpel
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh

From: Magnus Karlsson <magnus.karlsson@intel.com>

Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
a queue through which the user process can pass frames to be
transmitted by the kernel.

The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.
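
As a rough sketch (not part of this patch), the setup mirrors the
other rings, only with its own socket option and mmap offset; "sfd"
and "ring_size" are assumed as before:

int entries = 1024;	/* example value */
void *tx;

setsockopt(sfd, SOL_XDP, XDP_TX_QUEUE, &entries, sizeof(entries));
tx = mmap(NULL, ring_size, PROT_READ | PROT_WRITE,
	  MAP_SHARED | MAP_POPULATE, sfd, XDP_PGOFF_TX_QUEUE);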

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xsk.c               | 9 +++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index aec2eee4e751..6cab56fb8be6 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -33,6 +33,7 @@ struct sockaddr_xdp {
 
 /* XDP socket options */
 #define XDP_RX_QUEUE		  1
+#define XDP_TX_QUEUE		  2
 #define XDP_UMEM_REG		  3
 #define XDP_UMEM_FILL_QUEUE	  4
 #define XDP_UMEM_COMPLETION_QUEUE 5
@@ -46,6 +47,7 @@ struct xdp_umem_reg {
 
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_QUEUE			  0
+#define XDP_PGOFF_TX_QUEUE		 0x80000000
 #define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
 #define XDP_UMEM_PGOFF_COMPLETION_QUEUE 0x180000000
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2071365c39b1..a9bceb0958d8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -43,6 +43,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct xskq_iter rx_it;
 	u64 rx_dropped;
+	struct xsk_queue *tx;
 	struct net_device *dev;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
@@ -218,7 +219,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_release;
 	}
 
-	if (!xs->rx) {
+	if (!xs->rx && !xs->tx) {
 		err = -EINVAL;
 		goto out_unlock;
 	}
@@ -303,6 +304,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case XDP_RX_QUEUE:
+	case XDP_TX_QUEUE:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -313,7 +315,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->rx;
+		q = (optname == XDP_TX_QUEUE) ? &xs->tx : &xs->rx;
 		err = xsk_init_queue(entries, q, false);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -385,6 +387,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 	if (offset == XDP_PGOFF_RX_QUEUE) {
 		q = xs->rx;
+	} else if (offset == XDP_PGOFF_TX_QUEUE) {
+		q = xs->tx;
 	} else {
 		if (!xs->umem)
 			return -EINVAL;
@@ -443,6 +447,7 @@ static void xsk_destruct(struct sock *sk)
 		return;
 
 	xskq_destroy(xs->rx);
+	xskq_destroy(xs->tx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
-- 
2.14.1


* [RFC PATCH v2 12/14] xsk: support for Tx
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (10 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 11/14] xsk: add Tx queue setup and mmap support Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 13/14] xsk: statistics support Björn Töpel
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, Tx support is added. The user fills the Tx queue with frames to
be sent by the kernel, and lets the kernel know using the sendmsg
syscall.
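
As a rough sketch (not part of this patch): after placing descriptors
on the mmapped Tx ring, user space kicks the kernel with a plain
sendmsg. Since this RFC only supports non-blocking operation,
MSG_DONTWAIT has to be set:

struct msghdr msg = {};
ssize_t ret;

/* The frames to send are already on the Tx ring; sendmsg() only tells
 * the kernel to start processing them.
 */
ret = sendmsg(sfd, &msg, MSG_DONTWAIT);
if (ret < 0 && errno == EAGAIN)
	; /* no room right now, e.g. the completion queue is full */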

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk.c       | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h |  33 ++++++++++++
 2 files changed, 183 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a9bceb0958d8..685b6f360628 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,6 +35,7 @@
 #include "xsk_queue.h"
 #include "xdp_umem.h"
 
+#define TX_BATCH_SIZE 16
 #define RX_BATCH_SIZE 16
 
 struct xdp_sock {
@@ -44,10 +45,12 @@ struct xdp_sock {
 	struct xskq_iter rx_it;
 	u64 rx_dropped;
 	struct xsk_queue *tx;
+	struct xskq_iter tx_it;
 	struct net_device *dev;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	struct xdp_umem *umem;
+	u32 tx_umem_head;
 	u32 ifindex;
 	u16 queue_id;
 };
@@ -112,6 +115,146 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+static void xsk_destruct_skb(struct sk_buff *skb)
+{
+	u32 buff_id = (u32)(long)skb_shinfo(skb)->destructor_arg;
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	WARN_ON_ONCE(xdp_umem_enq_one(xs->umem->cq, buff_id));
+
+	sock_wfree(skb);
+}
+
+static int xsk_xmit_skb(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	struct sk_buff *orig_skb = skb;
+	struct netdev_queue *txq;
+	int ret = NETDEV_TX_BUSY;
+	bool again = false;
+
+	if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
+		goto drop;
+
+	skb = validate_xmit_skb_list(skb, dev, &again);
+	if (skb != orig_skb)
+		return NET_XMIT_DROP;
+
+	txq = skb_get_tx_queue(dev, skb);
+
+	local_bh_disable();
+
+	HARD_TX_LOCK(dev, txq, smp_processor_id());
+	if (!netif_xmit_frozen_or_drv_stopped(txq))
+		ret = netdev_start_xmit(skb, dev, txq, false);
+	HARD_TX_UNLOCK(dev, txq);
+
+	local_bh_enable();
+
+	if (!dev_xmit_complete(ret))
+		goto out_err;
+
+	return ret;
+drop:
+	atomic_long_inc(&dev->tx_dropped);
+out_err:
+	return NET_XMIT_DROP;
+}
+
+static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
+			    size_t total_len)
+{
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
+	struct xdp_sock *xs = xdp_sk(sk);
+	bool sent_frame = false;
+	struct sk_buff *skb;
+	u32 max_outstanding;
+	int err = 0;
+
+	if (unlikely(!xs->tx))
+		return -ENOBUFS;
+	if (need_wait)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xs->mutex);
+
+	max_outstanding = xskq_nb_free(xs->umem->cq, xs->tx_umem_head,
+				       TX_BATCH_SIZE);
+
+	while (xskq_next_frame_deq(xs->tx, &xs->tx_it, TX_BATCH_SIZE)) {
+		char *buffer;
+		u32 id, len;
+
+		if (max_outstanding-- == 0) {
+			err = -EAGAIN;
+			goto out_err;
+		}
+
+		len = xskq_rxtx_get_len(&xs->tx_it);
+		if (unlikely(len > xs->dev->mtu)) {
+			err = -EMSGSIZE;
+			goto out_err;
+		}
+
+		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		if (unlikely(!skb)) {
+			err = -EAGAIN;
+			goto out_err;
+		}
+
+		skb_put(skb, len);
+		id = xskq_rxtx_get_id(&xs->tx_it);
+		buffer = xdp_umem_get_data(xs->umem, id) +
+			 xskq_rxtx_get_offset(&xs->tx_it);
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err))
+			goto out_store;
+
+		skb->dev = xs->dev;
+		skb->priority = sk->sk_priority;
+		skb->mark = sk->sk_mark;
+		skb_set_queue_mapping(skb, xs->queue_id);
+		skb_shinfo(skb)->destructor_arg = (void *)(long)id;
+		skb->destructor = xsk_destruct_skb;
+
+		err = xsk_xmit_skb(skb);
+		/* Ignore NET_XMIT_CN as packet might have been sent */
+		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
+			err = -EAGAIN;
+			goto out_store;
+		}
+
+		xs->tx_umem_head++;
+		sent_frame = true;
+	}
+
+	goto out;
+
+out_store:
+	kfree_skb(skb);
+out_err:
+	xskq_deq_return_frame(&xs->tx_it);
+out:
+	if (sent_frame)
+		sk->sk_write_space(sk);
+
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
+static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (unlikely(!xs->dev))
+		return -ENXIO;
+	if (unlikely(!(xs->dev->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	return xsk_generic_xmit(sk, m, total_len);
+}
+
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
 {
@@ -121,6 +264,8 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
 
 	if (xs->rx && !xskq_empty(xs->rx))
 		mask |= POLLIN | POLLRDNORM;
+	if (xs->tx && !xskq_full(xs->tx))
+		mask |= POLLOUT | POLLWRNORM;
 
 	return mask;
 }
@@ -266,6 +411,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else {
 		/* This xsk has its own umem. */
 		xskq_set_umem(xs->umem->fq, &xs->umem->props);
+		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -281,8 +427,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->queue_id = sxdp->sxdp_queue_id;
 
 	xskq_init_iter(&xs->rx_it);
+	xskq_init_iter(&xs->tx_it);
+	xs->tx_umem_head = 0;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
+	xskq_set_umem(xs->tx, &xs->umem->props);
 
 out_unlock:
 	if (err)
@@ -433,7 +582,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
-	.sendmsg =	sock_no_sendmsg,
+	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index af6e651f1207..94edc7e7a503 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -209,6 +209,11 @@ static inline void xskq_deq_return_frame(struct xskq_iter *it)
 	it->tail--;
 }
 
+static inline void xskq_enq_return_frame(struct xsk_queue *q)
+{
+	q->iter_head_idx--;
+}
+
 static inline void xskq_enq_flush(struct xsk_queue *q)
 {
 	/* Order flags and data */
@@ -226,6 +231,21 @@ static inline u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
 		q->nentries * sizeof(struct xdp_desc));
 }
 
+static inline u32 xskq_rxtx_get_id(struct xskq_iter *it)
+{
+	return it->desc_copy.idx;
+}
+
+static inline u32 xskq_rxtx_get_len(struct xskq_iter *it)
+{
+	return it->desc_copy.len;
+}
+
+static inline u32 xskq_rxtx_get_offset(struct xskq_iter *it)
+{
+	return it->desc_copy.offset;
+}
+
 static inline int xskq_rxtx_enq_frame(struct xsk_queue *q,
 				      u32 id, u32 len, u16 offset)
 {
@@ -258,6 +278,19 @@ static inline u32 xdp_umem_get_id(struct xsk_queue *q,
 	return ring->desc[it->tail & q->ring_mask];
 }
 
+static inline int xdp_umem_enq_one(struct xsk_queue *q, u32 idx)
+{
+	struct xdp_umem_queue *ring = (struct xdp_umem_queue *)q->ring;
+
+	if (xskq_nb_free(q, q->iter_head_idx, 1) == 0)
+		return -ENOSPC;
+
+	ring->desc[q->iter_head_idx++ & q->ring_mask] = idx;
+
+	xskq_enq_flush(q);
+	return 0;
+}
+
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q_ops);
 
-- 
2.14.1


* [RFC PATCH v2 13/14] xsk: statistics support
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (11 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 12/14] xsk: support for Tx Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-03-27 16:59 ` [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets Björn Töpel
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit, a new getsockopt is added: XDP_STATISTICS. It is used
to obtain statistics from a socket.
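
As a rough sketch (not part of this patch), reading the counters from
an AF_XDP socket "sfd" could look like this; note that the kernel
expects optlen to be exactly sizeof(struct xdp_statistics):

struct xdp_statistics stats;
socklen_t optlen = sizeof(stats);

if (!getsockopt(sfd, SOL_XDP, XDP_STATISTICS, &stats, &optlen))
	printf("rx_dropped %llu rx_invalid_descs %llu tx_invalid_descs %llu\n",
	       (unsigned long long)stats.rx_dropped,
	       (unsigned long long)stats.rx_invalid_descs,
	       (unsigned long long)stats.tx_invalid_descs);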

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  7 +++++++
 net/xdp/xsk.c               | 42 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 6cab56fb8be6..4d8142c1596d 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -37,6 +37,7 @@ struct sockaddr_xdp {
 #define XDP_UMEM_REG		  3
 #define XDP_UMEM_FILL_QUEUE	  4
 #define XDP_UMEM_COMPLETION_QUEUE 5
+#define XDP_STATISTICS		  6
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -45,6 +46,12 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+struct xdp_statistics {
+	__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+};
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_QUEUE			  0
 #define XDP_PGOFF_TX_QUEUE		 0x80000000
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 685b6f360628..780a5b42d9b2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -523,6 +523,46 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int len;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case XDP_STATISTICS:
+	{
+		struct xdp_statistics stats;
+
+		if (len != sizeof(stats))
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		stats.rx_dropped = xs->rx_dropped;
+		stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
+		stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
+		mutex_unlock(&xs->mutex);
+
+		if (copy_to_user(optval, &stats, sizeof(stats)))
+			return -EFAULT;
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int xsk_mmap(struct file *file, struct socket *sock,
 		    struct vm_area_struct *vma)
 {
@@ -581,7 +621,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.getsockopt =	xsk_getsockopt,
 	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
-- 
2.14.1


* [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (12 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 13/14] xsk: statistics support Björn Töpel
@ 2018-03-27 16:59 ` Björn Töpel
  2018-04-12 11:05   ` Jesper Dangaard Brouer
  2018-03-28 21:18 ` [RFC PATCH v2 00/14] Introducing AF_XDP support Eric Leblond
  2018-04-09 21:51 ` William Tu
  15 siblings, 1 reply; 32+ messages in thread
From: Björn Töpel @ 2018-03-27 16:59 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh, Björn Töpel

From: Magnus Karlsson <magnus.karlsson@intel.com>

This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.

To showcase simple round-robin load balancing between a set of sockets
in an xskmap, set the RR_LB compile-time define to 1 in "xdpsock.h".

Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/Makefile       |   4 +
 samples/bpf/xdpsock.h      |  10 +
 samples/bpf/xdpsock_kern.c |  55 +++
 samples/bpf/xdpsock_user.c | 923 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 992 insertions(+)
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 2c2a587e0942..11c60c791228 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -147,6 +149,7 @@ always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -192,6 +195,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
new file mode 100644
index 000000000000..4479233dcaae
--- /dev/null
+++ b/samples/bpf/xdpsock.h
@@ -0,0 +1,10 @@
+#ifndef XDPSOCK_H_
+#define XDPSOCK_H_
+
+/* Power-of-2 number of sockets */
+#define MAX_SOCKS 4
+
+/* Round-robin receive */
+#define RR_LB 0
+
+#endif /* XDPSOCK_H_ */
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..69af48842182
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,55 @@
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#include "xdpsock.h"
+
+struct bpf_map_def SEC("maps") qidconf_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(int),
+	.max_entries	= 1,
+};
+
+struct bpf_map_def SEC("maps") xsks_map = {
+	.type = BPF_MAP_TYPE_XSKMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") rr_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(unsigned int),
+	.max_entries = 1,
+};
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	int *qidconf, key = 0, idx;
+	unsigned int *rr;
+
+	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
+	if (!qidconf)
+		return XDP_ABORTED;
+
+	if (*qidconf != ctx->rx_queue_index)
+		return XDP_PASS;
+
+#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
+	rr = bpf_map_lookup_elem(&rr_map, &key);
+	if (!rr)
+		return XDP_ABORTED;
+
+	*rr = (*rr + 1) & (MAX_SOCKS - 1);
+	idx = *rr;
+#else
+	idx = 0;
+#endif
+
+	return bpf_redirect_map(&xsks_map, idx, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
new file mode 100644
index 000000000000..e7639a560e85
--- /dev/null
+++ b/samples/bpf/xdpsock_user.c
@@ -0,0 +1,923 @@
+/*
+ *  Copyright(c) 2017 - 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/ethernet.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+#include <sys/types.h>
+#include <poll.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define NUM_FRAMES 131072
+#define FRAME_HEADROOM 0
+#define FRAME_SIZE 2048
+#define NUM_DESCS 1024
+#define BATCH_SIZE 16
+
+#define FQ_NUM_DESCS 1024
+#define CQ_NUM_DESCS 1024
+
+#define DEBUG_HEXDUMP 0
+
+typedef __u32 u32;
+
+static unsigned long start_time;
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static u32 opt_xdp_flags;
+static const char *opt_if = "";
+static int opt_ifindex;
+static int opt_queue;
+static int opt_poll;
+static int opt_shared_packet_buffer;
+
+struct xdp_umem_uqueue {
+	u32 cached_head;
+	u32 cached_tail;
+	u32 mask;
+	u32 size;
+	struct xdp_umem_queue *q;
+};
+
+struct xdp_umem {
+	char (*frames)[FRAME_SIZE];
+	struct xdp_umem_uqueue fq;
+	struct xdp_umem_uqueue cq;
+	int fd;
+};
+
+struct xdp_uqueue {
+	u32 cached_head;
+	u32 cached_tail;
+	u32 mask;
+	u32 size;
+	struct xdp_rxtx_queue *q;
+};
+
+struct xdpsock {
+	struct xdp_uqueue rx;
+	struct xdp_uqueue tx;
+	int sfd;
+	struct xdp_umem *umem;
+	u32 outstanding_tx;
+	unsigned long rx_npkts;
+	unsigned long tx_npkts;
+};
+
+#define MAX_SOCKS 4
+static int num_socks;
+struct xdpsock *xsks[MAX_SOCKS];
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
+static void dump_stats(void);
+
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
+				#expr ": errno: %d/\"%s\"\n",		\
+				__FILE__, __func__, __LINE__,		\
+				errno, strerror(errno));		\
+			dump_stats();					\
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+#define u_smp_rmb() barrier()
+#define u_smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+static const char pkt_data[] =
+	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 free_entries = q->size - (q->cached_head - q->cached_tail);
+
+	if (free_entries >= nb)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_tail = q->q->ptrs.tail_idx;
+
+	return q->size - (q->cached_head - q->cached_tail);
+}
+
+static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 free_entries = q->cached_tail - q->cached_head;
+
+	if (free_entries >= ndescs)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_tail = q->q->ptrs.tail_idx + q->size;
+	return q->cached_tail - q->cached_head;
+}
+
+static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 entries = q->cached_head - q->cached_tail;
+
+	if (entries == 0)
+		q->cached_head = q->q->ptrs.head_idx;
+
+	entries = q->cached_head - q->cached_tail;
+
+	return (entries > nb) ? nb : entries;
+}
+
+static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 entries = q->cached_head - q->cached_tail;
+
+	if (entries == 0)
+		q->cached_head = q->q->ptrs.head_idx;
+
+	entries = q->cached_head - q->cached_tail;
+	return (entries > ndescs) ? ndescs : entries;
+}
+
+static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
+					 struct xdp_desc *d,
+					 size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_head++ & fq->mask;
+
+		fq->q->desc[idx] = d[i].idx;
+	}
+
+	u_smp_wmb();
+
+	fq->q->ptrs.head_idx = fq->cached_head;
+
+	return 0;
+}
+
+static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
+				      size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_head++ & fq->mask;
+
+		fq->q->desc[idx] = d[i];
+	}
+
+	u_smp_wmb();
+
+	fq->q->ptrs.head_idx = fq->cached_head;
+
+	return 0;
+}
+
+static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
+					       u32 *d, size_t nb)
+{
+	u32 idx, i, entries = umem_nb_avail(cq, nb);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = cq->cached_tail++ & cq->mask;
+		d[i] = cq->q->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		cq->q->ptrs.tail_idx = cq->cached_tail;
+	}
+
+	return entries;
+}
+
+static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
+{
+	lassert(idx < NUM_FRAMES);
+	return &xsk->umem->frames[idx][off];
+}
+
+static inline int xq_enq(struct xdp_uqueue *uq,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	struct xdp_rxtx_queue *q = uq->q;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 idx = uq->cached_head++ & uq->mask;
+
+		q->desc[idx].idx	= descs[i].idx;
+		q->desc[idx].len	= descs[i].len;
+		q->desc[idx].offset	= descs[i].offset;
+	}
+
+	u_smp_wmb();
+
+	q->ptrs.head_idx = uq->cached_head;
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_uqueue *uq,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	struct xdp_rxtx_queue *q = uq->q;
+	unsigned int idx;
+	int i, entries;
+
+	entries = xq_nb_avail(uq, ndescs);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = uq->cached_tail++ & uq->mask;
+		descs[i] = q->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		q->ptrs.tail_idx = uq->cached_tail;
+	}
+
+	return entries;
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+#if DEBUG_HEXDUMP
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("length = %zu\n", length);
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame)
+{
+	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
+	return sizeof(pkt_data) - 1;
+}
+
+static struct xdp_umem *xdp_umem_configure(int sfd)
+{
+	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
+	struct xdp_umem_reg mr;
+	struct xdp_umem *umem;
+	void *bufs;
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+
+	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
+			       NUM_FRAMES * FRAME_SIZE) == 0);
+
+	mr.addr = (__u64)bufs;
+	mr.len = NUM_FRAMES * FRAME_SIZE;
+	mr.frame_size = FRAME_SIZE;
+	mr.frame_headroom = FRAME_HEADROOM;
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_QUEUE, &fq_size,
+			   sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_QUEUE, &cq_size,
+			   sizeof(int)) == 0);
+
+	umem->fq.q = mmap(0, sizeof(struct xdp_umem_queue) +
+			  FQ_NUM_DESCS * sizeof(u32),
+			  PROT_READ | PROT_WRITE,
+			  MAP_SHARED | MAP_POPULATE, sfd,
+			  XDP_UMEM_PGOFF_FILL_QUEUE);
+	lassert(umem->fq.q != MAP_FAILED);
+
+	umem->fq.mask = FQ_NUM_DESCS - 1;
+	umem->fq.size = FQ_NUM_DESCS;
+
+	umem->cq.q = mmap(0, sizeof(struct xdp_umem_queue) +
+			  CQ_NUM_DESCS * sizeof(u32),
+			  PROT_READ | PROT_WRITE,
+			  MAP_SHARED | MAP_POPULATE, sfd,
+			  XDP_UMEM_PGOFF_COMPLETION_QUEUE);
+	lassert(umem->cq.q != MAP_FAILED);
+
+	umem->cq.mask = CQ_NUM_DESCS - 1;
+	umem->cq.size = CQ_NUM_DESCS;
+
+	umem->frames = (char (*)[FRAME_SIZE])bufs;
+	umem->fd = sfd;
+
+	if (opt_bench == BENCH_TXONLY) {
+		int i;
+
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(&umem->frames[i][0]);
+	}
+
+	return umem;
+}
+
+static struct xdpsock *xsk_configure(struct xdp_umem *umem)
+{
+	struct sockaddr_xdp sxdp = {};
+	int sfd, ndescs = NUM_DESCS;
+	struct xdpsock *xsk;
+	bool shared = true;
+	u32 i;
+
+	sfd = socket(PF_XDP, SOCK_RAW, 0);
+	lassert(sfd >= 0);
+
+	xsk = calloc(1, sizeof(*xsk));
+	lassert(xsk);
+
+	xsk->sfd = sfd;
+	xsk->outstanding_tx = 0;
+
+	if (!umem) {
+		shared = false;
+		xsk->umem = xdp_umem_configure(sfd);
+	} else {
+		xsk->umem = umem;
+	}
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_QUEUE,
+			   &ndescs, sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_QUEUE,
+			   &ndescs, sizeof(int)) == 0);
+
+	/* Rx */
+	xsk->rx.q = mmap(NULL,
+			 sizeof(struct xdp_queue) +
+			 NUM_DESCS * sizeof(struct xdp_desc),
+			 PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_POPULATE, sfd,
+			 XDP_PGOFF_RX_QUEUE);
+	lassert(xsk->rx.q != MAP_FAILED);
+
+	if (!shared) {
+		for (i = 0; i < NUM_DESCS / 2; i++)
+			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
+				== 0);
+	}
+
+	/* Tx */
+	xsk->tx.q = mmap(NULL,
+			 sizeof(struct xdp_queue) +
+			 NUM_DESCS * sizeof(struct xdp_desc),
+			 PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_POPULATE, sfd,
+			 XDP_PGOFF_TX_QUEUE);
+	lassert(xsk->tx.q != MAP_FAILED);
+
+	xsk->rx.mask = NUM_DESCS - 1;
+	xsk->rx.size = NUM_DESCS;
+
+	xsk->tx.mask = NUM_DESCS - 1;
+	xsk->tx.size = NUM_DESCS;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = opt_ifindex;
+	sxdp.sxdp_queue_id = opt_queue;
+	if (shared) {
+		sxdp.sxdp_flags = XDP_SHARED_UMEM;
+		sxdp.sxdp_shared_umem_fd = umem->fd;
+	}
+
+	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
+
+	return xsk;
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
+	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
+		printf("xdp-skb ");
+	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
+		printf("xdp-drv ");
+	else
+		printf("	");
+
+	if (opt_poll)
+		printf("poll() ");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void dump_stats(void)
+{
+	unsigned long stop_time = get_nsecs();
+	long dt = stop_time - start_time;
+	int i;
+
+	for (i = 0; i < num_socks; i++) {
+		double rx_pps = xsks[i]->rx_npkts * 1000000000. / dt;
+		double tx_pps = xsks[i]->tx_npkts * 1000000000. / dt;
+		char *fmt = "%-15s %'-11.0f %'-11lu\n";
+
+		printf("\n sock%d@", i);
+		print_benchmark(false);
+		printf("\n");
+
+		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
+		       dt / 1000000000.);
+		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
+		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
+	}
+}
+
+static void *poller(void *arg)
+{
+	(void)arg;
+	for (;;) {
+		sleep(1);
+		dump_stats();
+	}
+
+	return NULL;
+}
+
+static void int_exit(int sig)
+{
+	(void)sig;
+	dump_stats();
+	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
+	exit(EXIT_SUCCESS);
+}
+
+static struct option long_options[] = {
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"interface", required_argument, 0, 'i'},
+	{"queue", required_argument, 0, 'q'},
+	{"poll", no_argument, 0, 'p'},
+	{"shared-buffer", no_argument, 0, 's'},
+	{"xdp-skb", no_argument, 0, 'S'},
+	{"xdp-native", no_argument, 0, 'N'},
+	{0, 0, 0, 0}
+};
+
+static void usage(const char *prog)
+{
+	const char *str =
+		"  Usage: %s [OPTIONS]\n"
+		"  Options:\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"  -q, --queue=n	Use queue n (default 0)\n"
+		"  -p, --poll		Use poll syscall\n"
+		"  -s, --shared-buffer	Use shared packet buffer\n"
+		"  -S, --xdp-skb=n	Use XDP skb-mode\n"
+		"  -N, --xdp-native=n	Enforce XDP native mode\n"
+		"\n";
+	fprintf(stderr, str, prog);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "rtli:q:psSN", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		case 'q':
+			opt_queue = atoi(optarg);
+			break;
+		case 's':
+			opt_shared_packet_buffer = 1;
+			break;
+		case 'p':
+			opt_poll = 1;
+			break;
+		case 'S':
+			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
+			break;
+		default:
+			usage(basename(argv[0]));
+		}
+	}
+
+	opt_ifindex = if_nametoindex(opt_if);
+	if (!opt_ifindex) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
+			opt_if);
+		usage(basename(argv[0]));
+	}
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
+		return;
+	lassert(0);
+}
+
+static inline void complete_tx_l2fwd(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+	size_t ndescs;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		 xsk->outstanding_tx;
+
+	/* re-add completed Tx buffers */
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
+	if (rcvd > 0) {
+		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static inline void complete_tx_only(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
+	if (rcvd > 0) {
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static void rx_drop(struct xdpsock *xsk)
+{
+	struct xdp_desc descs[BATCH_SIZE];
+	unsigned int rcvd, i;
+
+	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+	if (!rcvd)
+		return;
+
+	for (i = 0; i < rcvd; i++) {
+		u32 idx = descs[i].idx;
+
+		lassert(idx < NUM_FRAMES);
+#if DEBUG_HEXDUMP
+		char *pkt;
+		char buf[32];
+
+		pkt = xq_get_data(xsk, idx, descs[i].offset);
+		sprintf(buf, "idx=%d", idx);
+		hex_dump(pkt, descs[i].len, buf);
+#endif
+	}
+
+	xsk->rx_npkts += rcvd;
+
+	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
+}
+
+static void rx_drop_all(void)
+{
+	struct pollfd fds[MAX_SOCKS + 1];
+	int i, ret, timeout, nfds = 1;
+
+	memset(fds, 0, sizeof(fds));
+
+	for (i = 0; i < num_socks; i++) {
+		fds[i].fd = xsks[i]->sfd;
+		fds[i].events = POLLIN;
+		timeout = 1000; /* 1 second */
+	}
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+		}
+
+		for (i = 0; i < num_socks; i++)
+			rx_drop(xsks[i]);
+	}
+}
+
+static void gen_tx_descs(struct xdp_desc *descs, unsigned int idx,
+			 unsigned int ndescs)
+{
+	int i;
+
+	for (i = 0; i < ndescs; i++) {
+		descs[i].idx = idx + i;
+		descs[i].len = sizeof(pkt_data) - 1;
+		descs[i].offset = 0;
+		descs[i].flags = 0;
+	}
+}
+
+static void tx_only(struct xdpsock *xsk)
+{
+	int timeout, ret, nfds = 1;
+	struct pollfd fds[nfds + 1];
+	unsigned int idx = 0;
+
+	memset(fds, 0, sizeof(fds));
+	fds[0].fd = xsk->sfd;
+	fds[0].events = POLLOUT;
+	timeout = 1000; /* 1 second */
+
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+
+			if (fds[0].fd != xsk->sfd ||
+			    !(fds[0].revents & POLLOUT))
+				continue;
+		}
+
+		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
+			gen_tx_descs(descs, idx, BATCH_SIZE);
+			lassert(xq_enq(&xsk->tx, descs, BATCH_SIZE) == 0);
+
+			xsk->outstanding_tx += BATCH_SIZE;
+			idx += BATCH_SIZE;
+			idx %= NUM_FRAMES;
+		}
+
+		complete_tx_only(xsk);
+	}
+}
+
+static void l2fwd(struct xdpsock *xsk)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			complete_tx_l2fwd(xsk);
+
+			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			char *pkt = xq_get_data(xsk, descs[i].idx,
+						descs[i].offset);
+
+			swap_mac_addresses(pkt);
+#if DEBUG_HEXDUMP
+			char buf[32];
+			u32 idx = descs[i].idx;
+
+			sprintf(buf, "idx=%d", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		xsk->rx_npkts += rcvd;
+
+		ret = xq_enq(&xsk->tx, descs, rcvd);
+		lassert(ret == 0);
+		xsk->outstanding_tx += rcvd;
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char xdp_filename[256];
+	int i, ret, key = 0;
+	pthread_t pt;
+
+	parse_command_line(argc, argv);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(xdp_filename)) {
+		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
+		fprintf(stderr, "ERROR: link set xdp fd failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
+	if (ret) {
+		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Create sockets... */
+	xsks[num_socks++] = xsk_configure(NULL);
+
+#if RR_LB
+	for (i = 0; i < MAX_SOCKS - 1; i++)
+		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
+#endif
+
+	/* ...and insert them into the map. */
+	for (i = 0; i < num_socks; i++) {
+		key = i;
+		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
+		if (ret) {
+			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+	signal(SIGABRT, int_exit);
+
+	setlocale(LC_ALL, "");
+
+	ret = pthread_create(&pt, NULL, poller, NULL);
+	lassert(ret == 0);
+
+	start_time = get_nsecs();
+
+	if (opt_bench == BENCH_RXDROP)
+		rx_drop_all();
+	else if (opt_bench == BENCH_TXONLY)
+		tx_only(xsks[0]);
+	else
+		l2fwd(xsks[0]);
+
+	return 0;
+}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (13 preceding siblings ...)
  2018-03-27 16:59 ` [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets Björn Töpel
@ 2018-03-28 21:18 ` Eric Leblond
  2018-03-29  6:16   ` Björn Töpel
  2018-04-09 21:51 ` William Tu
  15 siblings, 1 reply; 32+ messages in thread
From: Eric Leblond @ 2018-03-28 21:18 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, ravineet.singh

Hello,

On Tue, 2018-03-27 at 18:59 +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> 
optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
> RX
> 

...
> 
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently
> mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.

If I get it correctly, this feature will have to be used to bound
multiple sockets to a single queue and the eBPF filter will be
responsible of the load balancing. Am I correct ?

> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If
> the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
...

Thanks a lot for this work, I'm gonna try to implement this in
Suricata.

Best regards,

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-03-28 21:18 ` [RFC PATCH v2 00/14] Introducing AF_XDP support Eric Leblond
@ 2018-03-29  6:16   ` Björn Töpel
  2018-03-29 15:36     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 32+ messages in thread
From: Björn Töpel @ 2018-03-29  6:16 UTC (permalink / raw)
  To: Eric Leblond
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z, ravineet.singh

2018-03-28 23:18 GMT+02:00 Eric Leblond <eric@regit.org>:
> Hello,
>
> On Tue, 2018-03-27 at 18:59 +0200, Björn Töpel wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>>
> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This RFC only supports copy-mode for
>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
>> RX
>>
>
> ...
>>
>> How is then packets distributed between these two XSK? We have
>> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
>> full). The user-space application can place an XSK at an arbitrary
>> place in this map. The XDP program can then redirect a packet to a
>> specific index in this map and at this point XDP validates that the
>> XSK in that map was indeed bound to that device and queue number. If
>> not, the packet is dropped. If the map is empty at that index, the
>> packet is also dropped. This also means that it is currently
>> mandatory
>> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
>> to get any traffic to user space through the XSK.
>
> If I get it correctly, this feature will have to be used to bound
> multiple sockets to a single queue and the eBPF filter will be
> responsible of the load balancing. Am I correct ?
>

Exactly! The XDP program executing for a certain Rx queue will
distribute the packets to the socket(s) in the xskmap.
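
To make that concrete, here is a purely illustrative sketch (not part of
the patch set) of an XDP program that spreads packets from one Rx queue
over several sockets in the xskmap, using a toy hash on the source MAC
instead of the round-robin counter in xdpsock_kern.c. The qidconf check
from the sample is omitted for brevity, and MAX_SOCKS comes from the
sample's xdpsock.h:

#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include "bpf_helpers.h"
#include "xdpsock.h"

struct bpf_map_def SEC("maps") xsks_map = {
	.type		= BPF_MAP_TYPE_XSKMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= MAX_SOCKS,
};

SEC("xdp_sock")
int xdp_sock_lb(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	int idx;

	if (data + sizeof(*eth) > data_end)
		return XDP_DROP;

	/* Toy load balancing: pick a socket from the low bits of the
	 * source MAC. MAX_SOCKS is a power of two in the sample. */
	idx = eth->h_source[5] & (MAX_SOCKS - 1);

	return bpf_redirect_map(&xsks_map, idx, 0);
}

char _license[] SEC("license") = "GPL";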

>> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If
>> the
>> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> ...
>
> Thanks a lot for this work, I'm gonna try to implement this in
> Suricata.
>

Thanks for trying it out! All input is very much appreciated
(clunkiness of API, crashes...)!


Björn

> Best regards,
> --
> Eric Leblond

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-03-29  6:16   ` Björn Töpel
@ 2018-03-29 15:36     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-29 15:36 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Eric Leblond, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z, ravineet.singh, brouer



On Thu, 29 Mar 2018 08:16:23 +0200 Björn Töpel <bjorn.topel@gmail.com> wrote:

> 2018-03-28 23:18 GMT+02:00 Eric Leblond <eric@regit.org>:
> > Hello,
> >
> > On Tue, 2018-03-27 at 18:59 +0200, Björn Töpel wrote:  
> >> From: Björn Töpel <bjorn.topel@intel.com>
> >>
> >>  
> > optimized for high performance packet processing and, in upcoming  
> >> patch sets, zero-copy semantics. In this v2 version, we have removed
> >> all zero-copy related code in order to make it smaller, simpler and
> >> hopefully more review friendly. This RFC only supports copy-mode for
> >> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
> >> RX
> >>  
> >
> > ...  
> >>
> >> How is then packets distributed between these two XSK? We have
> >> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> >> full). The user-space application can place an XSK at an arbitrary
> >> place in this map. The XDP program can then redirect a packet to a
> >> specific index in this map and at this point XDP validates that the
> >> XSK in that map was indeed bound to that device and queue number. If
> >> not, the packet is dropped. If the map is empty at that index, the
> >> packet is also dropped. This also means that it is currently
> >> mandatory
> >> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> >> to get any traffic to user space through the XSK.  
> >
> > If I get it correctly, this feature will have to be used to bound
> > multiple sockets to a single queue and the eBPF filter will be
> > responsible of the load balancing. Am I correct ?
> >  
> 
> Exactly! The XDP program executing for a certain Rx queue will
> distribute the packets to the socket(s) in the xskmap.

It is important to understand that we (want/need to) maintain a Single
Producer Single Consumer (SPSC) scenario here, for performance reasons.

This _is_ maintained in this patchset AFAIK. But as the API user, you
have to understand that the responsibility of aligning this is yours!
If you don't, the frames are dropped (silently).
  The BPF programmer MUST select the correct XSKMAP index, such that
ctx->rx_queue_index matches the queue_id registered in the xdp_sock
(and the bound ifindex also matches).


Bjørn, Magnus and I have discussed other API options. E.g. where the
XSKMAP index _is_ the rx_queue_index, and BPF programmer is not allowed
select another index.  We settled on the API in the patchset, where BPF
programmer get more freedom, and can select an invalid index, that
cause packets to be dropped.

An advantage of this API is that we allow one RX-queue, to multiplex
into many xdp_sock's (all bound to this same RX-queue and ifindex).
This still maintain a Single Producer, as the RX-queue just have a
Single Producer relationship's with each xdp_sock.


I imagine that Suricata/Eric will want to capture all the RX-queues on
the net_device.  For this to happen, he needs to create an xdp_sock per
RX-queue, and have a side bpf-map that assists in the XSKMAP lookup, or
simply populate the XSKMAP to correspond to the rx_queue_index.
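
Purely as an illustration (not part of the patch set), the "XSKMAP index
equals rx_queue_index" layout could look something like the sketch below
on the BPF side; MAX_RXQS is an arbitrary illustrative bound, and user
space is assumed to have bound one xdp_sock per (ifindex, queue) and
inserted it at the matching index:

#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define MAX_RXQS 64	/* illustrative only */

struct bpf_map_def SEC("maps") xsks_map = {
	.type		= BPF_MAP_TYPE_XSKMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= MAX_RXQS,
};

SEC("xdp_sock")
int xdp_sock_per_rxq(struct xdp_md *ctx)
{
	/* If no socket bound to this (ifindex, queue) sits at this index,
	 * the redirect drops the packet, as described above. */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";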

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
                   ` (14 preceding siblings ...)
  2018-03-28 21:18 ` [RFC PATCH v2 00/14] Introducing AF_XDP support Eric Leblond
@ 2018-04-09 21:51 ` William Tu
  2018-04-10  6:47   ` Björn Töpel
  15 siblings, 1 reply; 32+ messages in thread
From: William Tu @ 2018-04-09 21:51 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	qi.z.zhang, ravineet.singh

On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work is
> already on the mailing list for review. We will publish our zero-copy
> support for RX and TX on top of his patch sets at a later point in
> time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of
> a number of equally size frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id. The user space allocates memory for this
> UMEM using whatever means it feels is most appropriate (malloc, mmap,
> huge pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The fill queue is used by the
> application to send down frame ids for the kernel to fill in with RX
> packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The completion queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
Can we register a UMEM to multiple device's queue?

So far the l2fwd sample code is sending/receiving from the same
queue. I'm thinking about forwarding packets from one device to another.
Now I'm copying packets from one device's RX desc to another device's TX
completion queue. But this introduces one extra copy.

One way I can do is to call bpf_redirect helper function, but sometimes
I still need to process the packet in userspace.

I like this work!
Thanks a lot.
William

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-04-09 21:51 ` William Tu
@ 2018-04-10  6:47   ` Björn Töpel
  2018-04-10 14:14     ` William Tu
  0 siblings, 1 reply; 32+ messages in thread
From: Björn Töpel @ 2018-04-10  6:47 UTC (permalink / raw)
  To: William Tu
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain, Zhang,
	Qi Z, ravineet.singh

2018-04-09 23:51 GMT+02:00 William Tu <u9012063@gmail.com>:
> On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This RFC introduces a new address family called AF_XDP that is
>> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This RFC only supports copy-mode for
>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>> changes that Jesper Dangaard Brouer is working on. Some of his work is
>> already on the mailing list for review. We will publish our zero-copy
>> support for RX and TX on top of his patch sets at a later point in
>> time.
>>
>> An AF_XDP socket (XSK) is created with the normal socket()
>> syscall. Associated with each XSK are two queues: the RX queue and the
>> TX queue. A socket can receive packets on the RX queue and it can send
>> packets on the TX queue. These queues are registered and sized with
>> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
>> mandatory to have at least one of these queues for each socket. In
>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>> packet buffers. An RX or TX descriptor points to a data buffer in a
>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>> packet does not have to be copied between RX and TX. Moreover, if a
>> packet needs to be kept for a while due to a possible retransmit, the
>> descriptor that points to that packet can be changed to point to
>> another and reused right away. This again avoids copying data.
>>
>> This new dedicated packet buffer area is called a UMEM. It consists of
>> a number of equally size frames and each frame has a unique frame
>> id. A descriptor in one of the queues references a frame by
>> referencing its frame id. The user space allocates memory for this
>> UMEM using whatever means it feels is most appropriate (malloc, mmap,
>> huge pages, etc). This memory area is then registered with the kernel
>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
>> the FILL queue and the COMPLETION queue. The fill queue is used by the
>> application to send down frame ids for the kernel to fill in with RX
>> packet data. References to these frames will then appear in the RX
>> queue of the XSK once they have been received. The completion queue,
>> on the other hand, contains frame ids that the kernel has transmitted
>> completely and can now be used again by user space, for either TX or
>> RX. Thus, the frame ids appearing in the completion queue are ids that
>> were previously transmitted using the TX queue. In summary, the RX and
>> FILL queues are used for the RX path and the TX and COMPLETION queues
>> are used for the TX path.
>>
> Can we register a UMEM to multiple device's queue?
>

No, one UMEM, one netdev queue in this RFC. That being said, there's
nothing stopping a user from creating an additional UMEM, say UMEM',
pointing to the same memory as UMEM, but bound to another
netdev/queue. Note that the user space application has to make sure
that the buffer handling is sane (user/kernel frame ownership).

We used to allow to share UMEM between unrelated sockets, but after
the introduction of the UMEM queues (fill/completion) that's not the
case any more. For the zero-copy scenario, having to manage multiple
DMA mappings per UMEM was a bit of a mess, so we went for the simpler
(current) solution with one UMEM per netdev/queue.

> So far the l2fwd sample code is sending/receiving from the same
> queue. I'm thinking about forwarding packets from one device to another.
> Now I'm copying packets from one device's RX desc to another device's TX
> completion queue. But this introduces one extra copy.
>

So you've setup two identical UMEMs? Then you can just forward the
incoming Rx descriptor to the other netdev's Tx queue. Note, that you
only need to copy the descriptor, not the actual frame data.

> One way I can do is to call bpf_redirect helper function, but sometimes
> I still need to process the packet in userspace.
>
> I like this work!
> Thanks a lot.

Happy to hear that, and thanks a bunch for trying it out. Keep that
feedback coming!


Björn

> William

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-04-10  6:47   ` Björn Töpel
@ 2018-04-10 14:14     ` William Tu
  2018-04-11 12:17       ` Björn Töpel
  0 siblings, 1 reply; 32+ messages in thread
From: William Tu @ 2018-04-10 14:14 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain, Zhang,
	Qi Z, ravineet.singh

On Mon, Apr 9, 2018 at 11:47 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2018-04-09 23:51 GMT+02:00 William Tu <u9012063@gmail.com>:
>> On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> This RFC introduces a new address family called AF_XDP that is
>>> optimized for high performance packet processing and, in upcoming
>>> patch sets, zero-copy semantics. In this v2 version, we have removed
>>> all zero-copy related code in order to make it smaller, simpler and
>>> hopefully more review friendly. This RFC only supports copy-mode for
>>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>>> changes that Jesper Dangaard Brouer is working on. Some of his work is
>>> already on the mailing list for review. We will publish our zero-copy
>>> support for RX and TX on top of his patch sets at a later point in
>>> time.
>>>
>>> An AF_XDP socket (XSK) is created with the normal socket()
>>> syscall. Associated with each XSK are two queues: the RX queue and the
>>> TX queue. A socket can receive packets on the RX queue and it can send
>>> packets on the TX queue. These queues are registered and sized with
>>> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
>>> mandatory to have at least one of these queues for each socket. In
>>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>>> packet buffers. An RX or TX descriptor points to a data buffer in a
>>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>>> packet does not have to be copied between RX and TX. Moreover, if a
>>> packet needs to be kept for a while due to a possible retransmit, the
>>> descriptor that points to that packet can be changed to point to
>>> another and reused right away. This again avoids copying data.
>>>
>>> This new dedicated packet buffer area is called a UMEM. It consists of
>>> a number of equally size frames and each frame has a unique frame
>>> id. A descriptor in one of the queues references a frame by
>>> referencing its frame id. The user space allocates memory for this
>>> UMEM using whatever means it feels is most appropriate (malloc, mmap,
>>> huge pages, etc). This memory area is then registered with the kernel
>>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
>>> the FILL queue and the COMPLETION queue. The fill queue is used by the
>>> application to send down frame ids for the kernel to fill in with RX
>>> packet data. References to these frames will then appear in the RX
>>> queue of the XSK once they have been received. The completion queue,
>>> on the other hand, contains frame ids that the kernel has transmitted
>>> completely and can now be used again by user space, for either TX or
>>> RX. Thus, the frame ids appearing in the completion queue are ids that
>>> were previously transmitted using the TX queue. In summary, the RX and
>>> FILL queues are used for the RX path and the TX and COMPLETION queues
>>> are used for the TX path.
>>>
>> Can we register a UMEM to multiple device's queue?
>>
>
> No, one UMEM, one netdev queue in this RFC. That being said, there's
> nothing stopping a user from creating an additional UMEM, say UMEM',
> pointing to the same memory as UMEM, but bound to another
> netdev/queue. Note that the user space application has to make sure
> that the buffer handling is sane (user/kernel frame ownership).
>
> We used to allow to share UMEM between unrelated sockets, but after
> the introduction of the UMEM queues (fill/completion) that's not the
> case any more. For the zero-copy scenario, having to manage multiple
> DMA mappings per UMEM was a bit of a mess, so we went for the simpler
> (current) solution with one UMEM per netdev/queue.
>
>> So far the l2fwd sample code is sending/receiving from the same
>> queue. I'm thinking about forwarding packets from one device to another.
>> Now I'm copying packets from one device's RX desc to another device's TX
>> completion queue. But this introduces one extra copy.
>>
>
> So you've setup two identical UMEMs? Then you can just forward the
> incoming Rx descriptor to the other netdev's Tx queue. Note, that you
> only need to copy the descriptor, not the actual frame data.
>

Thanks!
I will give it a try, I guess you're saying I can do below:

int sfd1; // for device1
int sfd2; // for device2
...
// create 2 umem
umem1 = calloc(1, sizeof(*umem));
umem2 = calloc(1, sizeof(*umem));

// allocate 1 shared buffer, 1 xdp_umem_reg
posix_memalign(&bufs, ...)
mr.addr = (__u64)bufs; // shared for umem1,2
...

// umem reg the same mr
setsockopt(sfd1, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
setsockopt(sfd2, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))

// setup fill, completion, mmap for sfd1 and sfd2
...

Since both device can put frame data in 'bufs', I only need to copy
the descs between 2 umem1 and umem2. Am I understand correct?

Regards,
William

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-04-10 14:14     ` William Tu
@ 2018-04-11 12:17       ` Björn Töpel
  2018-04-11 18:43         ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Björn Töpel @ 2018-04-11 12:17 UTC (permalink / raw)
  To: William Tu
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain, Zhang,
	Qi Z, ravineet.singh

2018-04-10 16:14 GMT+02:00 William Tu <u9012063@gmail.com>:
> On Mon, Apr 9, 2018 at 11:47 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:

[...]

>>>
>>
>> So you've setup two identical UMEMs? Then you can just forward the
>> incoming Rx descriptor to the other netdev's Tx queue. Note, that you
>> only need to copy the descriptor, not the actual frame data.
>>
>
> Thanks!
> I will give it a try, I guess you're saying I can do below:
>
> int sfd1; // for device1
> int sfd2; // for device2
> ...
> // create 2 umem
> umem1 = calloc(1, sizeof(*umem));
> umem2 = calloc(1, sizeof(*umem));
>
> // allocate 1 shared buffer, 1 xdp_umem_reg
> posix_memalign(&bufs, ...)
> mr.addr = (__u64)bufs; // shared for umem1,2
> ...
>
> // umem reg the same mr
> setsockopt(sfd1, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
> setsockopt(sfd2, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
>
> // setup fill, completion, mmap for sfd1 and sfd2
> ...
>
> Since both device can put frame data in 'bufs', I only need to copy
> the descs between 2 umem1 and umem2. Am I understand correct?
>

Yup, spot on! umem1 and umem2 have the same layout/index "address
space", so you can just forward the descriptors and never touch the
data.
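
For what it is worth, a rough, untested sketch of such a forwarding
loop, reusing the helpers from the xdpsock_user.c sample in this thread
(xq_deq/xq_enq, umem_complete_from_kernel/umem_fill_to_kernel, kick_tx),
could look like the below. xsk_rx and xsk_tx are assumed to be two
sockets set up as in your outline, with both UMEM registrations covering
the same buffer so the frame ids mean the same thing on both sides;
error and ownership handling is omitted:

static void fwd_two_devs(struct xdpsock *xsk_rx, struct xdpsock *xsk_tx)
{
	struct xdp_desc descs[BATCH_SIZE];
	u32 done[BATCH_SIZE];
	unsigned int rcvd;
	size_t done_n;

	for (;;) {
		/* Frames the Tx device has completed go back into the Rx
		 * device's fill queue so they can be received into again. */
		if (xsk_tx->outstanding_tx) {
			kick_tx(xsk_tx->sfd);
			done_n = umem_complete_from_kernel(&xsk_tx->umem->cq,
							   done, BATCH_SIZE);
			if (done_n > 0) {
				umem_fill_to_kernel(&xsk_rx->umem->fq,
						    done, done_n);
				xsk_tx->outstanding_tx -= done_n;
			}
		}

		rcvd = xq_deq(&xsk_rx->rx, descs, BATCH_SIZE);
		if (!rcvd)
			continue;

		/* The same frame ids are valid on both sockets, so only the
		 * descriptors move; the packet data is never copied. */
		lassert(xq_enq(&xsk_tx->tx, descs, rcvd) == 0);
		xsk_tx->outstanding_tx += rcvd;
		xsk_rx->rx_npkts += rcvd;
	}
}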

In the current RFC you are required to create both an Rx and Tx queue
to bind the socket, which is just weird for your "Rx on one device, Tx
to another" scenario. I'll fix that in the next RFC.


Björn

> Regards,
> William

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-04-11 12:17       ` Björn Töpel
@ 2018-04-11 18:43         ` Alexei Starovoitov
  2018-04-12 14:14           ` Björn Töpel
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2018-04-11 18:43 UTC (permalink / raw)
  To: Björn Töpel, William Tu
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Jesper Dangaard Brouer, Willem de Bruijn,
	Daniel Borkmann, Linux Kernel Network Developers,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, Zhang, Qi Z, ravineet.singh

On 4/11/18 5:17 AM, Björn Töpel wrote:
>
> In the current RFC you are required to create both an Rx and Tx queue
> to bind the socket, which is just weird for your "Rx on one device, Tx
> to another" scenario. I'll fix that in the next RFC.

I would defer on adding new features until the key functionality lands.
imo it's in good shape and I would submit it without RFC tag as soon as
net-next reopens.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-03-27 16:59 ` [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap Björn Töpel
@ 2018-04-12  2:15   ` Michael S. Tsirkin
  2018-04-12  7:38     ` Karlsson, Magnus
  0 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2018-04-12  2:15 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, ravineet.singh

On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> @@ -30,4 +31,18 @@ struct xdp_umem_reg {
>  	__u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
> +
> +struct xdp_queue {
> +	__u32 head_idx __attribute__((aligned(64)));
> +	__u32 tail_idx __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_queue {
> +	struct xdp_queue ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */

So IIUC it's a head/tail ring of 32 bit descriptors.

In my experience (from implementing ptr_ring) this
implies that head/tail cache lines bounce a lot between
CPUs. Caching will help some. You are also forced to
use barriers to check validity which is slow on
some architectures.

If instead you can use a special descriptor value (e.g. 0) as
a valid signal, things work much better:

- you read descriptor atomically, if it's not 0 it's fine
- same with write - write 0 to pass it to the other side
- there is a data dependency so no need for barriers (except on dec alpha)
- no need for power of 2 limitations, you can make it any size you like
- easy to resize too

architecture (if not implementation) would be shared with ptr_ring
so some of the optimization ideas like batched updates could
be lifted from there.

When I was building ptr_ring, any head/tail design underperformed
storing valid flag with data itself. YMMV.
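
For illustration only, a toy version of such a ring (not the actual
ptr_ring code) could look like the below; slot value 0 is reserved as
the "empty" marker, and READ_ONCE/WRITE_ONCE stand in for whatever
accessors the real code would use:

struct val_ring {
	unsigned int size;	/* no power-of-2 requirement */
	unsigned int prod;	/* private to the producer */
	unsigned int cons;	/* private to the consumer */
	__u32 *slots;		/* slot value 0 means "empty" */
};

/* Producer: a slot is free once the consumer has zeroed it. */
static inline int vr_produce(struct val_ring *r, __u32 id)
{
	if (READ_ONCE(r->slots[r->prod]) != 0)
		return -ENOSPC;		/* ring full */
	/* Real code must publish the frame contents before the id
	 * (release store, or the data dependency mentioned above). */
	WRITE_ONCE(r->slots[r->prod], id);
	if (++r->prod == r->size)
		r->prod = 0;
	return 0;
}

/* Consumer: a non-zero value is valid by itself, no head/tail compare. */
static inline __u32 vr_consume(struct val_ring *r)
{
	__u32 id = READ_ONCE(r->slots[r->cons]);

	if (id == 0)
		return 0;			/* ring empty */
	WRITE_ONCE(r->slots[r->cons], 0);	/* hand the slot back */
	if (++r->cons == r->size)
		r->cons = 0;
	return id;
}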

-- 
MST

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-04-12  2:15   ` Michael S. Tsirkin
@ 2018-04-12  7:38     ` Karlsson, Magnus
  2018-04-12  8:54       ` Jesper Dangaard Brouer
  2018-04-12 14:04       ` Michael S. Tsirkin
  0 siblings, 2 replies; 32+ messages in thread
From: Karlsson, Magnus @ 2018-04-12  7:38 UTC (permalink / raw)
  To: Michael S. Tsirkin, Björn Töpel
  Cc: Duyck, Alexander H, alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, netdev, michael.lundkvist,
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z, ravineet.singh



> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Thursday, April 12, 2018 4:16 AM
> To: Björn Töpel <bjorn.topel@gmail.com>
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> >  	__u32 frame_headroom; /* Frame head room */  };
> >
> > +/* Pgoff for mmaping the rings */
> > +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
> > +
> > +struct xdp_queue {
> > +	__u32 head_idx __attribute__((aligned(64)));
> > +	__u32 tail_idx __attribute__((aligned(64))); };
> > +
> > +/* Used for the fill and completion queues for buffers */ struct
> > +xdp_umem_queue {
> > +	struct xdp_queue ptrs;
> > +	__u32 desc[0] __attribute__((aligned(64))); };
> > +
> >  #endif /* _LINUX_IF_XDP_H */
> 
> So IIUC it's a head/tail ring of 32 bit descriptors.
> 
> In my experience (from implementing ptr_ring) this implies that head/tail
> cache lines bounce a lot between CPUs. Caching will help some. You are also
> forced to use barriers to check validity which is slow on some architectures.
> 
> If instead you can use a special descriptor value (e.g. 0) as a valid signal,
> things work much better:
> 
> - you read descriptor atomically, if it's not 0 it's fine
> - same with write - write 0 to pass it to the other side
> - there is a data dependency so no need for barriers (except on dec alpha)
> - no need for power of 2 limitations, you can make it any size you like
> - easy to resize too
> 
> architecture (if not implementation) would be shared with ptr_ring so some
> of the optimization ideas like batched updates could be lifted from there.
> 
> When I was building ptr_ring, any head/tail design underperformed storing
> valid flag with data itself. YMMV.
> 
> --
> MST

I think you are definitely right in that there are ways in which
we can improve performance here. That said, the current queue
performs slightly better than the previous one we had that was
more or less a copy of one of your first virtio 1.1 proposals
from a little over a year ago. It had bidirectional queues and a
valid flag in the descriptor itself. The reason we abandoned this
was not poor performance (it was good), but a need to go to
unidirectional queues. Maybe I should have only changed that
aspect and kept the valid flag.

Anyway, I will take a look at ptr_ring and run some experiments
along the lines of what you propose to get some
numbers. Considering your experience with these kind of
structures, you are likely right. I just need to convince
myself :-).

/Magnus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-04-12  7:38     ` Karlsson, Magnus
@ 2018-04-12  8:54       ` Jesper Dangaard Brouer
  2018-04-12 14:04       ` Michael S. Tsirkin
  1 sibling, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-12  8:54 UTC (permalink / raw)
  To: Karlsson, Magnus
  Cc: Michael S. Tsirkin, Björn Töpel, Duyck, Alexander H,
	alexander.duyck, john.fastabend, ast, willemdebruijn.kernel,
	daniel, netdev, michael.lundkvist, Brandeburg, Jesse, Singhai,
	Anjali, Zhang, Qi Z, ravineet.singh, brouer

On Thu, 12 Apr 2018 07:38:25 +0000
"Karlsson, Magnus" <magnus.karlsson@intel.com> wrote:

> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, April 12, 2018 4:16 AM
> > To: Björn Töpel <bjorn.topel@gmail.com>
> > Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Duyck, Alexander H
> > <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> > john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> > willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> > netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> > Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> > <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ravineet.singh@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> > 
> > On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:  
> > > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> > >  	__u32 frame_headroom; /* Frame head room */
> > >  };
> > >
> > > +/* Pgoff for mmaping the rings */
> > > +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x100000000
> > > +
> > > +struct xdp_queue {
> > > +	__u32 head_idx __attribute__((aligned(64)));
> > > +	__u32 tail_idx __attribute__((aligned(64)));
> > > +};
> > > +
> > > +/* Used for the fill and completion queues for buffers */
> > > +struct xdp_umem_queue {
> > > +	struct xdp_queue ptrs;
> > > +	__u32 desc[0] __attribute__((aligned(64)));
> > > +};
> > > +
> > >  #endif /* _LINUX_IF_XDP_H */
> > 
> > So IIUC it's a head/tail ring of 32 bit descriptors.
> > 
> > In my experience (from implementing ptr_ring) this implies that head/tail
> > cache lines bounce a lot between CPUs. Caching will help some. You are also
> > forced to use barriers to check validity which is slow on some architectures.
> > 
> > If instead you can use a special descriptor value (e.g. 0) as a valid signal,
> > things work much better:
> > 
> > - you read descriptor atomically, if it's not 0 it's fine
> > - same with write - write 0 to pass it to the other side
> > - there is a data dependency so no need for barriers (except on dec alpha)
> > - no need for power of 2 limitations, you can make it any size you like
> > - easy to resize too
> > 
> > architecture (if not implementation) would be shared with ptr_ring so some
> > of the optimization ideas like batched updates could be lifted from there.
> > 
> > When I was building ptr_ring, any head/tail design underperformed storing
> > valid flag with data itself. YMMV.

I fully agree with MST here; this is also my experience.  I even
dropped my own Array-based Lock-Free (ALF) queue implementation[1] in
favor of ptr_ring. (In the ALF queue I try to amortize this cost by
bulking, but this causes the queue to become non-wait-free.)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.
> 
> Anyway, I will take a look at ptr_ring and run some experiments
> along the lines of what you propose to get some
> numbers. Considering your experience with these kind of
> structures, you are likely right. I just need to convince
> myself :-).

When benchmarking, be careful that you don't measure the "wrong"
queue situation.  When doing this kind of "overload" benchmarking, you
will likely create a situation where the queue is always full (which
hopefully isn't a production use-case).  In the almost/always-full
queue situation, using the element values to sync on (as MST proposes)
will still cause the cache-line bouncing that we want to avoid.

MST has explained and addressed this situation for ptr_ring in:
 commit fb9de9704775 ("ptr_ring: batch ring zeroing")
 https://git.kernel.org/torvalds/c/fb9de9704775
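
(A rough illustration of that batching idea, transplanted to a ring of
frame ids where 0 means "empty": the consumer does not hand slots back
one by one, but releases a whole batch at a time, newest slot first, so
the producer cannot chase into the cache lines the consumer is still
working on. This is a sketch of the principle only -- the struct, names
and batch size below are assumptions, and the real ptr_ring code
differs in detail.)

#define RING_SIZE	512
#define ZERO_BATCH	16

struct batched_ring {
	unsigned int cons;		/* consumer cursor, private */
	unsigned int consumed;		/* slots taken but not yet handed back */
	unsigned int slot[RING_SIZE];	/* 0 == empty */
};

static unsigned int br_consume(struct batched_ring *r)
{
	unsigned int id = READ_ONCE(r->slot[r->cons]);
	unsigned int i;

	if (!id)
		return 0;			/* ring empty */

	if (++r->consumed == ZERO_BATCH) {
		/* Free the whole batch at once, newest slot first.  The
		 * producer is waiting on the oldest slot, so it cannot
		 * start reusing (and dirtying) this region until the
		 * consumer is completely done with it.
		 */
		for (i = 0; i < ZERO_BATCH; i++)
			WRITE_ONCE(r->slot[(r->cons + RING_SIZE - i) % RING_SIZE], 0);
		r->consumed = 0;
	}
	r->cons = (r->cons + 1) % RING_SIZE;
	return id;
}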

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets
  2018-03-27 16:59 ` [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets Björn Töpel
@ 2018-04-12 11:05   ` Jesper Dangaard Brouer
  2018-04-12 11:08     ` Karlsson, Magnus
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-12 11:05 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, willemdebruijn.kernel, daniel, netdev,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	ravineet.singh, Björn Töpel, brouer

On Tue, 27 Mar 2018 18:59:19 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> +static void dump_stats(void)
> +{
> +	unsigned long stop_time = get_nsecs();
> +	long dt = stop_time - start_time;
> +	int i;
> +
> +	for (i = 0; i < num_socks; i++) {
> +		double rx_pps = xsks[i]->rx_npkts * 1000000000. / dt;
> +		double tx_pps = xsks[i]->tx_npkts * 1000000000. / dt;
> +		char *fmt = "%-15s %'-11.0f %'-11lu\n";
> +
> +		printf("\n sock%d@", i);
> +		print_benchmark(false);
> +		printf("\n");
> +
> +		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> +		       dt / 1000000000.);
> +		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> +		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> +	}
> +}
> +
> +static void *poller(void *arg)
> +{
> +	(void)arg;
> +	for (;;) {
> +		sleep(1);
> +		dump_stats();
> +	}
> +
> +	return NULL;
> +}

You are printing the "pps" (packets per sec) as an average over the
entire test run... could you please change that to, at least also, show
a more up-to-date value, such as the rate since the last measurement?

The problem is that when you start the test, the first reading will be
too low, and it takes time to average out/up. For ixgbe, the first reading
will be zero, because the driver does a link-down+up, which stops my pktgen.

The second annoyance is that I like to change system/kernel settings
during the run and observe the effect, e.g. changing CPU sleep states
(via tuned-adm) during the test run, which I cannot do with this long
average.
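
(One possible shape of such a fix, sketched against the sample's
existing xsks[], num_socks and get_nsecs(); the MAX_SOCKS constant,
the function name and the output format below are assumptions, not
the actual xdpsock code.)

static unsigned long prev_time;		/* set to start_time at launch */
static unsigned long prev_rx_npkts[MAX_SOCKS];
static unsigned long prev_tx_npkts[MAX_SOCKS];

static void dump_interval_stats(void)
{
	unsigned long now = get_nsecs();
	long dt = now - prev_time;
	int i;

	for (i = 0; i < num_socks; i++) {
		/* Rate over the last interval only, so the numbers react
		 * immediately to link flaps or settings changed mid-run.
		 */
		double rx_pps = (xsks[i]->rx_npkts - prev_rx_npkts[i]) *
				1000000000. / dt;
		double tx_pps = (xsks[i]->tx_npkts - prev_tx_npkts[i]) *
				1000000000. / dt;

		printf("sock%d: rx %'-11.0f pps  tx %'-11.0f pps (last %.2f s)\n",
		       i, rx_pps, tx_pps, dt / 1000000000.);

		prev_rx_npkts[i] = xsks[i]->rx_npkts;
		prev_tx_npkts[i] = xsks[i]->tx_npkts;
	}
	prev_time = now;
}

Called from the existing poller() thread right after dump_stats(), this
would report both the long-run average and the instantaneous rate.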

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* RE: [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets
  2018-04-12 11:05   ` Jesper Dangaard Brouer
@ 2018-04-12 11:08     ` Karlsson, Magnus
  0 siblings, 0 replies; 32+ messages in thread
From: Karlsson, Magnus @ 2018-04-12 11:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel
  Cc: Duyck, Alexander H, alexander.duyck, john.fastabend, ast,
	willemdebruijn.kernel, daniel, netdev, michael.lundkvist,
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z, ravineet.singh,
	Topel, Bjorn



> -----Original Message-----
> From: Jesper Dangaard Brouer [mailto:brouer@redhat.com]
> Sent: Thursday, April 12, 2018 1:05 PM
> To: Björn Töpel <bjorn.topel@gmail.com>
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com; Topel, Bjorn <bjorn.topel@intel.com>;
> brouer@redhat.com
> Subject: Re: [RFC PATCH v2 14/14] samples/bpf: sample application for
> AF_XDP sockets
> 
> On Tue, 27 Mar 2018 18:59:19 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
> 
> > +static void dump_stats(void)
> > +{
> > +	unsigned long stop_time = get_nsecs();
> > +	long dt = stop_time - start_time;
> > +	int i;
> > +
> > +	for (i = 0; i < num_socks; i++) {
> > +		double rx_pps = xsks[i]->rx_npkts * 1000000000. / dt;
> > +		double tx_pps = xsks[i]->tx_npkts * 1000000000. / dt;
> > +		char *fmt = "%-15s %'-11.0f %'-11lu\n";
> > +
> > +		printf("\n sock%d@", i);
> > +		print_benchmark(false);
> > +		printf("\n");
> > +
> > +		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> > +		       dt / 1000000000.);
> > +		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> > +		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> > +	}
> > +}
> > +
> > +static void *poller(void *arg)
> > +{
> > +	(void)arg;
> > +	for (;;) {
> > +		sleep(1);
> > +		dump_stats();
> > +	}
> > +
> > +	return NULL;
> > +}
> 
> You are printing the "pps" (packets per sec) as an average over the entire
> test run... could you please change that to, at least also, have an more up-to-
> date value like between the last measurement?
> 
> The problem is that when you start the test, the first reading will be too low,
> and it takes time to average out/up. For ixgbe, first reading will be zero,
> because it does a link-down+up, which stops my pktgen.
> 
> The second annoyance is that I like to change system/kernel setting during
> the run, and observe the effect. E.g change CPU sleep states (via tuned-
> adm) during the test-run to see the effect, which I cannot with this long
> average.

Good points. Will fix.

/Magnus

> 
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-04-12  7:38     ` Karlsson, Magnus
  2018-04-12  8:54       ` Jesper Dangaard Brouer
@ 2018-04-12 14:04       ` Michael S. Tsirkin
  2018-04-12 15:19         ` Karlsson, Magnus
  1 sibling, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2018-04-12 14:04 UTC (permalink / raw)
  To: Karlsson, Magnus
  Cc: Björn Töpel, Duyck, Alexander H, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z, ravineet.singh

On Thu, Apr 12, 2018 at 07:38:25AM +0000, Karlsson, Magnus wrote:
> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.

Is there a summary about unidirectional queues anywhere?  I'm curious to
know whether there are any lessons here to be learned for virtio
or ptr_ring.

-- 
MST


* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
  2018-04-11 18:43         ` Alexei Starovoitov
@ 2018-04-12 14:14           ` Björn Töpel
  0 siblings, 0 replies; 32+ messages in thread
From: Björn Töpel @ 2018-04-12 14:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: William Tu, Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Jesper Dangaard Brouer, Willem de Bruijn,
	Daniel Borkmann, Linux Kernel Network Developers,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Anjali Singhai Jain, Zhang, Qi Z, ravineet.singh

2018-04-11 20:43 GMT+02:00 Alexei Starovoitov <ast@fb.com>:
> On 4/11/18 5:17 AM, Björn Töpel wrote:
>>
>>
>> In the current RFC you are required to create both an Rx and Tx
>> queue to bind the socket, which is just weird for your "Rx on one
>> device, Tx to another" scenario. I'll fix that in the next RFC.
>
> I would defer on adding new features until the key functionality
> lands.  imo it's in good shape and I would submit it without RFC tag
> as soon as net-next reopens.

Yes, makes sense. We're doing some ptr_ring-like vs head/tail
measurements, and depending on the result we'll send out a proper
patch when net-next is open again.

What tree should we target -- bpf-next or net-next?


Thanks!
Björn


* RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-04-12 14:04       ` Michael S. Tsirkin
@ 2018-04-12 15:19         ` Karlsson, Magnus
  2018-04-23 10:26           ` Karlsson, Magnus
  0 siblings, 1 reply; 32+ messages in thread
From: Karlsson, Magnus @ 2018-04-12 15:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Björn Töpel, Duyck, Alexander H, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z, ravineet.singh



> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Thursday, April 12, 2018 4:05 PM
> To: Karlsson, Magnus <magnus.karlsson@intel.com>
> Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> On Thu, Apr 12, 2018 at 07:38:25AM +0000, Karlsson, Magnus wrote:
> > I think you are definitely right in that there are ways in which we
> > can improve performance here. That said, the current queue performs
> > slightly better than the previous one we had that was more or less a
> > copy of one of your first virtio 1.1 proposals from little over a year
> > ago. It had bidirectional queues and a valid flag in the descriptor
> > itself. The reason we abandoned this was not poor performance (it was
> > good), but a need to go to unidirectional queues. Maybe I should have
> > only changed that aspect and kept the valid flag.
> 
> Is there a summary about unidirectional queues anywhere?  I'm curious to
> know whether there are any lessons here to be learned for virtio or ptr_ring.

I did a quick hack in which I used your ptr_ring for the fill queue instead of
our head/tail based one. In the corner cases (usually empty or usually full), there
is basically no difference. But for the case when the queue is always half full,
the ptr_ring implementation boosts the performance from 5.6 to 5.7 Mpps
(as there is no cache line bouncing in this case) on my system, which is slower
than Björn's system that was used for the numbers in the RFC.

So I think this should be implemented properly so we can get some real numbers,
especially since a 0.1 Mpps gain with copies will likely become much more with
zero-copy, as we are really chasing cycles there. We will get back a better
evaluation in a few days.

Thanks: Magnus

> --
> MST


* RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
  2018-04-12 15:19         ` Karlsson, Magnus
@ 2018-04-23 10:26           ` Karlsson, Magnus
  0 siblings, 0 replies; 32+ messages in thread
From: Karlsson, Magnus @ 2018-04-23 10:26 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: 'Björn Töpel',
	Duyck, Alexander H, 'alexander.duyck@gmail.com',
	'john.fastabend@gmail.com', 'ast@fb.com',
	'brouer@redhat.com',
	'willemdebruijn.kernel@gmail.com',
	'daniel@iogearbox.net', 'netdev@vger.kernel.org',
	'michael.lundkvist@ericsson.com',
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z,
	'ravineet.singh@ericsson.com'



> -----Original Message-----
> From: Karlsson, Magnus
> Sent: Thursday, April 12, 2018 5:20 PM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com
> Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, April 12, 2018 4:05 PM
> > To: Karlsson, Magnus <magnus.karlsson@intel.com>
> > Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> > <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> > john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> > willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> > netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> > Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> > <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ravineet.singh@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Thu, Apr 12, 2018 at 07:38:25AM +0000, Karlsson, Magnus wrote:
> > > I think you are definitely right in that there are ways in which we
> > > can improve performance here. That said, the current queue performs
> > > slightly better than the previous one we had that was more or less a
> > > copy of one of your first virtio 1.1 proposals from little over a
> > > year ago. It had bidirectional queues and a valid flag in the
> > > descriptor itself. The reason we abandoned this was not poor
> > > performance (it was good), but a need to go to unidirectional
> > > queues. Maybe I should have only changed that aspect and kept the valid
> flag.
> >
> > Is there a summary about unidirectional queues anywhere?  I'm curious
> > to know whether there are any lessons here to be learned for virtio or
> ptr_ring.
> 
> I did a quick hack in which I used your ptr_ring for the fill queue instead of
> our head/tail based one. In the corner cases (usually empty or usually full),
> there is basically no difference. But for the case when the queue is always
> half full, the ptr_ring implementation boosts the performance from 5.6 to 5.7
> Mpps (as there is no cache line bouncing in this case) on my system (slower
> than Björn's that was used for the numbers in the RFC).
> 
> So I think this should be implemented properly so we can get some real
> numbers.
> Especially since 0.1 Mpps with copies will likely become much more with
> zero-copy as we are really chasing cycles there. We will get back a better
> evaluation in a few days.
> 
> Thanks: Magnus
> 
> > --
> > MST

Hi Michael,

Sorry for the late reply. Been travelling. Björn and I have now
made a real implementation of the ptr_ring principles in the
af_xdp code. We just added a switch in bind (only for the purpose
of this test) to be able to pick what ring implementation to use
from the user space test program. The main difference between our
version of ptr_ring and our head/tail ring is that the ptr_ring
version uses the idx field to signal if the entry is available or
not (idx == 0 indicating empty descriptor) and that it does not
use the head and tail pointers at all. Even though it is not
a "ring of pointers" in our implementation, we will still call it
ptr_ring for the purpose of this mail.

In summary, our experiments show that the two rings perform the
same in our micro benchmarks when the queues are balanced and
rarely full or empty, but the head/tail version performs better
for RX when the queues are not perfectly balanced. Why is that?
We do not know exactly, but there are a number of differences
between a ptr_ring inside the kernel and one shared between user
and kernel space as in af_xdp.

* The user space descriptors have to be validated since we are
  communicating between user space and kernel space. This is done
  slightly differently for the two rings due to the batching below.

* The RX and TX rings have descriptors that are larger than one
  pointer, so we need barriers here even with ptr_ring. We cannot
  rely on an address dependency because the idx is a value, not a
  pointer.

* Batching is performed slightly differently in the two versions. We
  avoid touching head and tail for as long as possible. At worst it
  is once per batch, but it might be much less than that on the
  consumer side. The drawback of the accesses to the head/tail
  pointers is that each one usually ends up being a cache line
  bounce. With ptr_ring, the drawback is instead that there are
  always N writes (setting idx = 0) for a batch size of N. The good
  thing, though, is that these will not incur any cache line
  bouncing if the rings are balanced (well, they will be read by the
  producer at some point, but only once per traversal of the ring).

One thing to note is that we think the head/tail version provides
an easier-to-use user space interface, since the indexes start
from 0 instead of from 1 as in the ptr_ring case. With ptr_ring you
have to teach the user space application writer not to use index
0. With the head/tail version no such restriction is needed.
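
(To make the second and third points above concrete, here is a rough
sketch of how a multi-word RX/TX descriptor would have to be handled
in the idx == 0 scheme. The layout and names are illustrative, not the
actual AF_XDP descriptor format, and the user-space side would use the
equivalent compiler/CPU barriers instead of the kernel macros.)

struct xdp_desc_sketch {
	__u32 idx;	/* 0 == slot empty, so valid frame ids start at 1 */
	__u32 len;
	__u16 offset;
};

/* Producer: fill in the payload first, then publish the slot by
 * writing a non-zero idx.
 */
static void publish_desc(struct xdp_desc_sketch *d, __u32 id,
			 __u32 len, __u16 offset)
{
	d->len = len;
	d->offset = offset;
	smp_wmb();			/* payload before the valid flag */
	WRITE_ONCE(d->idx, id);
}

/* Consumer: idx is a value, not a pointer, so there is no address
 * dependency to order the following loads -- an explicit read
 * barrier is needed before touching len/offset.
 */
static int consume_desc(struct xdp_desc_sketch *d, __u32 *id,
			__u32 *len, __u16 *offset)
{
	*id = READ_ONCE(d->idx);
	if (!*id)
		return 0;		/* slot still empty */
	smp_rmb();
	*len = d->len;
	*offset = d->offset;
	WRITE_ONCE(d->idx, 0);		/* hand the slot back */
	return 1;
}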

Here are just some of the results for a workload where user space
is faster than kernel space. This is for the case in which the user
space program has no problem keeping up with the kernel.

head/tail 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,782,718   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,504,235   
tx              2,504,232   


ptr_ring 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,519,373   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,519,265   
tx              2,519,265   


ptr_ring with batch sizes calculated as in ptr_ring.h

  sock0@p3p2:16 rxdrop
                 pps        
rx              7,470,658   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,431,701   
tx              2,431,701   

/Magnus


end of thread

Thread overview: 32+ messages
2018-03-27 16:59 [RFC PATCH v2 00/14] Introducing AF_XDP support Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 01/14] net: initial AF_XDP skeleton Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 02/14] xsk: add user memory registration support sockopt Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap Björn Töpel
2018-04-12  2:15   ` Michael S. Tsirkin
2018-04-12  7:38     ` Karlsson, Magnus
2018-04-12  8:54       ` Jesper Dangaard Brouer
2018-04-12 14:04       ` Michael S. Tsirkin
2018-04-12 15:19         ` Karlsson, Magnus
2018-04-23 10:26           ` Karlsson, Magnus
2018-03-27 16:59 ` [RFC PATCH v2 04/14] xsk: add Rx queue setup and mmap support Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 05/14] xsk: add support for bind for Rx Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 06/14] xsk: add Rx receive functions and poll support Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 07/14] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 08/14] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 09/14] xsk: wire up XDP_SKB " Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 10/14] xsk: add umem completion queue support and mmap Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 11/14] xsk: add Tx queue setup and mmap support Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 12/14] xsk: support for Tx Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 13/14] xsk: statistics support Björn Töpel
2018-03-27 16:59 ` [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets Björn Töpel
2018-04-12 11:05   ` Jesper Dangaard Brouer
2018-04-12 11:08     ` Karlsson, Magnus
2018-03-28 21:18 ` [RFC PATCH v2 00/14] Introducing AF_XDP support Eric Leblond
2018-03-29  6:16   ` Björn Töpel
2018-03-29 15:36     ` Jesper Dangaard Brouer
2018-04-09 21:51 ` William Tu
2018-04-10  6:47   ` Björn Töpel
2018-04-10 14:14     ` William Tu
2018-04-11 12:17       ` Björn Töpel
2018-04-11 18:43         ` Alexei Starovoitov
2018-04-12 14:14           ` Björn Töpel
