* [PATCH bpf-next 00/15] Introducing AF_XDP support
@ 2018-04-23 13:56 Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 01/15] net: initial AF_XDP skeleton Björn Töpel
                   ` (17 more replies)
  0 siblings, 18 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. Compared to the earlier RFC versions,
we have removed all zero-copy related code in order to make the set
smaller, simpler and hopefully easier to review. This patch set only
supports copy-mode for the generic XDP path (XDP_SKB) for both RX and
TX, and copy-mode for RX using the XDP_DRV path. Zero-copy support
requires XDP and driver changes that Jesper Dangaard Brouer is working
on, some of which have already been accepted. We will publish our
zero-copy support for RX and TX on top of his patch sets at a later
point in time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3, these descriptor queues are separated
from the packet buffers. An RX or TX descriptor points to a data
buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another frame and reused right away. This again avoids copying data.
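
As a concrete sketch of that flow (error handling elided; AF_XDP,
SOL_XDP and the XDP_*_RING setsockopts are the values introduced by
this set, and the ring size is an arbitrary power of two):

      /* Minimal sketch: create an XSK and size its RX and TX rings.
       * AF_XDP (44) and SOL_XDP (283) are defined by this patch set.
       */
      #include <linux/if_xdp.h>
      #include <sys/socket.h>

      static int xsk_create(void)
      {
              int fd = socket(AF_XDP, SOCK_RAW, 0);
              int entries = 1024;     /* must be a power of two */

              setsockopt(fd, SOL_XDP, XDP_RX_RING, &entries, sizeof(entries));
              setsockopt(fd, SOL_XDP, XDP_TX_RING, &entries, sizeof(entries));
              return fd;
      }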

This new dedicated packet buffer area is called a UMEM. It consists of
a number of equally sized frames, and each frame has a unique frame
id. A descriptor in one of the queues references a frame by its frame
id. User space allocates the memory for a UMEM using whatever means it
finds most appropriate (malloc, mmap, huge pages, etc). This memory
area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains the ids of frames that the kernel has completely
transmitted; these can now be used again by user space, for either TX
or RX. Thus, the frame ids appearing in the completion queue are ids
that were previously transmitted using the TX queue. In summary, the
RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
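
A sketch of the UMEM setup on top of the socket above (error handling
elided; the completion ring setsockopt name is assumed to mirror
XDP_UMEM_FILL_RING, as that patch is not shown in this excerpt):

      /* Minimal sketch: register a UMEM and create its two rings. */
      #include <linux/if_xdp.h>
      #include <stdlib.h>
      #include <sys/socket.h>
      #include <unistd.h>

      #define NUM_FRAMES 4096
      #define FRAME_SIZE 2048         /* power of two, at least 2048 */

      static void umem_setup(int fd)
      {
              struct xdp_umem_reg mr;
              int entries = 1024;
              void *bufs;

              /* The memory area must be page aligned. */
              posix_memalign(&bufs, getpagesize(),
                             NUM_FRAMES * FRAME_SIZE);

              mr.addr = (__u64)(unsigned long)bufs;
              mr.len = NUM_FRAMES * FRAME_SIZE;
              mr.frame_size = FRAME_SIZE;
              mr.frame_headroom = 0;
              setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

              setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING,
                         &entries, sizeof(entries));
              /* Name assumed from the fill ring naming convention: */
              setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                         &entries, sizeof(entries));
      }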

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user space.
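
A sketch of the bind step, using the sockaddr_xdp layout added later
in this series (patch 05); queue 16 matches the ethtool example
further down:

      /* Minimal sketch: bind the XSK to one queue of one netdev. */
      #include <linux/if_xdp.h>
      #include <net/if.h>
      #include <sys/socket.h>

      static int xsk_bind_queue(int fd, const char *ifname, __u32 queue_id)
      {
              struct sockaddr_xdp sxdp = {};

              sxdp.sxdp_family = AF_XDP;
              sxdp.sxdp_ifindex = if_nametoindex(ifname);
              sxdp.sxdp_queue_id = queue_id;  /* e.g. 16 */

              return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
      }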

A new feature in this patch set is that the UMEM can be shared between
processes, if desired. A process that wants to do this simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call, and passes both its own newly created XSK and
the XSK of the process it would like to share the UMEM with. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, as these cannot be shared with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
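
A sketch of the shared-UMEM variant (a fragment: parent_fd, ifindex
and queue_id are assumed to describe the already bound parent XSK):

      /* Minimal sketch: a second XSK sharing the parent's UMEM. It only
       * creates its own RX ring; no XDP_UMEM_REG and no FILL/COMPLETION
       * rings of its own.
       */
      int other_fd = socket(AF_XDP, SOCK_RAW, 0);
      int entries = 1024;
      struct sockaddr_xdp sxdp = {};

      setsockopt(other_fd, SOL_XDP, XDP_RX_RING, &entries, sizeof(entries));

      sxdp.sxdp_family = AF_XDP;
      sxdp.sxdp_ifindex = ifindex;            /* same device ...        */
      sxdp.sxdp_queue_id = queue_id;          /* ... and same queue id  */
      sxdp.sxdp_flags = XDP_SHARED_UMEM;
      sxdp.sxdp_shared_umem_fd = parent_fd;   /* XSK that owns the UMEM */
      bind(other_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));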

How are packets then distributed between these two XSKs? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
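
A sketch of such a program, in the style of the included sample (map
and section names are illustrative):

      /* Minimal XDP program: redirect everything on this queue into the
       * XSK at index 0 of an XSKMAP. If no XSK is bound at that index,
       * the packet is dropped.
       */
      #include <linux/bpf.h>
      #include "bpf_helpers.h"

      struct bpf_map_def SEC("maps") xsks_map = {
              .type = BPF_MAP_TYPE_XSKMAP,
              .key_size = sizeof(int),
              .value_size = sizeof(int),
              .max_entries = 4,
      };

      SEC("xdp_sock")
      int xdp_sock_prog(struct xdp_md *ctx)
      {
              return bpf_redirect_map(&xsks_map, 0, 0);
      }

      char _license[] SEC("license") = "GPL";

User space then places the XSK at that index with
bpf_map_update_elem(), using the socket's file descriptor as the map
value.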

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed. This mode uses
SKBs together with the generic XDP support and copies out the data to
user space; it is a fallback that works for any network device. On the other
hand, if the driver has support for XDP, it will be used by the AF_XDP
code to provide better performance, but there is still a copy of the
data into user space.
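
For completeness, a sketch of selecting the mode explicitly when
attaching the program (a fragment: XDP_FLAGS_* are from
<linux/if_link.h>, bpf_set_link_xdp_fd() is the libbpf helper the
kernel samples use, and ifindex/prog_fd are assumed to be set up):

      /* Native mode; the request fails if the driver lacks XDP support: */
      bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_DRV_MODE);

      /* Generic mode; works with any network device: */
      bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_SKB_MODE);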

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, which we will enable AF_XDP on. Here, we use ethtool
for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is 8192 MB and, with
8 of those DIMMs in the system, we have 64 GB of total memory. The
compiler used is gcc version 5.4.0 20160609. The NIC is an
Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by commercial packet generator
hardware running at full 40 Gbit/s line rate.

AF_XDP performance 64 byte packets. Results from RFC V2 in parentheses.
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.9(3.0)   9.4(9.3)  
txpush       2.5(2.2)   NA*
l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.1(2.2)   3.3(3.1)  
l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)

* NA since we have no support for TX using the XDP_DRV infrastructure
  yet. This is left for a future patch set since it involves
  changes to the XDP NDOs. Some of this has been upstreamed by Jesper
  Dangaard Brouer.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32,921,521  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3,289,491   0

Changes from RFC V2:

* Optimizations and simplifications to the ring structures inspired by
  ptr_ring.h 
* Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
  consistent with AF_PACKET
* Support for only having an RX queue or a TX queue defined
* Some bug fixes and code cleanup

The structure of the patch set is as follows:

Patches 1-2: Basic socket and umem plumbing 
Patches 3-10: RX support together with the new XSKMAP
Patches 11-14: TX support
Patch 15: Sample application

We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
Clean up btf.h in uapi")

Questions:

* How to deal with cache alignment for uapi when different
  architectures can have different cache line sizes? We have just
  aligned it to 64 bytes for now, which works for many popular
  architectures, but not all. Please advise.

To do:

* Optimize performance

* Kernel selftest

Post-series plan:

* Kernel loadable module support for AF_XDP would be nice. It is
  unclear how to achieve this, though, since our XDP code depends on
  net/core.

* Support for AF_XDP sockets without an XDP program loaded. In this
  case all the traffic on a queue should go up to the user space socket.

* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
  XDP_PASS" for a tcpdump-like functionality.

* And of course getting to zero-copy support in small increments. 

Thanks: Björn and Magnus

Björn Töpel (8):
  net: initial AF_XDP skeleton
  xsk: add user memory registration support sockopt
  xsk: add Rx queue setup and mmap support
  xdp: introduce xdp_return_buff API
  xsk: add Rx receive functions and poll support
  bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  xsk: wire up XDP_DRV side of AF_XDP
  xsk: wire up XDP_SKB side of AF_XDP

Magnus Karlsson (7):
  xsk: add umem fill queue support and mmap
  xsk: add support for bind for Rx
  xsk: add umem completion queue support and mmap
  xsk: add Tx queue setup and mmap support
  xsk: support for Tx
  xsk: statistics support
  samples/bpf: sample application for AF_XDP sockets

 MAINTAINERS                         |   8 +
 include/linux/bpf.h                 |  26 +
 include/linux/bpf_types.h           |   3 +
 include/linux/filter.h              |   2 +-
 include/linux/socket.h              |   5 +-
 include/net/xdp.h                   |   1 +
 include/net/xdp_sock.h              |  46 ++
 include/uapi/linux/bpf.h            |   1 +
 include/uapi/linux/if_xdp.h         |  87 ++++
 kernel/bpf/Makefile                 |   3 +
 kernel/bpf/verifier.c               |   8 +-
 kernel/bpf/xskmap.c                 | 286 +++++++++++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/dev.c                      |  34 +-
 net/core/filter.c                   |  40 +-
 net/core/sock.c                     |  12 +-
 net/core/xdp.c                      |  15 +-
 net/xdp/Kconfig                     |   7 +
 net/xdp/Makefile                    |   2 +
 net/xdp/xdp_umem.c                  | 256 ++++++++++
 net/xdp/xdp_umem.h                  |  65 +++
 net/xdp/xdp_umem_props.h            |  23 +
 net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
 net/xdp/xsk_queue.c                 |  73 +++
 net/xdp/xsk_queue.h                 | 245 ++++++++++
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 32 files changed, 2945 insertions(+), 35 deletions(-)
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 kernel/bpf/xskmap.c
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

-- 
2.14.1


* [PATCH bpf-next 01/15] net: initial AF_XDP skeleton
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt Björn Töpel
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 MAINTAINERS                         |  8 ++++++++
 include/linux/socket.h              |  5 ++++-
 net/Kconfig                         |  1 +
 net/core/sock.c                     | 12 ++++++++----
 net/xdp/Kconfig                     |  7 +++++++
 security/selinux/hooks.c            |  4 +++-
 security/selinux/include/classmap.h |  4 +++-
 7 files changed, 34 insertions(+), 7 deletions(-)
 create mode 100644 net/xdp/Kconfig

diff --git a/MAINTAINERS b/MAINTAINERS
index fc812fb5857a..ff93d024e6c3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15405,6 +15405,14 @@ T:	git git://linuxtv.org/media_tree.git
 S:	Maintained
 F:	drivers/media/tuners/tuner-xc2028.*
 
+XDP SOCKETS (AF_XDP)
+M:	Björn Töpel <bjorn.topel@intel.com>
+M:	Magnus Karlsson <magnus.karlsson@intel.com>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	kernel/bpf/xskmap.c
+F:	net/xdp/
+
 XEN BLOCK SUBSYSTEM
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 M:	Roger Pau Monné <roger.pau@citrix.com>
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ea50f4a65816..7ed4713d5337 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -207,8 +207,9 @@ struct ucred {
 				 * PF_SMC protocol family that
 				 * reuses AF_INET address family
 				 */
+#define AF_XDP		44	/* XDP sockets			*/
 
-#define AF_MAX		44	/* For now.. */
+#define AF_MAX		45	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -257,6 +258,7 @@ struct ucred {
 #define PF_KCM		AF_KCM
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
+#define PF_XDP		AF_XDP
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -338,6 +340,7 @@ struct ucred {
 #define SOL_NFC		280
 #define SOL_KCM		281
 #define SOL_TLS		282
+#define SOL_XDP		283
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/net/Kconfig b/net/Kconfig
index 6fa1a4493b8c..86471a1c1ed4 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -59,6 +59,7 @@ source "net/tls/Kconfig"
 source "net/xfrm/Kconfig"
 source "net/iucv/Kconfig"
 source "net/smc/Kconfig"
+source "net/xdp/Kconfig"
 
 config INET
 	bool "TCP/IP networking"
diff --git a/net/core/sock.c b/net/core/sock.c
index b2c3db169ca1..e7d8b6c955c6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -226,7 +226,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX];
   x "AF_RXRPC" ,	x "AF_ISDN"     ,	x "AF_PHONET"   , \
   x "AF_IEEE802154",	x "AF_CAIF"	,	x "AF_ALG"      , \
   x "AF_NFC"   ,	x "AF_VSOCK"    ,	x "AF_KCM"      , \
-  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_MAX"
+  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_XDP"	, \
+  x "AF_MAX"
 
 static const char *const af_family_key_strings[AF_MAX+1] = {
 	_sock_locks("sk_lock-")
@@ -262,7 +263,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
   "rlock-AF_RXRPC" , "rlock-AF_ISDN"     , "rlock-AF_PHONET"   ,
   "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG"      ,
   "rlock-AF_NFC"   , "rlock-AF_VSOCK"    , "rlock-AF_KCM"      ,
-  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_MAX"
+  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_XDP"      ,
+  "rlock-AF_MAX"
 };
 static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_UNSPEC", "wlock-AF_UNIX"     , "wlock-AF_INET"     ,
@@ -279,7 +281,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_RXRPC" , "wlock-AF_ISDN"     , "wlock-AF_PHONET"   ,
   "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG"      ,
   "wlock-AF_NFC"   , "wlock-AF_VSOCK"    , "wlock-AF_KCM"      ,
-  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_MAX"
+  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_XDP"      ,
+  "wlock-AF_MAX"
 };
 static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_UNSPEC", "elock-AF_UNIX"     , "elock-AF_INET"     ,
@@ -296,7 +299,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_RXRPC" , "elock-AF_ISDN"     , "elock-AF_PHONET"   ,
   "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG"      ,
   "elock-AF_NFC"   , "elock-AF_VSOCK"    , "elock-AF_KCM"      ,
-  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_MAX"
+  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_XDP"      ,
+  "elock-AF_MAX"
 };
 
 /*
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
new file mode 100644
index 000000000000..90e4a7152854
--- /dev/null
+++ b/net/xdp/Kconfig
@@ -0,0 +1,7 @@
+config XDP_SOCKETS
+	bool "XDP sockets"
+	depends on BPF_SYSCALL
+	default n
+	help
+	  XDP sockets allow a channel between XDP programs and
+	  userspace applications.
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a19167..5c508d26b367 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1471,7 +1471,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 			return SECCLASS_QIPCRTR_SOCKET;
 		case PF_SMC:
 			return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+		case PF_XDP:
+			return SECCLASS_XDP_SOCKET;
+#if PF_MAX > 45
 #error New address family defined, please update this function.
 #endif
 		}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 7f0372426494..bd5fe0d3204a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -240,9 +240,11 @@ struct security_class_mapping secclass_map[] = {
 	  { "manage_subnet", NULL } },
 	{ "bpf",
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
+	{ "xdp_socket",
+	  { COMMON_SOCK_PERMS, NULL } },
 	{ NULL }
   };
 
-#if PF_MAX > 44
+#if PF_MAX > 45
 #error New address family defined, please update secclass_map.
 #endif
-- 
2.14.1


* [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 01/15] net: initial AF_XDP skeleton Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 16:18   ` Michael S. Tsirkin
                     ` (2 more replies)
  2018-04-23 13:56 ` [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap Björn Töpel
                   ` (15 subsequent siblings)
  17 siblings, 3 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

In this commit the base structure of the AF_XDP address family is set
up. Further, we introduce the ability to register a window of user
memory to the kernel via the XDP_UMEM_REG setsockopt syscall. The
memory window is viewed by an AF_XDP socket as a set of equally large
frames. After a user memory registration, all frames are "owned" by
the user application, and not the kernel.
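
The frame layout this implies: frame i of the registered window
starts at addr + i * frame_size, with nframes = len / frame_size. A
hypothetical helper (not part of the uapi) to resolve a frame id:

      /* Sketch: frame id -> start of frame in the registered window. */
      static inline void *umem_frame(const struct xdp_umem_reg *mr,
                                     __u32 idx)
      {
              return (void *)(unsigned long)
                     (mr->addr + (__u64)idx * mr->frame_size);
      }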

Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/if_xdp.h |  34 +++++++
 net/Makefile                |   1 +
 net/xdp/Makefile            |   2 +
 net/xdp/xdp_umem.c          | 237 ++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h          |  42 ++++++++
 net/xdp/xdp_umem_props.h    |  23 +++++
 net/xdp/xsk.c               | 223 +++++++++++++++++++++++++++++++++++++++++
 7 files changed, 562 insertions(+)
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
new file mode 100644
index 000000000000..41252135a0fe
--- /dev/null
+++ b/include/uapi/linux/if_xdp.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * if_xdp: XDP socket user-space interface
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#ifndef _LINUX_IF_XDP_H
+#define _LINUX_IF_XDP_H
+
+#include <linux/types.h>
+
+/* XDP socket options */
+#define XDP_UMEM_REG			3
+
+struct xdp_umem_reg {
+	__u64 addr; /* Start of packet data area */
+	__u64 len; /* Length of packet data area */
+	__u32 frame_size; /* Frame size */
+	__u32 frame_headroom; /* Frame head room */
+};
+
+#endif /* _LINUX_IF_XDP_H */
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..77aaddedbd29 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -85,3 +85,4 @@ obj-y				+= l3mdev/
 endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
+obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
new file mode 100644
index 000000000000..a5d736640a0f
--- /dev/null
+++ b/net/xdp/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
new file mode 100644
index 000000000000..bff058f5a769
--- /dev/null
+++ b/net/xdp/xdp_umem.c
@@ -0,0 +1,237 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/mm.h>
+
+#include "xdp_umem.h"
+
+#define XDP_UMEM_MIN_FRAME_SIZE 2048
+
+int xdp_umem_create(struct xdp_umem **umem)
+{
+	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
+
+	if (!(*umem))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void xdp_umem_unpin_pages(struct xdp_umem *umem)
+{
+	unsigned int i;
+
+	if (umem->pgs) {
+		for (i = 0; i < umem->npgs; i++)
+			put_page(umem->pgs[i]);
+
+		kfree(umem->pgs);
+		umem->pgs = NULL;
+	}
+}
+
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+{
+	if (umem->user) {
+		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
+		free_uid(umem->user);
+	}
+}
+
+static void xdp_umem_release(struct xdp_umem *umem)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned long diff;
+
+	if (umem->pgs) {
+		xdp_umem_unpin_pages(umem);
+
+		task = get_pid_task(umem->pid, PIDTYPE_PID);
+		put_pid(umem->pid);
+		if (!task)
+			goto out;
+		mm = get_task_mm(task);
+		put_task_struct(task);
+		if (!mm)
+			goto out;
+
+		diff = umem->size >> PAGE_SHIFT;
+
+		down_write(&mm->mmap_sem);
+		mm->pinned_vm -= diff;
+		up_write(&mm->mmap_sem);
+		mmput(mm);
+		umem->pgs = NULL;
+	}
+
+	xdp_umem_unaccount_pages(umem);
+out:
+	kfree(umem);
+}
+
+void xdp_put_umem(struct xdp_umem *umem)
+{
+	if (!umem)
+		return;
+
+	if (atomic_dec_and_test(&umem->users))
+		xdp_umem_release(umem);
+}
+
+static int xdp_umem_pin_pages(struct xdp_umem *umem)
+{
+	unsigned int gup_flags = FOLL_WRITE;
+	long npgs;
+	int err;
+
+	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
+	if (!umem->pgs)
+		return -ENOMEM;
+
+	npgs = get_user_pages(umem->address, umem->npgs,
+			      gup_flags, &umem->pgs[0], NULL);
+	if (npgs != umem->npgs) {
+		if (npgs >= 0) {
+			umem->npgs = npgs;
+			err = -ENOMEM;
+			goto out_pin;
+		}
+		err = npgs;
+		goto out_pgs;
+	}
+	return 0;
+
+out_pin:
+	xdp_umem_unpin_pages(umem);
+out_pgs:
+	kfree(umem->pgs);
+	umem->pgs = NULL;
+	return err;
+}
+
+static int xdp_umem_account_pages(struct xdp_umem *umem)
+{
+	unsigned long lock_limit, new_npgs, old_npgs;
+
+	if (capable(CAP_IPC_LOCK))
+		return 0;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	umem->user = get_uid(current_user());
+
+	do {
+		old_npgs = atomic_long_read(&umem->user->locked_vm);
+		new_npgs = old_npgs + umem->npgs;
+		if (new_npgs > lock_limit) {
+			free_uid(umem->user);
+			umem->user = NULL;
+			return -ENOBUFS;
+		}
+	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
+				     new_npgs) != old_npgs);
+	return 0;
+}
+
+static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
+{
+	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u64 addr = mr->addr, size = mr->len;
+	u64 nframes;
+	int size_chk, err;
+
+	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+		/* Strictly speaking we could support this, if:
+		 * - huge pages, or*
+		 * - using an IOMMU, or
+		 * - making sure the memory area is consecutive
+		 * but for now, we simply say "computer says no".
+		 */
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(frame_size))
+		return -EINVAL;
+
+	if (!PAGE_ALIGNED(addr)) {
+		/* Memory area has to be page size aligned. For
+		 * simplicity, this might change.
+		 */
+		return -EINVAL;
+	}
+
+	if ((addr + size) < addr)
+		return -EINVAL;
+
+	nframes = size / frame_size;
+	if (nframes == 0 || nframes > UINT_MAX)
+		return -EINVAL;
+
+	frame_headroom = ALIGN(frame_headroom, 64);
+
+	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	if (size_chk < 0)
+		return -EINVAL;
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+	umem->size = (size_t)size;
+	umem->address = (unsigned long)addr;
+	umem->props.frame_size = frame_size;
+	umem->props.nframes = nframes;
+	umem->frame_headroom = frame_headroom;
+	umem->npgs = size / PAGE_SIZE;
+	umem->pgs = NULL;
+	umem->user = NULL;
+
+	umem->frame_size_log2 = ilog2(frame_size);
+	umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
+	umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
+	atomic_set(&umem->users, 1);
+
+	err = xdp_umem_account_pages(umem);
+	if (err)
+		goto out;
+
+	err = xdp_umem_pin_pages(umem);
+	if (err)
+		goto out;
+	return 0;
+
+out:
+	put_pid(umem->pid);
+	return err;
+}
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
+{
+	int err;
+
+	if (!umem)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+
+	err = __xdp_umem_reg(umem, mr);
+
+	up_write(&current->mm->mmap_sem);
+	return err;
+}
+
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
new file mode 100644
index 000000000000..58714f4f7f25
--- /dev/null
+++ b/net/xdp/xdp_umem.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_H_
+#define XDP_UMEM_H_
+
+#include <linux/mm.h>
+#include <linux/if_xdp.h>
+
+#include "xdp_umem_props.h"
+
+struct xdp_umem {
+	struct page **pgs;
+	struct xdp_umem_props props;
+	u32 npgs;
+	u32 frame_headroom;
+	u32 nfpp_mask;
+	u32 nfpplog2;
+	u32 frame_size_log2;
+	struct user_struct *user;
+	struct pid *pid;
+	unsigned long address;
+	size_t size;
+	atomic_t users;
+};
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
+void xdp_put_umem(struct xdp_umem *umem);
+int xdp_umem_create(struct xdp_umem **umem);
+
+#endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
new file mode 100644
index 000000000000..77fb5daf29f3
--- /dev/null
+++ b/net/xdp/xdp_umem_props.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_PROPS_H_
+#define XDP_UMEM_PROPS_H_
+
+struct xdp_umem_props {
+	u32 frame_size;
+	u32 nframes;
+};
+
+#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
new file mode 100644
index 000000000000..19fc719cbe0d
--- /dev/null
+++ b/net/xdp/xsk.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP sockets
+ *
+ * AF_XDP sockets allow a channel between XDP programs and userspace
+ * applications.
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
+
+#include <linux/if_xdp.h>
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/socket.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+
+#include "xdp_umem.h"
+
+struct xdp_sock {
+	/* struct sock must be the first member of struct xdp_sock */
+	struct sock sk;
+	struct xdp_umem *umem;
+	/* Protects multiple processes in the control path */
+	struct mutex mutex;
+};
+
+static struct xdp_sock *xdp_sk(struct sock *sk)
+{
+	return (struct xdp_sock *)sk;
+}
+
+static int xsk_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct net *net;
+
+	if (!sk)
+		return 0;
+
+	net = sock_net(sk);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, sk->sk_prot, -1);
+	local_bh_enable();
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+
+	sk_refcnt_debug_release(sk);
+	sock_put(sk);
+
+	return 0;
+}
+
+static int xsk_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int err;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case XDP_UMEM_REG:
+	{
+		struct xdp_umem_reg mr;
+		struct xdp_umem *umem;
+
+		if (xs->umem)
+			return -EBUSY;
+
+		if (copy_from_user(&mr, optval, sizeof(mr)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		err = xdp_umem_create(&umem);
+		if (err) {
+			mutex_unlock(&xs->mutex);
+			return err;
+		}
+
+		err = xdp_umem_reg(umem, &mr);
+		if (err) {
+			kfree(umem);
+			mutex_unlock(&xs->mutex);
+			return err;
+		}
+
+		/* Make sure umem is ready before it can be seen by others */
+		smp_wmb();
+
+		xs->umem = umem;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -ENOPROTOOPT;
+}
+
+static struct proto xsk_proto = {
+	.name =		"XDP",
+	.owner =	THIS_MODULE,
+	.obj_size =	sizeof(struct xdp_sock),
+};
+
+static const struct proto_ops xsk_proto_ops = {
+	.family =	PF_XDP,
+	.owner =	THIS_MODULE,
+	.release =	xsk_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		sock_no_poll,
+	.ioctl =	sock_no_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	xsk_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	sock_no_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static void xsk_destruct(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		return;
+
+	xdp_put_umem(xs->umem);
+
+	sk_refcnt_debug_dec(sk);
+}
+
+static int xsk_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	struct sock *sk;
+	struct xdp_sock *xs;
+
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
+		return -EPERM;
+	if (sock->type != SOCK_RAW)
+		return -ESOCKTNOSUPPORT;
+
+	if (protocol)
+		return -EPROTONOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+
+	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
+	if (!sk)
+		return -ENOBUFS;
+
+	sock->ops = &xsk_proto_ops;
+
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_XDP;
+
+	sk->sk_destruct = xsk_destruct;
+	sk_refcnt_debug_inc(sk);
+
+	xs = xdp_sk(sk);
+	mutex_init(&xs->mutex);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, &xsk_proto, 1);
+	local_bh_enable();
+
+	return 0;
+}
+
+static const struct net_proto_family xsk_family_ops = {
+	.family = PF_XDP,
+	.create = xsk_create,
+	.owner	= THIS_MODULE,
+};
+
+static int __init xsk_init(void)
+{
+	int err;
+
+	err = proto_register(&xsk_proto, 0 /* no slab */);
+	if (err)
+		goto out;
+
+	err = sock_register(&xsk_family_ops);
+	if (err)
+		goto out_proto;
+
+	return 0;
+
+out_proto:
+	proto_unregister(&xsk_proto);
+out:
+	return err;
+}
+
+fs_initcall(xsk_init);
-- 
2.14.1


* [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 01/15] net: initial AF_XDP skeleton Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 23:16   ` Michael S. Tsirkin
  2018-04-23 23:21   ` Michael S. Tsirkin
  2018-04-23 13:56 ` [PATCH bpf-next 04/15] xsk: add Rx queue setup and mmap support Björn Töpel
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_FILL_RING. Using this socket option, the process can
ask the kernel to allocate a queue (ring buffer) and also mmap it
(XDP_UMEM_PGOFF_FILL_RING) into the process.

The queue is used to explicitly pass ownership of umem frames from the
user process to the kernel. These frames will in a later patch be
filled in with Rx packet data by the kernel.
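
A user-space sketch of producing to this ring (FQ_SIZE is the entries
value given to XDP_UMEM_FILL_RING, the fd is the XSK's socket, and the
barrier handling is simplified compared to what a real application
should do):

      #include <linux/if_xdp.h>
      #include <sys/mman.h>

      #define FQ_SIZE 1024

      static void fq_populate(int fd, __u32 nframes)
      {
              struct xdp_umem_ring *fq;
              __u32 i, prod;

              fq = mmap(NULL, sizeof(struct xdp_umem_ring) +
                        FQ_SIZE * sizeof(__u32),
                        PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                        fd, XDP_UMEM_PGOFF_FILL_RING);

              /* Single producer: only this process writes 'producer'. */
              prod = fq->ptrs.producer;
              for (i = 0; i < nframes; i++)
                      fq->desc[prod++ & (FQ_SIZE - 1)] = i; /* frame ids */

              __sync_synchronize(); /* publish descriptors before the bump */
              fq->ptrs.producer = prod;
      }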

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 15 +++++++++++
 net/xdp/Makefile            |  2 +-
 net/xdp/xdp_umem.c          |  5 ++++
 net/xdp/xdp_umem.h          |  2 ++
 net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
 7 files changed, 180 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 41252135a0fe..975661e1baca 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -23,6 +23,7 @@
 
 /* XDP socket options */
 #define XDP_UMEM_REG			3
+#define XDP_UMEM_FILL_RING		4
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -31,4 +32,18 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+/* Pgoff for mmaping the rings */
+#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
+
+struct xdp_ring {
+	__u32 producer __attribute__((aligned(64)));
+	__u32 consumer __attribute__((aligned(64)));
+};
+
+/* Used for the fill and completion queues for buffers */
+struct xdp_umem_ring {
+	struct xdp_ring ptrs;
+	__u32 desc[0] __attribute__((aligned(64)));
+};
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index a5d736640a0f..074fb2b2d51c 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1,2 +1,2 @@
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
 
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index bff058f5a769..6fc233e03f30 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct mm_struct *mm;
 	unsigned long diff;
 
+	if (umem->fq) {
+		xskq_destroy(umem->fq);
+		umem->fq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 58714f4f7f25..3086091aebdd 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -18,9 +18,11 @@
 #include <linux/mm.h>
 #include <linux/if_xdp.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem_props.h"
 
 struct xdp_umem {
+	struct xsk_queue *fq;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 19fc719cbe0d..bf6a1151df28 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -32,6 +32,7 @@
 #include <linux/netdevice.h>
 #include <net/sock.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem.h"
 
 struct xdp_sock {
@@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+{
+	struct xsk_queue *q;
+
+	if (entries == 0 || *queue || !is_power_of_2(entries))
+		return -EINVAL;
+
+	q = xskq_create(entries);
+	if (!q)
+		return -ENOMEM;
+
+	*queue = q;
+	return 0;
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return 0;
 	}
+	case XDP_UMEM_FILL_RING:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (!xs->umem)
+			return -EINVAL;
+
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->umem->fq;
+		err = xsk_init_queue(entries, q);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	default:
 		break;
 	}
@@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_mmap(struct file *file, struct socket *sock,
+		    struct vm_area_struct *vma)
+{
+	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct xdp_sock *xs = xdp_sk(sock->sk);
+	struct xsk_queue *q;
+	unsigned long pfn;
+	struct page *qpg;
+
+	if (!xs->umem)
+		return -EINVAL;
+
+	if (offset == XDP_UMEM_PGOFF_FILL_RING)
+		q = xs->umem->fq;
+	else
+		return -EINVAL;
+
+	qpg = virt_to_head_page(q->ring);
+	if (size > (PAGE_SIZE << compound_order(qpg)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn,
+			       size, vma->vm_page_prot);
+}
+
 static struct proto xsk_proto = {
 	.name =		"XDP",
 	.owner =	THIS_MODULE,
@@ -139,7 +199,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	sock_no_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
-	.mmap =		sock_no_mmap,
+	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
 };
 
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
new file mode 100644
index 000000000000..23da4f29d3fb
--- /dev/null
+++ b/net/xdp/xsk_queue.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/slab.h>
+
+#include "xsk_queue.h"
+
+static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
+{
+	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
+}
+
+struct xsk_queue *xskq_create(u32 nentries)
+{
+	struct xsk_queue *q;
+	gfp_t gfp_flags;
+	size_t size;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return NULL;
+
+	q->nentries = nentries;
+	q->ring_mask = nentries - 1;
+
+	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+		    __GFP_COMP  | __GFP_NORETRY;
+	size = xskq_umem_get_ring_size(q);
+
+	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
+						      get_order(size));
+	if (!q->ring) {
+		kfree(q);
+		return NULL;
+	}
+
+	return q;
+}
+
+void xskq_destroy(struct xsk_queue *q)
+{
+	if (!q)
+		return;
+
+	page_frag_free(q->ring);
+	kfree(q);
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
new file mode 100644
index 000000000000..7eb556bf73be
--- /dev/null
+++ b/net/xdp/xsk_queue.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XSK_QUEUE_H
+#define _LINUX_XSK_QUEUE_H
+
+#include <linux/types.h>
+#include <linux/if_xdp.h>
+
+#include "xdp_umem_props.h"
+
+struct xsk_queue {
+	struct xdp_umem_props umem_props;
+	u32 ring_mask;
+	u32 nentries;
+	u32 prod_head;
+	u32 prod_tail;
+	u32 cons_head;
+	u32 cons_tail;
+	struct xdp_ring *ring;
+	u64 invalid_descs;
+};
+
+struct xsk_queue *xskq_create(u32 nentries);
+void xskq_destroy(struct xsk_queue *q);
+
+#endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1


* [PATCH bpf-next 04/15] xsk: add Rx queue setup and mmap support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (2 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 05/15] xsk: add support for bind for Rx Björn Töpel
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Another setsockopt (XDP_RX_RING) is added to let the process allocate
a queue, where the kernel can pass completed Rx frames to the user
process.

The mmapping of the queue is done using the XDP_PGOFF_RX_RING offset.
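
A user-space sketch of draining this ring (RX_SIZE is the entries
value given to XDP_RX_RING; process(), umem_base and FRAME_SIZE are
illustrative, the mmap would normally be done once at setup, and the
barrier handling is simplified):

      #include <linux/if_xdp.h>
      #include <sys/mman.h>

      #define RX_SIZE    1024
      #define FRAME_SIZE 2048

      static void rx_drain(int fd, void *umem_base)
      {
              struct xdp_rxtx_ring *rx;
              __u32 cons;

              rx = mmap(NULL, sizeof(struct xdp_ring) +
                        RX_SIZE * sizeof(struct xdp_desc),
                        PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                        fd, XDP_PGOFF_RX_RING);

              for (cons = rx->ptrs.consumer;
                   cons != rx->ptrs.producer; cons++) {
                      struct xdp_desc *d = &rx->desc[cons & (RX_SIZE - 1)];
                      /* d->idx names the UMEM frame, d->offset the start
                       * of packet data within it, d->len the length. */
                      void *pkt = (char *)umem_base +
                                  (__u64)d->idx * FRAME_SIZE + d->offset;

                      process(pkt, d->len);
              }
              rx->ptrs.consumer = cons;
      }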

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/if_xdp.h | 16 ++++++++++++++++
 net/xdp/xsk.c               | 42 +++++++++++++++++++++++++++++++++---------
 net/xdp/xsk_queue.c         | 11 +++++++++--
 net/xdp/xsk_queue.h         |  2 +-
 4 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 975661e1baca..65324558829d 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -22,6 +22,7 @@
 #include <linux/types.h>
 
 /* XDP socket options */
+#define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 
@@ -33,13 +34,28 @@ struct xdp_umem_reg {
 };
 
 /* Pgoff for mmaping the rings */
+#define XDP_PGOFF_RX_RING			  0
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
 
+struct xdp_desc {
+	__u32 idx;
+	__u32 len;
+	__u16 offset;
+	__u8 flags;
+	__u8 padding[5];
+};
+
 struct xdp_ring {
 	__u32 producer __attribute__((aligned(64)));
 	__u32 consumer __attribute__((aligned(64)));
 };
 
+/* Used for the RX and TX queues for packets */
+struct xdp_rxtx_ring {
+	struct xdp_ring ptrs;
+	struct xdp_desc desc[0] __attribute__((aligned(64)));
+};
+
 /* Used for the fill and completion queues for buffers */
 struct xdp_umem_ring {
 	struct xdp_ring ptrs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index bf6a1151df28..1f448d1a9409 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -38,6 +38,8 @@
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
+	struct xsk_queue *rx;
+	struct net_device *dev;
 	struct xdp_umem *umem;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
@@ -48,14 +50,15 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
-static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
+			  bool umem_queue)
 {
 	struct xsk_queue *q;
 
 	if (entries == 0 || *queue || !is_power_of_2(entries))
 		return -EINVAL;
 
-	q = xskq_create(entries);
+	q = xskq_create(entries, umem_queue);
 	if (!q)
 		return -ENOMEM;
 
@@ -97,6 +100,22 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return -ENOPROTOOPT;
 
 	switch (optname) {
+	case XDP_RX_RING:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (optlen < sizeof(entries))
+			return -EINVAL;
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->rx;
+		err = xsk_init_queue(entries, q, false);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	case XDP_UMEM_REG:
 	{
 		struct xdp_umem_reg mr;
@@ -138,7 +157,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 		mutex_lock(&xs->mutex);
 		q = &xs->umem->fq;
-		err = xsk_init_queue(entries, q);
+		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
@@ -159,13 +178,17 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 	unsigned long pfn;
 	struct page *qpg;
 
-	if (!xs->umem)
-		return -EINVAL;
+	if (offset == XDP_PGOFF_RX_RING) {
+		q = xs->rx;
+	} else {
+		if (!xs->umem)
+			return -EINVAL;
 
-	if (offset == XDP_UMEM_PGOFF_FILL_RING)
-		q = xs->umem->fq;
-	else
-		return -EINVAL;
+		if (offset == XDP_UMEM_PGOFF_FILL_RING)
+			q = xs->umem->fq;
+		else
+			return -EINVAL;
+	}
 
 	qpg = virt_to_head_page(q->ring);
 	if (size > (PAGE_SIZE << compound_order(qpg)))
@@ -210,6 +233,7 @@ static void xsk_destruct(struct sock *sk)
 	if (!sock_flag(sk, SOCK_DEAD))
 		return;
 
+	xskq_destroy(xs->rx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 23da4f29d3fb..894f9f89afc7 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -21,7 +21,13 @@ static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
 }
 
-struct xsk_queue *xskq_create(u32 nentries)
+static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
+{
+	return (sizeof(struct xdp_ring) +
+		q->nentries * sizeof(struct xdp_desc));
+}
+
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
 {
 	struct xsk_queue *q;
 	gfp_t gfp_flags;
@@ -36,7 +42,8 @@ struct xsk_queue *xskq_create(u32 nentries)
 
 	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
 		    __GFP_COMP  | __GFP_NORETRY;
-	size = xskq_umem_get_ring_size(q);
+	size = umem_queue ? xskq_umem_get_ring_size(q) :
+	       xskq_rxtx_get_ring_size(q);
 
 	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
 						      get_order(size));
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 7eb556bf73be..5439fa381763 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -32,7 +32,7 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
-struct xsk_queue *xskq_create(u32 nentries);
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q);
 
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1


* [PATCH bpf-next 05/15] xsk: add support for bind for Rx
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (3 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 04/15] xsk: add Rx queue setup and mmap support Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-24 16:55   ` Willem de Bruijn
  2018-04-23 13:56 ` [PATCH bpf-next 06/15] xdp: introduce xdp_return_buff API Björn Töpel
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the bind syscall is added. Binding an AF_XDP socket means
associating the socket with an umem, a netdev and a queue index. This
can be done in two ways.

The first way is creating a "socket from scratch": create the umem
using the XDP_UMEM_REG setsockopt and an associated fill ring with
XDP_UMEM_FILL_RING. Create the Rx ring using the XDP_RX_RING
setsockopt. Then call bind, passing the ifindex and queue index
("channel" in ethtool speak).

The second way to bind a socket is to simply skip the
umem/netdev/queue index setup and pass an already set up AF_XDP
socket instead. The new socket will then have the same
umem/netdev/queue index as the parent, so it will share the same
umem. You must also set the flags field in the socket address to
XDP_SHARED_UMEM.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  11 ++++
 net/xdp/xdp_umem.c          |   9 ++++
 net/xdp/xdp_umem.h          |   2 +
 net/xdp/xsk.c               | 125 +++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         |   8 +++
 net/xdp/xsk_queue.h         |   1 +
 6 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 65324558829d..e5091881f776 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -21,6 +21,17 @@
 
 #include <linux/types.h>
 
+/* Options for the sxdp_flags field */
+#define XDP_SHARED_UMEM 1
+
+struct sockaddr_xdp {
+	__u16 sxdp_family;
+	__u32 sxdp_ifindex;
+	__u32 sxdp_queue_id;
+	__u32 sxdp_shared_umem_fd;
+	__u16 sxdp_flags;
+};
+
 /* XDP socket options */
 #define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 6fc233e03f30..6b36bb365c01 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -93,6 +93,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	kfree(umem);
 }
 
+void xdp_get_umem(struct xdp_umem *umem)
+{
+	atomic_inc(&umem->users);
+}
+
 void xdp_put_umem(struct xdp_umem *umem)
 {
 	if (!umem)
@@ -240,3 +245,7 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	return err;
 }
 
+bool xdp_umem_validate_queues(struct xdp_umem *umem)
+{
+	return umem->fq;
+}
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 3086091aebdd..e4653f6c52a6 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -37,7 +37,9 @@ struct xdp_umem {
 	atomic_t users;
 };
 
+bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
+void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
 int xdp_umem_create(struct xdp_umem **umem);
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 1f448d1a9409..59aa02a88b6b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,6 +41,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 };
@@ -66,9 +67,18 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 	return 0;
 }
 
+static void __xsk_release(struct xdp_sock *xs)
+{
+	/* Wait for driver to stop using the xdp socket. */
+	synchronize_net();
+
+	dev_put(xs->dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
 	struct net *net;
 
 	if (!sk)
@@ -80,6 +90,11 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	if (xs->dev) {
+		__xsk_release(xs);
+		xs->dev = NULL;
+	}
+
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -89,6 +104,114 @@ static int xsk_release(struct socket *sock)
 	return 0;
 }
 
+static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
+{
+	struct socket *sock;
+
+	*err = -ENOTSOCK;
+	sock = sockfd_lookup(fd, err);
+	if (!sock)
+		return NULL;
+
+	if (sock->sk->sk_family != PF_XDP) {
+		*err = -ENOPROTOOPT;
+		sockfd_put(sock);
+		return NULL;
+	}
+
+	*err = 0;
+	return sock;
+}
+
+static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
+	struct sock *sk = sock->sk;
+	struct net_device *dev, *dev_curr;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct xdp_umem *old_umem = NULL;
+	int err = 0;
+
+	if (addr_len < sizeof(struct sockaddr_xdp))
+		return -EINVAL;
+	if (sxdp->sxdp_family != AF_XDP)
+		return -EINVAL;
+
+	mutex_lock(&xs->mutex);
+	dev_curr = xs->dev;
+	dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex);
+	if (!dev) {
+		err = -ENODEV;
+		goto out_release;
+	}
+
+	if (!xs->rx) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_queue_id >= dev->num_rx_queues) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_flags & XDP_SHARED_UMEM) {
+		struct xdp_sock *umem_xs;
+		struct socket *sock;
+
+		if (xs->umem) {
+			/* We have already our own. */
+			err = -EINVAL;
+			goto out_unlock;
+		}
+
+		sock = xsk_lookup_xsk_from_fd(sxdp->sxdp_shared_umem_fd, &err);
+		if (!sock)
+			goto out_unlock;
+
+		umem_xs = xdp_sk(sock->sk);
+		if (!umem_xs->umem) {
+			/* No umem to inherit. */
+			err = -EBADF;
+			sockfd_put(sock);
+			goto out_unlock;
+		} else if (umem_xs->dev != dev ||
+			   umem_xs->queue_id != sxdp->sxdp_queue_id) {
+			err = -EINVAL;
+			sockfd_put(sock);
+			goto out_unlock;
+		}
+
+		xdp_get_umem(umem_xs->umem);
+		old_umem = xs->umem;
+		xs->umem = umem_xs->umem;
+		sockfd_put(sock);
+	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Rebind? */
+	if (dev_curr && (dev_curr != dev ||
+			 xs->queue_id != sxdp->sxdp_queue_id)) {
+		__xsk_release(xs);
+		if (old_umem)
+			xdp_put_umem(old_umem);
+	}
+
+	xs->dev = dev;
+	xs->queue_id = sxdp->sxdp_queue_id;
+
+	xskq_set_umem(xs->rx, &xs->umem->props);
+
+out_unlock:
+	if (err)
+		dev_put(dev);
+out_release:
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
 static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, unsigned int optlen)
 {
@@ -209,7 +332,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.family =	PF_XDP,
 	.owner =	THIS_MODULE,
 	.release =	xsk_release,
-	.bind =		sock_no_bind,
+	.bind =		xsk_bind,
 	.connect =	sock_no_connect,
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 894f9f89afc7..d012e5e23591 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -16,6 +16,14 @@
 
 #include "xsk_queue.h"
 
+void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props)
+{
+	if (!q)
+		return;
+
+	q->umem_props = *umem_props;
+}
+
 static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 {
 	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 5439fa381763..9ddd2ee07a84 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -32,6 +32,7 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q);
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 06/15] xdp: introduce xdp_return_buff API
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (4 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 05/15] xsk: add support for bind for Rx Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support Björn Töpel
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Add xdp_return_buff, which is analogous to xdp_return_frame, but acts
upon a struct xdp_buff. The API will be used by AF_XDP in future
commits.
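
A minimal sketch of how a hypothetical consumer's error path could
use it (hand_off() is made up for illustration and is not part of
this series):

	static int consume(struct xdp_buff *xdp)
	{
		int err = hand_off(xdp);	/* hypothetical next stage */

		if (err)
			xdp_return_buff(xdp);	/* frees via xdp->rxq->mem */
		return err;
	}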

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h |  1 +
 net/core/xdp.c    | 15 ++++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 137ad5f9f40f..0b689cf561c7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -104,6 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 }
 
 void xdp_return_frame(struct xdp_frame *xdpf);
+void xdp_return_buff(struct xdp_buff *xdp);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 0c86b53a3a63..bf6758f74339 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -308,11 +308,9 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(struct xdp_frame *xdpf)
+static void xdp_return(void *data, struct xdp_mem_info *mem)
 {
-	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
-	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {
@@ -339,4 +337,15 @@ void xdp_return_frame(struct xdp_frame *xdpf)
 		break;
 	}
 }
+
+void xdp_return_frame(struct xdp_frame *xdpf)
+{
+	xdp_return(xdpf->data, &xdpf->mem);
+}
 EXPORT_SYMBOL_GPL(xdp_return_frame);
+
+void xdp_return_buff(struct xdp_buff *xdp)
+{
+	xdp_return(xdp->data, &xdp->rxq->mem);
+}
+EXPORT_SYMBOL_GPL(xdp_return_buff);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (5 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 06/15] xdp: introduce xdp_return_buff API Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-24 16:56   ` Willem de Bruijn
  2018-04-23 13:56 ` [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Here, the actual receive functions of AF_XDP are implemented. They
will be called from the XDP layers in a later commit.

There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).

Support for the poll syscall is also implemented.
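
From user space, Rx readiness can then be awaited with poll() as on
any other socket. A minimal sketch, assuming xsk_fd is a bound AF_XDP
socket with an Rx ring (error handling omitted):

	#include <poll.h>

	static int wait_for_rx(int xsk_fd)
	{
		struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

		if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
			return 0;	/* Rx descriptors ready to drain */
		return -1;
	}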

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  40 ++++++++++++++++++
 net/xdp/xdp_umem.h     |  18 ++++++++
 net/xdp/xsk.c          |  76 ++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h    | 112 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 244 insertions(+), 2 deletions(-)
 create mode 100644 include/net/xdp_sock.h

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
new file mode 100644
index 000000000000..bf5dd505e65c
--- /dev/null
+++ b/include/net/xdp_sock.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * AF_XDP internal functions
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_SOCK_H
+#define _LINUX_XDP_SOCK_H
+
+struct xdp_sock;
+struct xdp_buff;
+#ifdef CONFIG_XDP_SOCKETS
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+void xsk_flush(struct xdp_sock *xs);
+#else
+static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline void xsk_flush(struct xdp_sock *xs)
+{
+}
+#endif /* CONFIG_XDP_SOCKETS */
+
+#endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index e4653f6c52a6..8706c904d732 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -37,6 +37,24 @@ struct xdp_umem {
 	atomic_t users;
 };
 
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
+{
+	u64 pg, off;
+	char *data;
+
+	pg = idx >> umem->nfpplog2;
+	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
+
+	data = page_address(umem->pgs[pg]);
+	return data + off;
+}
+
+static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
+						    u32 idx)
+{
+	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+}
+
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59aa02a88b6b..bdfb529608e8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -30,7 +30,9 @@
 #include <linux/uaccess.h>
 #include <linux/net.h>
 #include <linux/netdevice.h>
+#include <net/xdp_sock.h>
 #include <net/sock.h>
+#include <net/xdp.h>
 
 #include "xsk_queue.h"
 #include "xdp_umem.h"
@@ -41,6 +43,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	u64 rx_dropped;
 	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
@@ -51,6 +54,74 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	u32 *id, len = xdp->data_end - xdp->data;
+	void *buffer;
+	int err = 0;
+
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	id = xskq_peek_id(xs->umem->fq);
+	if (!id)
+		return -ENOSPC;
+
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, *id, len,
+				      xs->umem->frame_headroom);
+	if (!err)
+		xskq_discard_id(xs->umem->fq);
+
+	return err;
+}
+
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (likely(!err))
+		xdp_return_buff(xdp);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+void xsk_flush(struct xdp_sock *xs)
+{
+	xskq_produce_flush_desc(xs->rx);
+	xs->sk.sk_data_ready(&xs->sk);
+}
+
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (!err)
+		xsk_flush(xs);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+static unsigned int xsk_poll(struct file *file, struct socket *sock,
+			     struct poll_table_struct *wait)
+{
+	unsigned int mask = datagram_poll(file, sock, wait);
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (xs->rx && !xskq_empty_desc(xs->rx))
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
 static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 			  bool umem_queue)
 {
@@ -189,6 +260,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
 		err = -EINVAL;
 		goto out_unlock;
+	} else {
+		/* This xsk has its own umem. */
+		xskq_set_umem(xs->umem->fq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -337,7 +411,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	sock_no_getname,
-	.poll =		sock_no_poll,
+	.poll =		xsk_poll,
 	.ioctl =	sock_no_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 9ddd2ee07a84..2ae913fb7a09 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -20,6 +20,8 @@
 
 #include "xdp_umem_props.h"
 
+#define RX_BATCH_SIZE 16
+
 struct xsk_queue {
 	struct xdp_umem_props umem_props;
 	u32 ring_mask;
@@ -32,8 +34,116 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+/* Common functions operating for both RXTX and umem queues */
+
+static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
+{
+	u32 entries = q->prod_tail - q->cons_tail;
+
+	if (entries == 0) {
+		/* Refresh the local pointer */
+		q->prod_tail = READ_ONCE(q->ring->producer);
+	}
+
+	entries = q->prod_tail - q->cons_tail;
+	return (entries > dcnt) ? dcnt : entries;
+}
+
+static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
+{
+	u32 free_entries = q->nentries - (producer - q->cons_tail);
+
+	if (free_entries >= dcnt)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cons_tail = READ_ONCE(q->ring->consumer);
+	return q->nentries - (producer - q->cons_tail);
+}
+
+/* UMEM queue */
+
+static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx)
+{
+	if (unlikely(idx >= q->umem_props.nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+	return true;
+}
+
+static inline u32 *xskq_validate_id(struct xsk_queue *q)
+{
+	while (q->cons_tail != q->cons_head) {
+		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+		unsigned int idx = q->cons_tail & q->ring_mask;
+
+		if (xskq_is_valid_id(q, ring->desc[idx]))
+			return &ring->desc[idx];
+
+		q->cons_tail++;
+	}
+
+	return NULL;
+}
+
+static inline u32 *xskq_peek_id(struct xsk_queue *q)
+{
+	struct xdp_umem_ring *ring;
+
+	if (q->cons_tail == q->cons_head) {
+		WRITE_ONCE(q->ring->consumer, q->cons_tail);
+		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+		/* Order consumer and data */
+		smp_rmb();
+
+		return xskq_validate_id(q);
+	}
+
+	ring = (struct xdp_umem_ring *)q->ring;
+	return &ring->desc[q->cons_tail & q->ring_mask];
+}
+
+static inline void xskq_discard_id(struct xsk_queue *q)
+{
+	q->cons_tail++;
+	(void)xskq_validate_id(q);
+}
+
+/* Rx queue */
+
+static inline int xskq_produce_batch_desc(struct xsk_queue *q,
+					  u32 id, u32 len, u16 offset)
+{
+	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+	unsigned int idx;
+
+	if (xskq_nb_free(q, q->prod_head, 1) == 0)
+		return -ENOSPC;
+
+	idx = (q->prod_head++) & q->ring_mask;
+	ring->desc[idx].idx = id;
+	ring->desc[idx].len = len;
+	ring->desc[idx].offset = offset;
+
+	return 0;
+}
+
+static inline void xskq_produce_flush_desc(struct xsk_queue *q)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail = q->prod_head;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
+static inline bool xskq_empty_desc(struct xsk_queue *q)
+{
+	return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
+}
+
 void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
-void xskq_destroy(struct xsk_queue *q);
+void xskq_destroy(struct xsk_queue *q_ops);
 
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (6 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-24 16:56   ` Willem de Bruijn
  2018-04-23 13:56 ` [PATCH bpf-next 09/15] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to a certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.
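
A minimal sketch of such an XDP program (map name and max_entries are
illustrative, not taken from this series):

	#include <linux/bpf.h>
	#include "bpf_helpers.h"

	struct bpf_map_def SEC("maps") xsks_map = {
		.type = BPF_MAP_TYPE_XSKMAP,
		.key_size = sizeof(int),
		.value_size = sizeof(int),
		.max_entries = 4,
	};

	SEC("xdp_sock")
	int xdp_sock_prog(struct xdp_md *ctx)
	{
		/* Redirect to the socket stored at the Rx queue's index. */
		return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
	}

	char _license[] SEC("license") = "GPL";

User space populates the map with an ordinary bpf_map_update_elem(),
using the queue index as key and the AF_XDP socket fd as value.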

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/bpf.h       |  26 +++++
 include/linux/bpf_types.h |   3 +
 include/net/xdp_sock.h    |   6 +
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/Makefile       |   3 +
 kernel/bpf/verifier.c     |   8 +-
 kernel/bpf/xskmap.c       | 286 ++++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.c             |   5 +
 8 files changed, 336 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/xskmap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ee5275e7d4df..3fd7252a2a32 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -674,6 +674,32 @@ static inline int sock_map_prog(struct bpf_map *map,
 }
 #endif
 
+#if defined(CONFIG_XDP_SOCKETS)
+struct xdp_sock;
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key);
+int __xsk_map_redirect(struct bpf_map *map, u32 index,
+		       struct xdp_buff *xdp, struct xdp_sock *xs);
+void __xsk_map_flush(struct bpf_map *map);
+#else
+struct xdp_sock;
+static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
+						     u32 key)
+{
+	return NULL;
+}
+
+static inline int __xsk_map_redirect(struct bpf_map *map, u32 index,
+				     struct xdp_buff *xdp,
+				     struct xdp_sock *xs)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void __xsk_map_flush(struct bpf_map *map)
+{
+}
+#endif
+
 /* verifier prototypes for helper functions called from eBPF programs */
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf6f6ae..d7df1b323082 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -49,4 +49,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
+#if defined(CONFIG_XDP_SOCKETS)
+BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
+#endif
 #endif
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index bf5dd505e65c..e09ae39417bb 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -21,6 +21,7 @@ struct xdp_buff;
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -35,6 +36,11 @@ static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 static inline void xsk_flush(struct xdp_sock *xs)
 {
 }
+
+static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
+{
+	return false;
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8383a289f7b..5b0c1492e4d2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_XSKMAP,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 35c485fa9ea3..f27f5496d6fe 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,6 +8,9 @@ obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
+ifeq ($(CONFIG_XDP_SOCKETS),y)
+obj-$(CONFIG_BPF_SYSCALL) += xskmap.o
+endif
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 ifeq ($(CONFIG_STREAM_PARSER),y)
 ifeq ($(CONFIG_INET),y)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb902bf..7091a05af536 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2071,8 +2071,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
-	/* Restrict bpf side of cpumap, open when use-cases appear */
+	/* Restrict bpf side of cpumap and xskmap, open when use-cases
+	 * appear.
+	 */
 	case BPF_MAP_TYPE_CPUMAP:
+	case BPF_MAP_TYPE_XSKMAP:
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
@@ -2119,7 +2122,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_redirect_map:
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
-		    map->map_type != BPF_MAP_TYPE_CPUMAP)
+		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
+		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
 	case BPF_FUNC_sk_redirect_map:
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
new file mode 100644
index 000000000000..a31be2670edc
--- /dev/null
+++ b/kernel/bpf/xskmap.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XSKMAP used for AF_XDP sockets
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/capability.h>
+#include <net/xdp_sock.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <net/sock.h>
+
+struct xsk_map_entry {
+	struct xdp_sock *xs;
+	struct rcu_head rcu;
+};
+
+struct xsk_map {
+	struct bpf_map map;
+	struct xsk_map_entry **xsk_map;
+	unsigned long __percpu *flush_needed;
+};
+
+static u64 xsk_map_bitmap_size(const union bpf_attr *attr)
+{
+	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
+}
+
+static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
+{
+	struct xsk_map *m;
+	int err = -EINVAL;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size != 4 ||
+	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
+		return ERR_PTR(-EINVAL);
+
+	m = kzalloc(sizeof(*m), GFP_USER);
+	if (!m)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&m->map, attr);
+
+	cost = (u64)m->map.max_entries * sizeof(struct xsk_map_entry *);
+	cost += xsk_map_bitmap_size(attr) * num_possible_cpus();
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_m;
+
+	m->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* Notice returns -EPERM if map size is larger than memlock limit */
+	err = bpf_map_precharge_memlock(m->map.pages);
+	if (err)
+		goto free_m;
+
+	m->flush_needed = __alloc_percpu(xsk_map_bitmap_size(attr),
+					    __alignof__(unsigned long));
+	if (!m->flush_needed)
+		goto free_m;
+
+	m->xsk_map = bpf_map_area_alloc(m->map.max_entries *
+					   sizeof(struct xsk_map_entry *),
+					   m->map.numa_node);
+	if (!m->xsk_map)
+		goto free_percpu;
+	return &m->map;
+
+free_percpu:
+	free_percpu(m->flush_needed);
+free_m:
+	kfree(m);
+	return ERR_PTR(err);
+}
+
+static void xsk_map_free(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	int i, cpu;
+
+	/* At this point bpf_prog->aux->refcnt == 0 and this
+	 * map->refcnt == 0, so the programs (can be more than one
+	 * that used this map) were disconnected from events. Wait for
+	 * outstanding critical sections in these programs to
+	 * complete. The rcu critical section only guarantees no
+	 * further reads against xsk_map. It does __not__ ensure
+	 * pending flush operations (if any) are complete.
+	 */
+
+	synchronize_rcu();
+
+	/* To ensure all pending flush operations have completed wait
+	 * for flush bitmap to indicate all flush_needed bits to be
+	 * zero on _all_ cpus.  Because the above synchronize_rcu()
+	 * ensures the map is disconnected from the program we can
+	 * assume no new bits will be set.
+	 */
+	for_each_online_cpu(cpu) {
+		unsigned long *bitmap = per_cpu_ptr(m->flush_needed, cpu);
+
+		while (!bitmap_empty(bitmap, map->max_entries))
+			cond_resched();
+	}
+
+	for (i = 0; i < map->max_entries; i++) {
+		struct xsk_map_entry *entry;
+
+		entry = m->xsk_map[i];
+		if (!entry)
+			continue;
+
+		sock_put((struct sock *)entry->xs);
+		kfree(entry);
+	}
+
+	free_percpu(m->flush_needed);
+	bpf_map_area_free(m->xsk_map);
+	kfree(m);
+}
+
+static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = next_key;
+
+	if (index >= m->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == m->map.max_entries - 1)
+		return -ENOENT;
+	*next = index + 1;
+	return 0;
+}
+
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xsk_map_entry *entry;
+
+	if (key >= map->max_entries)
+		return NULL;
+
+	entry = READ_ONCE(m->xsk_map[key]);
+	return entry ? entry->xs : NULL;
+}
+
+int __xsk_map_redirect(struct bpf_map *map, u32 index,
+		       struct xdp_buff *xdp, struct xdp_sock *xs)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	unsigned long *bitmap = this_cpu_ptr(m->flush_needed);
+	int err;
+
+	err = xsk_rcv(xs, xdp);
+	if (err)
+		return err;
+
+	__set_bit(index, bitmap);
+	return 0;
+}
+
+void __xsk_map_flush(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	unsigned long *bitmap = this_cpu_ptr(m->flush_needed);
+	u32 bit;
+
+	for_each_set_bit(bit, bitmap, map->max_entries) {
+		struct xsk_map_entry *entry = READ_ONCE(m->xsk_map[bit]);
+
+		/* This is possible if the entry is removed by user
+		 * space between xdp redirect and flush op.
+		 */
+		if (unlikely(!entry))
+			continue;
+
+		__clear_bit(bit, bitmap);
+		xsk_flush(entry->xs);
+	}
+}
+
+static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
+static void __xsk_map_entry_free(struct rcu_head *rcu)
+{
+	struct xsk_map_entry *entry;
+
+	entry = container_of(rcu, struct xsk_map_entry, rcu);
+	xsk_flush(entry->xs);
+	sock_put((struct sock *)entry->xs);
+	kfree(entry);
+}
+
+static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
+			       u64 map_flags)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xsk_map_entry *entry, *old_entry;
+	u32 i = *(u32 *)key, fd = *(u32 *)value;
+	struct socket *sock;
+	int err;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+	if (unlikely(i >= m->map.max_entries))
+		return -E2BIG;
+	if (unlikely(map_flags == BPF_NOEXIST))
+		return -EEXIST;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return err;
+
+	if (sock->sk->sk_family != PF_XDP) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	if (!xsk_is_setup_for_bpf_map((struct xdp_sock *)sock->sk)) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	entry = kmalloc_node(sizeof(*entry), GFP_ATOMIC | __GFP_NOWARN,
+			     map->numa_node);
+	if (!entry) {
+		sockfd_put(sock);
+		return -ENOMEM;
+	}
+
+	sock_hold(sock->sk);
+	entry->xs = (struct xdp_sock *)sock->sk;
+
+	old_entry = xchg(&m->xsk_map[i], entry);
+	if (old_entry)
+		call_rcu(&old_entry->rcu, __xsk_map_entry_free);
+
+	sockfd_put(sock);
+	return 0;
+}
+
+static int xsk_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xsk_map_entry *old_entry;
+	int k = *(u32 *)key;
+
+	if (k >= map->max_entries)
+		return -EINVAL;
+
+	old_entry = xchg(&m->xsk_map[k], NULL);
+	if (old_entry)
+		call_rcu(&old_entry->rcu, __xsk_map_entry_free);
+
+	return 0;
+}
+
+const struct bpf_map_ops xsk_map_ops = {
+	.map_alloc = xsk_map_alloc,
+	.map_free = xsk_map_free,
+	.map_get_next_key = xsk_map_get_next_key,
+	.map_lookup_elem = xsk_map_lookup_elem,
+	.map_update_elem = xsk_map_update_elem,
+	.map_delete_elem = xsk_map_delete_elem,
+};
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index bdfb529608e8..2ae501a8814a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -54,6 +54,11 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
+{
+	return !!xs->rx;
+}
+
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	u32 *id, len = xdp->data_end - xdp->data;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 09/15] xsk: wire up XDP_DRV side of AF_XDP
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (7 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 10/15] xsk: wire up XDP_SKB " Björn Töpel
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to the XDP_DRV layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/core/filter.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e25bc4a3aa1a..f053cc799253 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2800,7 +2800,8 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 {
 	int err;
 
-	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP: {
 		struct net_device *dev = fwd;
 		struct xdp_frame *xdpf;
 
@@ -2818,14 +2819,25 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);
-
-	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP) {
+		break;
+	}
+	case BPF_MAP_TYPE_CPUMAP: {
 		struct bpf_cpu_map_entry *rcpu = fwd;
 
 		err = cpu_map_enqueue(rcpu, xdp, dev_rx);
 		if (err)
 			return err;
 		__cpu_map_insert_ctx(map, index);
+		break;
+	}
+	case BPF_MAP_TYPE_XSKMAP: {
+		struct xdp_sock *xs = fwd;
+
+		err = __xsk_map_redirect(map, index, xdp, xs);
+		return err;
+	}
+	default:
+		break;
 	}
 	return 0;
 }
@@ -2844,6 +2856,9 @@ void xdp_do_flush_map(void)
 		case BPF_MAP_TYPE_CPUMAP:
 			__cpu_map_flush(map);
 			break;
+		case BPF_MAP_TYPE_XSKMAP:
+			__xsk_map_flush(map);
+			break;
 		default:
 			break;
 		}
@@ -2858,6 +2873,8 @@ static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 		return __dev_map_lookup_elem(map, index);
 	case BPF_MAP_TYPE_CPUMAP:
 		return __cpu_map_lookup_elem(map, index);
+	case BPF_MAP_TYPE_XSKMAP:
+		return __xsk_map_lookup_elem(map, index);
 	default:
 		return NULL;
 	}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 10/15] xsk: wire up XDP_SKB side of AF_XDP
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (8 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 09/15] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 11/15] xsk: add umem completion queue support and mmap Björn Töpel
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to the XDP_SKB layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h |  2 +-
 net/core/dev.c         | 34 ++++++++++++++++++----------------
 net/core/filter.c      | 17 ++++++++++++++---
 3 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b2308174..6ab9a6765b00 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -759,7 +759,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
  * This does not appear to be a real limitation for existing software.
  */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *prog);
+			    struct xdp_buff *xdp, struct bpf_prog *prog);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
diff --git a/net/core/dev.c b/net/core/dev.c
index c624a04dad1f..6e8e35af9a8b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3994,12 +3994,12 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 }
 
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
 				     struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
 	u32 metalen, act = XDP_DROP;
-	struct xdp_buff xdp;
 	int hlen, off;
 	u32 mac_len;
 
@@ -4034,19 +4034,19 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
 	hlen = skb_headlen(skb) + mac_len;
-	xdp.data = skb->data - mac_len;
-	xdp.data_meta = xdp.data;
-	xdp.data_end = xdp.data + hlen;
-	xdp.data_hard_start = skb->data - skb_headroom(skb);
-	orig_data_end = xdp.data_end;
-	orig_data = xdp.data;
+	xdp->data = skb->data - mac_len;
+	xdp->data_meta = xdp->data;
+	xdp->data_end = xdp->data + hlen;
+	xdp->data_hard_start = skb->data - skb_headroom(skb);
+	orig_data_end = xdp->data_end;
+	orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
-	xdp.rxq = &rxqueue->xdp_rxq;
+	xdp->rxq = &rxqueue->xdp_rxq;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	off = xdp.data - orig_data;
+	off = xdp->data - orig_data;
 	if (off > 0)
 		__skb_pull(skb, off);
 	else if (off < 0)
@@ -4056,9 +4056,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* check if bpf_xdp_adjust_tail was used. it can only "shrink"
 	 * pckt.
 	 */
-	off = orig_data_end - xdp.data_end;
+	off = orig_data_end - xdp->data_end;
 	if (off != 0)
-		skb_set_tail_pointer(skb, xdp.data_end - xdp.data);
+		skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
 
 	switch (act) {
 	case XDP_REDIRECT:
@@ -4066,7 +4066,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, mac_len);
 		break;
 	case XDP_PASS:
-		metalen = xdp.data - xdp.data_meta;
+		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);
 		break;
@@ -4116,17 +4116,19 @@ static struct static_key generic_xdp_needed __read_mostly;
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 {
 	if (xdp_prog) {
-		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		struct xdp_buff xdp;
+		u32 act;
 		int err;
 
+		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
 				err = xdp_do_generic_redirect(skb->dev, skb,
-							      xdp_prog);
+							      &xdp, xdp_prog);
 				if (err)
 					goto out_redir;
-			/* fallthru to submit skb */
+				break;
 			case XDP_TX:
 				generic_xdp_tx(skb, xdp_prog);
 				break;
diff --git a/net/core/filter.c b/net/core/filter.c
index f053cc799253..315bf3b8d576 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -58,6 +58,7 @@
 #include <net/busy_poll.h>
 #include <net/tcp.h>
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2972,13 +2973,14 @@ static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
+				       struct xdp_buff *xdp,
 				       struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	unsigned long map_owner = ri->map_owner;
 	struct bpf_map *map = ri->map;
-	struct net_device *fwd = NULL;
 	u32 index = ri->ifindex;
+	void *fwd = NULL;
 	int err = 0;
 
 	ri->ifindex = 0;
@@ -3000,6 +3002,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 		if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
 			goto err;
 		skb->dev = fwd;
+		generic_xdp_tx(skb, xdp_prog);
+	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
+		struct xdp_sock *xs = fwd;
+
+		err = xsk_generic_rcv(xs, xdp);
+		if (err)
+			goto err;
+		consume_skb(skb);
 	} else {
 		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
 		err = -EBADRQC;
@@ -3014,7 +3024,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 }
 
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *xdp_prog)
+			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	u32 index = ri->ifindex;
@@ -3022,7 +3032,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	int err = 0;
 
 	if (ri->map)
-		return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
+		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog);
 
 	ri->ifindex = 0;
 	fwd = dev_get_by_index_rcu(dev_net(dev), index);
@@ -3036,6 +3046,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 	_trace_xdp_redirect(dev, xdp_prog, index);
+	generic_xdp_tx(skb, xdp_prog);
 	return 0;
 err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 11/15] xsk: add umem completion queue support and mmap
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (9 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 10/15] xsk: wire up XDP_SKB " Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 12/15] xsk: add Tx queue setup and mmap support Björn Töpel
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_COMPLETION_RING. Using this socket option, the process
can ask the kernel to allocate a queue (ring buffer) and also mmap it
(XDP_UMEM_PGOFF_COMPLETION_RING) into the process.

The queue is used to explicitly pass ownership of umem frames from the
kernel to the user process. This will be used by the TX path to tell
user space that a certain frame has been transmitted and user space
can then use it for something else, if it wishes.
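
A user-space sketch, assuming fd is an AF_XDP socket and that the ring
structs come from the if_xdp.h header extended earlier in this series
(error handling abbreviated):

	#include <linux/if_xdp.h>
	#include <sys/mman.h>
	#include <sys/socket.h>

	static void *setup_cq(int fd, int entries)
	{
		size_t size = sizeof(struct xdp_umem_ring) +
			      entries * sizeof(__u32);
		void *cq;

		if (setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
			       &entries, sizeof(entries)))
			return NULL;
		cq = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_POPULATE, fd,
			  XDP_UMEM_PGOFF_COMPLETION_RING);
		return cq == MAP_FAILED ? NULL : cq;
	}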

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xdp_umem.c          | 7 ++++++-
 net/xdp/xdp_umem.h          | 1 +
 net/xdp/xsk.c               | 7 ++++++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e5091881f776..71581a139f26 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -36,6 +36,7 @@ struct sockaddr_xdp {
 #define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
+#define XDP_UMEM_COMPLETION_RING	5
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -47,6 +48,7 @@ struct xdp_umem_reg {
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
+#define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000
 
 struct xdp_desc {
 	__u32 idx;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 6b36bb365c01..f1e835e46c03 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -67,6 +67,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->fq = NULL;
 	}
 
+	if (umem->cq) {
+		xskq_destroy(umem->cq);
+		umem->cq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
@@ -247,5 +252,5 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem)
 {
-	return umem->fq;
+	return (umem->fq && umem->cq);
 }
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 8706c904d732..f8c2e27dc105 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -23,6 +23,7 @@
 
 struct xdp_umem {
 	struct xsk_queue *fq;
+	struct xsk_queue *cq;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2ae501a8814a..e8eec4ac08d4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -268,6 +268,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else {
 		/* This xsk has its own umem. */
 		xskq_set_umem(xs->umem->fq, &xs->umem->props);
+		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -347,6 +348,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return 0;
 	}
 	case XDP_UMEM_FILL_RING:
+	case XDP_UMEM_COMPLETION_RING:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -358,7 +360,8 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->umem->fq;
+		q = (optname == XDP_UMEM_FILL_RING) ? &xs->umem->fq :
+			&xs->umem->cq;
 		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -388,6 +391,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 		if (offset == XDP_UMEM_PGOFF_FILL_RING)
 			q = xs->umem->fq;
+		else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING)
+			q = xs->umem->cq;
 		else
 			return -EINVAL;
 	}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 12/15] xsk: add Tx queue setup and mmap support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (10 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 11/15] xsk: add umem completion queue support and mmap Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 13:56 ` [PATCH bpf-next 13/15] xsk: support for Tx Björn Töpel
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Another setsockopt (XDP_TX_RING) is added to let the process allocate
a queue, where the user process can pass frames to be transmitted by
the kernel.

The mmapping of the queue is done using the XDP_PGOFF_TX_RING offset.
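
Setup mirrors the other rings; a sketch, with fd again being the
AF_XDP socket (error handling omitted):

	int entries = 1024;	/* ring size, power of two */
	size_t size = sizeof(struct xdp_rxtx_ring) +
		      entries * sizeof(struct xdp_desc);
	void *tx;

	setsockopt(fd, SOL_XDP, XDP_TX_RING, &entries, sizeof(entries));
	tx = mmap(NULL, size, PROT_READ | PROT_WRITE,
		  MAP_SHARED | MAP_POPULATE, fd, XDP_PGOFF_TX_RING);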

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xsk.c               | 9 +++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 71581a139f26..e2ea878d025c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -34,6 +34,7 @@ struct sockaddr_xdp {
 
 /* XDP socket options */
 #define XDP_RX_RING			1
+#define XDP_TX_RING			2
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
@@ -47,6 +48,7 @@ struct xdp_umem_reg {
 
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
+#define XDP_PGOFF_TX_RING		 0x80000000
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
 #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e8eec4ac08d4..cc55f6ba5d4d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -45,6 +45,7 @@ struct xdp_sock {
 	struct xdp_umem *umem;
 	u64 rx_dropped;
 	u16 queue_id;
+	struct xsk_queue *tx ____cacheline_aligned_in_smp;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 };
@@ -221,7 +222,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_release;
 	}
 
-	if (!xs->rx) {
+	if (!xs->rx && !xs->tx) {
 		err = -EINVAL;
 		goto out_unlock;
 	}
@@ -304,6 +305,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case XDP_RX_RING:
+	case XDP_TX_RING:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -314,7 +316,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->rx;
+		q = (optname == XDP_TX_RING) ? &xs->tx : &xs->rx;
 		err = xsk_init_queue(entries, q, false);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -385,6 +387,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 	if (offset == XDP_PGOFF_RX_RING) {
 		q = xs->rx;
+	} else if (offset == XDP_PGOFF_TX_RING) {
+		q = xs->tx;
 	} else {
 		if (!xs->umem)
 			return -EINVAL;
@@ -441,6 +445,7 @@ static void xsk_destruct(struct sock *sk)
 		return;
 
 	xskq_destroy(xs->rx);
+	xskq_destroy(xs->tx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 13/15] xsk: support for Tx
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (11 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 12/15] xsk: add Tx queue setup and mmap support Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-24 16:57   ` Willem de Bruijn
  2018-04-23 13:56 ` [PATCH bpf-next 14/15] xsk: statistics support Björn Töpel
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, Tx support is added. The user fills the Tx queue with frames to
be sent by the kernel, and lets the kernel know using the sendmsg
syscall.
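
In outline, user space does something like the following; the ring
bookkeeping (writing the descriptor into the mmapped Tx ring and
publishing the producer index with the proper barrier) is elided, see
the sample application in patch 15. Note that this version only
supports non-blocking sends, so MSG_DONTWAIT is required:

	/* frame_id and frame_len are placeholders for a filled umem frame. */
	struct xdp_desc d = { .idx = frame_id, .len = frame_len };
	struct msghdr msg = {0};

	/* ... publish d in the mmapped Tx ring, bump the producer ... */

	sendmsg(fd, &msg, MSG_DONTWAIT);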

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk.c       | 147 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h |  93 ++++++++++++++++++++++++++++++++-
 2 files changed, 238 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cc55f6ba5d4d..1c0b1ea10453 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -37,6 +37,8 @@
 #include "xsk_queue.h"
 #include "xdp_umem.h"
 
+#define TX_BATCH_SIZE 16
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -115,6 +117,146 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+static void xsk_destruct_skb(struct sk_buff *skb)
+{
+	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	WARN_ON_ONCE(xskq_produce_id(xs->umem->cq, id));
+
+	sock_wfree(skb);
+}
+
+static int xsk_xmit_skb(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	struct sk_buff *orig_skb = skb;
+	struct netdev_queue *txq;
+	int ret = NETDEV_TX_BUSY;
+	bool again = false;
+
+	if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
+		goto drop;
+
+	skb = validate_xmit_skb_list(skb, dev, &again);
+	if (skb != orig_skb)
+		return NET_XMIT_DROP;
+
+	txq = skb_get_tx_queue(dev, skb);
+
+	local_bh_disable();
+
+	HARD_TX_LOCK(dev, txq, smp_processor_id());
+	if (!netif_xmit_frozen_or_drv_stopped(txq))
+		ret = netdev_start_xmit(skb, dev, txq, false);
+	HARD_TX_UNLOCK(dev, txq);
+
+	local_bh_enable();
+
+	if (!dev_xmit_complete(ret))
+		goto out_err;
+
+	return ret;
+drop:
+	atomic_long_inc(&dev->tx_dropped);
+out_err:
+	return NET_XMIT_DROP;
+}
+
+static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
+			    size_t total_len)
+{
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
+	u32 max_batch = TX_BATCH_SIZE;
+	struct xdp_sock *xs = xdp_sk(sk);
+	bool sent_frame = false;
+	struct xdp_desc desc;
+	struct sk_buff *skb;
+	int err = 0;
+
+	if (unlikely(!xs->tx))
+		return -ENOBUFS;
+	if (need_wait)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xs->mutex);
+
+	while (xskq_peek_desc(xs->tx, &desc)) {
+		char *buffer;
+		u32 id, len;
+
+		if (max_batch-- == 0) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		if (xskq_reserve_id(xs->umem->cq)) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		len = desc.len;
+		if (unlikely(len > xs->dev->mtu)) {
+			err = -EMSGSIZE;
+			goto out;
+		}
+
+		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		if (unlikely(!skb)) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		skb_put(skb, len);
+		id = desc.idx;
+		buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err))
+			goto out_store;
+
+		skb->dev = xs->dev;
+		skb->priority = sk->sk_priority;
+		skb->mark = sk->sk_mark;
+		skb_set_queue_mapping(skb, xs->queue_id);
+		skb_shinfo(skb)->destructor_arg = (void *)(long)id;
+		skb->destructor = xsk_destruct_skb;
+
+		err = xsk_xmit_skb(skb);
+		/* Ignore NET_XMIT_CN as packet might have been sent */
+		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
+			err = -EAGAIN;
+			goto out_store;
+		}
+
+		sent_frame = true;
+		xskq_discard_desc(xs->tx);
+	}
+
+	goto out;
+
+out_store:
+	kfree_skb(skb);
+out:
+	if (sent_frame)
+		sk->sk_write_space(sk);
+
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
+static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (unlikely(!xs->dev))
+		return -ENXIO;
+	if (unlikely(!(xs->dev->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	return xsk_generic_xmit(sk, m, total_len);
+}
+
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
 {
@@ -124,6 +266,8 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
 
 	if (xs->rx && !xskq_empty_desc(xs->rx))
 		mask |= POLLIN | POLLRDNORM;
+	if (xs->tx && !xskq_full_desc(xs->tx))
+		mask |= POLLOUT | POLLWRNORM;
 
 	return mask;
 }
@@ -284,6 +428,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->queue_id = sxdp->sxdp_queue_id;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
+	xskq_set_umem(xs->tx, &xs->umem->props);
 
 out_unlock:
 	if (err)
@@ -431,7 +576,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
-	.sendmsg =	sock_no_sendmsg,
+	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 2ae913fb7a09..ea3be9f9e95a 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -109,7 +109,93 @@ static inline void xskq_discard_id(struct xsk_queue *q)
 	(void)xskq_validate_id(q);
 }
 
-/* Rx queue */
+static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	ring->desc[q->prod_tail++ & q->ring_mask] = id;
+
+	/* Order producer and data */
+	smp_wmb();
+
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+	return 0;
+}
+
+static inline int xskq_reserve_id(struct xsk_queue *q)
+{
+	if (xskq_nb_free(q, q->prod_head, 1) == 0)
+		return -ENOSPC;
+
+	q->prod_head++;
+	return 0;
+}
+
+/* Rx/Tx queue */
+
+static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
+{
+	u32 buff_len;
+
+	if (unlikely(d->idx >= q->umem_props.nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	buff_len = q->umem_props.frame_size;
+	if (unlikely(d->len > buff_len || d->len == 0 ||
+		     d->offset > buff_len || d->offset + d->len > buff_len)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	return true;
+}
+
+static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
+						  struct xdp_desc *desc)
+{
+	while (q->cons_tail != q->cons_head) {
+		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+		unsigned int idx = q->cons_tail & q->ring_mask;
+
+		if (xskq_is_valid_desc(q, &ring->desc[idx])) {
+			if (desc)
+				*desc = ring->desc[idx];
+			return desc;
+		}
+
+		q->cons_tail++;
+	}
+
+	return NULL;
+}
+
+static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
+					      struct xdp_desc *desc)
+{
+	struct xdp_rxtx_ring *ring;
+
+	if (q->cons_tail == q->cons_head) {
+		WRITE_ONCE(q->ring->consumer, q->cons_tail);
+		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+		/* Order consumer and data */
+		smp_rmb();
+
+		return xskq_validate_desc(q, desc);
+	}
+
+	ring = (struct xdp_rxtx_ring *)q->ring;
+	*desc = ring->desc[q->cons_tail & q->ring_mask];
+	return desc;
+}
+
+static inline void xskq_discard_desc(struct xsk_queue *q)
+{
+	q->cons_tail++;
+	(void)xskq_validate_desc(q, NULL);
+}
 
 static inline int xskq_produce_batch_desc(struct xsk_queue *q,
 					  u32 id, u32 len, u16 offset)
@@ -137,6 +223,11 @@ static inline void xskq_produce_flush_desc(struct xsk_queue *q)
 	WRITE_ONCE(q->ring->producer, q->prod_tail);
 }
 
+static inline bool xskq_full_desc(struct xsk_queue *q)
+{
+	return (xskq_nb_avail(q, q->nentries) == q->nentries);
+}
+
 static inline bool xskq_empty_desc(struct xsk_queue *q)
 {
 	return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH bpf-next 14/15] xsk: statistics support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (12 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 13/15] xsk: support for Tx Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-24 16:58   ` Willem de Bruijn
  2018-04-23 13:56 ` [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets Björn Töpel
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit, a new getsockopt is added: XDP_STATISTICS. This is
used to obtain stats from the sockets.
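
Usage sketch, assuming fd is an AF_XDP socket; note that the kernel
requires optlen to match sizeof(struct xdp_statistics) exactly:

	struct xdp_statistics stats;
	socklen_t optlen = sizeof(stats);

	if (!getsockopt(fd, SOL_XDP, XDP_STATISTICS, &stats, &optlen))
		printf("rx_dropped: %llu\n",
		       (unsigned long long)stats.rx_dropped);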

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  7 +++++++
 net/xdp/xsk.c               | 42 +++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h         |  5 +++++
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e2ea878d025c..77b88c4efe98 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -38,6 +38,7 @@ struct sockaddr_xdp {
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
+#define XDP_STATISTICS			6
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -46,6 +47,12 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+struct xdp_statistics {
+	__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+};
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_PGOFF_TX_RING		 0x80000000
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 1c0b1ea10453..6d115609f9ed 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -520,6 +520,46 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int len;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case XDP_STATISTICS:
+	{
+		struct xdp_statistics stats;
+
+		if (len != sizeof(stats))
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		stats.rx_dropped = xs->rx_dropped;
+		stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
+		stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
+		mutex_unlock(&xs->mutex);
+
+		if (copy_to_user(optval, &stats, sizeof(stats)))
+			return -EFAULT;
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int xsk_mmap(struct file *file, struct socket *sock,
 		    struct vm_area_struct *vma)
 {
@@ -575,7 +615,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.getsockopt =	xsk_getsockopt,
 	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index ea3be9f9e95a..7686ef355b83 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -36,6 +36,11 @@ struct xsk_queue {
 
 /* Common functions operating for both RXTX and umem queues */
 
+static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
+{
+	return q ? q->invalid_descs : 0;
+}
+
 static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 {
 	u32 entries = q->prod_tail - q->cons_tail;
-- 
2.14.1


* [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (13 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 14/15] xsk: statistics support Björn Töpel
@ 2018-04-23 13:56 ` Björn Töpel
  2018-04-23 23:31   ` Michael S. Tsirkin
  2018-04-23 23:22 ` [PATCH bpf-next 00/15] Introducing AF_XDP support Michael S. Tsirkin
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 13:56 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	Björn Töpel

From: Magnus Karlsson <magnus.karlsson@intel.com>

This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.

To showcase simple round-robin load-balancing between a set of
sockets in an xskmap, set the RR_LB compile-time define to 1 in
"xdpsock.h".

Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/Makefile       |   4 +
 samples/bpf/xdpsock.h      |  11 +
 samples/bpf/xdpsock_kern.c |  56 +++
 samples/bpf/xdpsock_user.c | 947 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1018 insertions(+)
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aa8c392e2e52..d0ddc1abf20d 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
new file mode 100644
index 000000000000..533ab81adfa1
--- /dev/null
+++ b/samples/bpf/xdpsock.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef XDPSOCK_H_
+#define XDPSOCK_H_
+
+/* Power-of-2 number of sockets */
+#define MAX_SOCKS 4
+
+/* Round-robin receive */
+#define RR_LB 0
+
+#endif /* XDPSOCK_H_ */
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..d8806c41362e
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#include "xdpsock.h"
+
+struct bpf_map_def SEC("maps") qidconf_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(int),
+	.max_entries	= 1,
+};
+
+struct bpf_map_def SEC("maps") xsks_map = {
+	.type = BPF_MAP_TYPE_XSKMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") rr_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(unsigned int),
+	.max_entries = 1,
+};
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	int *qidconf, key = 0, idx;
+	unsigned int *rr;
+
+	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
+	if (!qidconf)
+		return XDP_ABORTED;
+
+	if (*qidconf != ctx->rx_queue_index)
+		return XDP_PASS;
+
+#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
+	rr = bpf_map_lookup_elem(&rr_map, &key);
+	if (!rr)
+		return XDP_ABORTED;
+
+	*rr = (*rr + 1) & (MAX_SOCKS - 1);
+	idx = *rr;
+#else
+	idx = 0;
+#endif
+
+	return bpf_redirect_map(&xsks_map, idx, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
new file mode 100644
index 000000000000..690bac1a0ab7
--- /dev/null
+++ b/samples/bpf/xdpsock_user.c
@@ -0,0 +1,947 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2017 - 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/ethernet.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+#include <sys/types.h>
+#include <poll.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define NUM_FRAMES 131072
+#define FRAME_HEADROOM 0
+#define FRAME_SIZE 2048
+#define NUM_DESCS 1024
+#define BATCH_SIZE 16
+
+#define FQ_NUM_DESCS 1024
+#define CQ_NUM_DESCS 1024
+
+#define DEBUG_HEXDUMP 0
+
+typedef __u32 u32;
+
+static unsigned long prev_time;
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static u32 opt_xdp_flags;
+static const char *opt_if = "";
+static int opt_ifindex;
+static int opt_queue;
+static int opt_poll;
+static int opt_shared_packet_buffer;
+static int opt_interval = 1;
+
+struct xdp_umem_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_umem_ring *ring;
+};
+
+struct xdp_umem {
+	char (*frames)[FRAME_SIZE];
+	struct xdp_umem_uqueue fq;
+	struct xdp_umem_uqueue cq;
+	int fd;
+};
+
+struct xdp_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_rxtx_ring *ring;
+};
+
+struct xdpsock {
+	struct xdp_uqueue rx;
+	struct xdp_uqueue tx;
+	int sfd;
+	struct xdp_umem *umem;
+	u32 outstanding_tx;
+	unsigned long rx_npkts;
+	unsigned long tx_npkts;
+	unsigned long prev_rx_npkts;
+	unsigned long prev_tx_npkts;
+};
+
+#define MAX_SOCKS 4
+static int num_socks;
+struct xdpsock *xsks[MAX_SOCKS];
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
+static void dump_stats(void);
+
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
+				#expr ": errno: %d/\"%s\"\n",		\
+				__FILE__, __func__, __LINE__,		\
+				errno, strerror(errno));		\
+			dump_stats();					\
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+#define u_smp_rmb() barrier()
+#define u_smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+static const char pkt_data[] =
+	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+
+	if (free_entries >= nb)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer;
+
+	return q->size - (q->cached_prod - q->cached_cons);
+}
+
+static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 free_entries = q->cached_cons - q->cached_prod;
+
+	if (free_entries >= ndescs)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
+	return q->cached_cons - q->cached_prod;
+}
+
+static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0)
+		q->cached_prod = q->ring->ptrs.producer;
+
+	entries = q->cached_prod - q->cached_cons;
+
+	return (entries > nb) ? nb : entries;
+}
+
+static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0)
+		q->cached_prod = q->ring->ptrs.producer;
+
+	entries = q->cached_prod - q->cached_cons;
+	return (entries > ndescs) ? ndescs : entries;
+}
+
+static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
+					 struct xdp_desc *d,
+					 size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i].idx;
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
+				      size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i];
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
+					       u32 *d, size_t nb)
+{
+	u32 idx, i, entries = umem_nb_avail(cq, nb);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = cq->cached_cons++ & cq->mask;
+		d[i] = cq->ring->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		cq->ring->ptrs.consumer = cq->cached_cons;
+	}
+
+	return entries;
+}
+
+static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
+{
+	lassert(idx < NUM_FRAMES);
+	return &xsk->umem->frames[idx][off];
+}
+
+static inline int xq_enq(struct xdp_uqueue *uq,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 idx = uq->cached_prod++ & uq->mask;
+
+		r->desc[idx].idx = descs[i].idx;
+		r->desc[idx].len = descs[i].len;
+		r->desc[idx].offset = descs[i].offset;
+	}
+
+	u_smp_wmb();
+
+	r->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
+				 __u32 idx, unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *q = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 slot = uq->cached_prod++ & uq->mask;
+
+		q->desc[slot].idx	= idx + i;
+		q->desc[slot].len	= sizeof(pkt_data) - 1;
+		q->desc[slot].offset	= 0;
+	}
+
+	u_smp_wmb();
+
+	q->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_uqueue *uq,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int idx;
+	int i, entries;
+
+	entries = xq_nb_avail(uq, ndescs);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = uq->cached_cons++ & uq->mask;
+		descs[i] = r->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		r->ptrs.consumer = uq->cached_cons;
+	}
+
+	return entries;
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+#if DEBUG_HEXDUMP
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("length = %zu\n", length);
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame)
+{
+	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
+	return sizeof(pkt_data) - 1;
+}
+
+static struct xdp_umem *xdp_umem_configure(int sfd)
+{
+	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
+	struct xdp_umem_reg mr;
+	struct xdp_umem *umem;
+	void *bufs;
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+
+	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
+			       NUM_FRAMES * FRAME_SIZE) == 0);
+
+	mr.addr = (__u64)bufs;
+	mr.len = NUM_FRAMES * FRAME_SIZE;
+	mr.frame_size = FRAME_SIZE;
+	mr.frame_headroom = FRAME_HEADROOM;
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
+			   sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
+			   sizeof(int)) == 0);
+
+	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     FQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_FILL_RING);
+	lassert(umem->fq.ring != MAP_FAILED);
+
+	umem->fq.mask = FQ_NUM_DESCS - 1;
+	umem->fq.size = FQ_NUM_DESCS;
+
+	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     CQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_COMPLETION_RING);
+	lassert(umem->cq.ring != MAP_FAILED);
+
+	umem->cq.mask = CQ_NUM_DESCS - 1;
+	umem->cq.size = CQ_NUM_DESCS;
+
+	umem->frames = (char (*)[FRAME_SIZE])bufs;
+	umem->fd = sfd;
+
+	if (opt_bench == BENCH_TXONLY) {
+		int i;
+
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(&umem->frames[i][0]);
+	}
+
+	return umem;
+}
+
+static struct xdpsock *xsk_configure(struct xdp_umem *umem)
+{
+	struct sockaddr_xdp sxdp = {};
+	int sfd, ndescs = NUM_DESCS;
+	struct xdpsock *xsk;
+	bool shared = true;
+	u32 i;
+
+	sfd = socket(PF_XDP, SOCK_RAW, 0);
+	lassert(sfd >= 0);
+
+	xsk = calloc(1, sizeof(*xsk));
+	lassert(xsk);
+
+	xsk->sfd = sfd;
+	xsk->outstanding_tx = 0;
+
+	if (!umem) {
+		shared = false;
+		xsk->umem = xdp_umem_configure(sfd);
+	} else {
+		xsk->umem = umem;
+	}
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
+			   &ndescs, sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
+			   &ndescs, sizeof(int)) == 0);
+
+	/* Rx */
+	xsk->rx.ring = mmap(NULL,
+			    sizeof(struct xdp_ring) +
+			    NUM_DESCS * sizeof(struct xdp_desc),
+			    PROT_READ | PROT_WRITE,
+			    MAP_SHARED | MAP_POPULATE, sfd,
+			    XDP_PGOFF_RX_RING);
+	lassert(xsk->rx.ring != MAP_FAILED);
+
+	if (!shared) {
+		for (i = 0; i < NUM_DESCS / 2; i++)
+			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
+				== 0);
+	}
+
+	/* Tx */
+	xsk->tx.ring = mmap(NULL,
+			 sizeof(struct xdp_ring) +
+			 NUM_DESCS * sizeof(struct xdp_desc),
+			 PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_POPULATE, sfd,
+			 XDP_PGOFF_TX_RING);
+	lassert(xsk->tx.ring != MAP_FAILED);
+
+	xsk->rx.mask = NUM_DESCS - 1;
+	xsk->rx.size = NUM_DESCS;
+
+	xsk->tx.mask = NUM_DESCS - 1;
+	xsk->tx.size = NUM_DESCS;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = opt_ifindex;
+	sxdp.sxdp_queue_id = opt_queue;
+	if (shared) {
+		sxdp.sxdp_flags = XDP_SHARED_UMEM;
+		sxdp.sxdp_shared_umem_fd = umem->fd;
+	}
+
+	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
+
+	return xsk;
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
+	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
+		printf("xdp-skb ");
+	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
+		printf("xdp-drv ");
+	else
+		printf("	");
+
+	if (opt_poll)
+		printf("poll() ");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void dump_stats(void)
+{
+	unsigned long now = get_nsecs();
+	long dt = now - prev_time;
+	int i;
+
+	prev_time = now;
+
+	for (i = 0; i < num_socks; i++) {
+		char *fmt = "%-15s %'-11.0f %'-11lu\n";
+		double rx_pps, tx_pps;
+
+		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
+			 1000000000. / dt;
+		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
+			 1000000000. / dt;
+
+		printf("\n sock%d@", i);
+		print_benchmark(false);
+		printf("\n");
+
+		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
+		       dt / 1000000000.);
+		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
+		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
+
+		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
+		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
+	}
+}
+
+static void *poller(void *arg)
+{
+	(void)arg;
+	for (;;) {
+		sleep(opt_interval);
+		dump_stats();
+	}
+
+	return NULL;
+}
+
+static void int_exit(int sig)
+{
+	(void)sig;
+	dump_stats();
+	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
+	exit(EXIT_SUCCESS);
+}
+
+static struct option long_options[] = {
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"interface", required_argument, 0, 'i'},
+	{"queue", required_argument, 0, 'q'},
+	{"poll", no_argument, 0, 'p'},
+	{"shared-buffer", no_argument, 0, 's'},
+	{"xdp-skb", no_argument, 0, 'S'},
+	{"xdp-native", no_argument, 0, 'N'},
+	{"interval", required_argument, 0, 'n'},
+	{0, 0, 0, 0}
+};
+
+static void usage(const char *prog)
+{
+	const char *str =
+		"  Usage: %s [OPTIONS]\n"
+		"  Options:\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"  -q, --queue=n	Use queue n (default 0)\n"
+		"  -p, --poll		Use poll syscall\n"
+		"  -s, --shared-buffer	Use shared packet buffer\n"
+		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
+		"  -N, --xdp-native=n	Enfore XDP native mode\n"
+		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
+		"\n";
+	fprintf(stderr, str, prog);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		case 'q':
+			opt_queue = atoi(optarg);
+			break;
+		case 's':
+			opt_shared_packet_buffer = 1;
+			break;
+		case 'p':
+			opt_poll = 1;
+			break;
+		case 'S':
+			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
+			break;
+		case 'n':
+			opt_interval = atoi(optarg);
+			break;
+		default:
+			usage(basename(argv[0]));
+		}
+	}
+
+	opt_ifindex = if_nametoindex(opt_if);
+	if (!opt_ifindex) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
+			opt_if);
+		usage(basename(argv[0]));
+	}
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
+		return;
+	lassert(0);
+}
+
+static inline void complete_tx_l2fwd(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+	size_t ndescs;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		 xsk->outstanding_tx;
+
+	/* re-add completed Tx buffers */
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
+	if (rcvd > 0) {
+		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static inline void complete_tx_only(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
+	if (rcvd > 0) {
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static void rx_drop(struct xdpsock *xsk)
+{
+	struct xdp_desc descs[BATCH_SIZE];
+	unsigned int rcvd, i;
+
+	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+	if (!rcvd)
+		return;
+
+	for (i = 0; i < rcvd; i++) {
+		u32 idx = descs[i].idx;
+
+		lassert(idx < NUM_FRAMES);
+#if DEBUG_HEXDUMP
+		char *pkt;
+		char buf[32];
+
+		pkt = xq_get_data(xsk, idx, descs[i].offset);
+		sprintf(buf, "idx=%d", idx);
+		hex_dump(pkt, descs[i].len, buf);
+#endif
+	}
+
+	xsk->rx_npkts += rcvd;
+
+	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
+}
+
+static void rx_drop_all(void)
+{
+	struct pollfd fds[MAX_SOCKS + 1];
+	int i, ret, timeout, nfds = 1;
+
+	memset(fds, 0, sizeof(fds));
+
+	for (i = 0; i < num_socks; i++) {
+		fds[i].fd = xsks[i]->sfd;
+		fds[i].events = POLLIN;
+		timeout = 1000; /* 1 second */
+	}
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+		}
+
+		for (i = 0; i < num_socks; i++)
+			rx_drop(xsks[i]);
+	}
+}
+
+static void tx_only(struct xdpsock *xsk)
+{
+	int timeout, ret, nfds = 1;
+	struct pollfd fds[nfds + 1];
+	unsigned int idx = 0;
+
+	memset(fds, 0, sizeof(fds));
+	fds[0].fd = xsk->sfd;
+	fds[0].events = POLLOUT;
+	timeout = 1000; /* 1 second */
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+
+			if (fds[0].fd != xsk->sfd ||
+			    !(fds[0].revents & POLLOUT))
+				continue;
+		}
+
+		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
+			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
+
+			xsk->outstanding_tx += BATCH_SIZE;
+			idx += BATCH_SIZE;
+			idx %= NUM_FRAMES;
+		}
+
+		complete_tx_only(xsk);
+	}
+}
+
+static void l2fwd(struct xdpsock *xsk)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			complete_tx_l2fwd(xsk);
+
+			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			char *pkt = xq_get_data(xsk, descs[i].idx,
+						descs[i].offset);
+
+			swap_mac_addresses(pkt);
+#if DEBUG_HEXDUMP
+			char buf[32];
+			u32 idx = descs[i].idx;
+
+			sprintf(buf, "idx=%d", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		xsk->rx_npkts += rcvd;
+
+		ret = xq_enq(&xsk->tx, descs, rcvd);
+		lassert(ret == 0);
+		xsk->outstanding_tx += rcvd;
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char xdp_filename[256];
+	int i, ret, key = 0;
+	pthread_t pt;
+
+	parse_command_line(argc, argv);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(xdp_filename)) {
+		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
+		fprintf(stderr, "ERROR: link set xdp fd failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
+	if (ret) {
+		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Create sockets... */
+	xsks[num_socks++] = xsk_configure(NULL);
+
+#if RR_LB
+	for (i = 0; i < MAX_SOCKS - 1; i++)
+		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
+#endif
+
+	/* ...and insert them into the map. */
+	for (i = 0; i < num_socks; i++) {
+		key = i;
+		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
+		if (ret) {
+			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+	signal(SIGABRT, int_exit);
+
+	setlocale(LC_ALL, "");
+
+	ret = pthread_create(&pt, NULL, poller, NULL);
+	lassert(ret == 0);
+
+	prev_time = get_nsecs();
+
+	if (opt_bench == BENCH_RXDROP)
+		rx_drop_all();
+	else if (opt_bench == BENCH_TXONLY)
+		tx_only(xsks[0]);
+	else
+		l2fwd(xsks[0]);
+
+	return 0;
+}
-- 
2.14.1


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 13:56 ` [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt Björn Töpel
@ 2018-04-23 16:18   ` Michael S. Tsirkin
  2018-04-23 20:00     ` Björn Töpel
  2018-04-23 23:04   ` Willem de Bruijn
  2018-04-24 14:27   ` kbuild test robot
  2 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 16:18 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang

On Mon, Apr 23, 2018 at 03:56:06PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> In this commit the base structure of the AF_XDP address family is set
> up. Further, we introduce the ability to register a window of user memory
> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
> window is viewed by an AF_XDP socket as a set of equally large
> frames. After a user memory registration all frames are "owned" by the
> user application, and not the kernel.
> 
> Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/uapi/linux/if_xdp.h |  34 +++++++
>  net/Makefile                |   1 +
>  net/xdp/Makefile            |   2 +
>  net/xdp/xdp_umem.c          | 237 ++++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xdp_umem.h          |  42 ++++++++
>  net/xdp/xdp_umem_props.h    |  23 +++++
>  net/xdp/xsk.c               | 223 +++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 562 insertions(+)
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> new file mode 100644
> index 000000000000..41252135a0fe
> --- /dev/null
> +++ b/include/uapi/linux/if_xdp.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
> + *
> + * if_xdp: XDP socket user-space interface
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * Author(s): Björn Töpel <bjorn.topel@intel.com>
> + *	      Magnus Karlsson <magnus.karlsson@intel.com>
> + */
> +
> +#ifndef _LINUX_IF_XDP_H
> +#define _LINUX_IF_XDP_H
> +
> +#include <linux/types.h>
> +
> +/* XDP socket options */
> +#define XDP_UMEM_REG			3
> +
> +struct xdp_umem_reg {
> +	__u64 addr; /* Start of packet data area */
> +	__u64 len; /* Length of packet data area */
> +	__u32 frame_size; /* Frame size */
> +	__u32 frame_headroom; /* Frame head room */
> +};
> +
> +#endif /* _LINUX_IF_XDP_H */
> diff --git a/net/Makefile b/net/Makefile
> index a6147c61b174..77aaddedbd29 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -85,3 +85,4 @@ obj-y				+= l3mdev/
>  endif
>  obj-$(CONFIG_QRTR)		+= qrtr/
>  obj-$(CONFIG_NET_NCSI)		+= ncsi/
> +obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> new file mode 100644
> index 000000000000..a5d736640a0f
> --- /dev/null
> +++ b/net/xdp/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> new file mode 100644
> index 000000000000..bff058f5a769
> --- /dev/null
> +++ b/net/xdp/xdp_umem.c
> @@ -0,0 +1,237 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP user-space packet buffer
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/task.h>
> +#include <linux/uaccess.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/mm.h>
> +
> +#include "xdp_umem.h"
> +
> +#define XDP_UMEM_MIN_FRAME_SIZE 2048
> +
> +int xdp_umem_create(struct xdp_umem **umem)
> +{
> +	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
> +
> +	if (!(*umem))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> +{
> +	unsigned int i;
> +
> +	if (umem->pgs) {
> +		for (i = 0; i < umem->npgs; i++)

Since you pin them with FOLL_WRITE, I assume these pages
are written to.
Don't you need set_page_dirty_lock here?
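
I.e., something along these lines (sketch):

	set_page_dirty_lock(umem->pgs[i]);
	put_page(umem->pgs[i]);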

> +			put_page(umem->pgs[i]);
> +
> +		kfree(umem->pgs);
> +		umem->pgs = NULL;
> +	}
> +}
> +
> +static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
> +{
> +	if (umem->user) {
> +		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
> +		free_uid(umem->user);
> +	}
> +}
> +
> +static void xdp_umem_release(struct xdp_umem *umem)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned long diff;
> +
> +	if (umem->pgs) {
> +		xdp_umem_unpin_pages(umem);
> +
> +		task = get_pid_task(umem->pid, PIDTYPE_PID);
> +		put_pid(umem->pid);
> +		if (!task)
> +			goto out;
> +		mm = get_task_mm(task);
> +		put_task_struct(task);
> +		if (!mm)
> +			goto out;
> +
> +		diff = umem->size >> PAGE_SHIFT;
> +
> +		down_write(&mm->mmap_sem);
> +		mm->pinned_vm -= diff;
> +		up_write(&mm->mmap_sem);
> +		mmput(mm);
> +		umem->pgs = NULL;
> +	}
> +
> +	xdp_umem_unaccount_pages(umem);
> +out:
> +	kfree(umem);
> +}
> +
> +void xdp_put_umem(struct xdp_umem *umem)
> +{
> +	if (!umem)
> +		return;
> +
> +	if (atomic_dec_and_test(&umem->users))
> +		xdp_umem_release(umem);
> +}
> +
> +static int xdp_umem_pin_pages(struct xdp_umem *umem)
> +{
> +	unsigned int gup_flags = FOLL_WRITE;
> +	long npgs;
> +	int err;
> +
> +	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
> +	if (!umem->pgs)
> +		return -ENOMEM;
> +
> +	npgs = get_user_pages(umem->address, umem->npgs,
> +			      gup_flags, &umem->pgs[0], NULL);
> +	if (npgs != umem->npgs) {
> +		if (npgs >= 0) {
> +			umem->npgs = npgs;
> +			err = -ENOMEM;
> +			goto out_pin;
> +		}
> +		err = npgs;
> +		goto out_pgs;
> +	}
> +	return 0;
> +
> +out_pin:
> +	xdp_umem_unpin_pages(umem);
> +out_pgs:
> +	kfree(umem->pgs);
> +	umem->pgs = NULL;
> +	return err;
> +}
> +
> +static int xdp_umem_account_pages(struct xdp_umem *umem)
> +{
> +	unsigned long lock_limit, new_npgs, old_npgs;
> +
> +	if (capable(CAP_IPC_LOCK))
> +		return 0;
> +
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	umem->user = get_uid(current_user());
> +
> +	do {
> +		old_npgs = atomic_long_read(&umem->user->locked_vm);
> +		new_npgs = old_npgs + umem->npgs;
> +		if (new_npgs > lock_limit) {
> +			free_uid(umem->user);
> +			umem->user = NULL;
> +			return -ENOBUFS;
> +		}
> +	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
> +				     new_npgs) != old_npgs);
> +	return 0;
> +}
> +
> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> +	u64 addr = mr->addr, size = mr->len;
> +	unsigned int nframes;
> +	int size_chk, err;
> +
> +	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +		/* Strictly speaking we could support this, if:
> +		 * - huge pages, or*

what does "or*" here mean?

> +		 * - using an IOMMU, or
> +		 * - making sure the memory area is consecutive
> +		 * but for now, we simply say "computer says no".
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(frame_size))
> +		return -EINVAL;
> +
> +	if (!PAGE_ALIGNED(addr)) {
> +		/* Memory area has to be page size aligned. For
> +		 * simplicity, this might change.
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if ((addr + size) < addr)
> +		return -EINVAL;
> +
> +	nframes = size / frame_size;
> +	if (nframes == 0 || nframes > UINT_MAX)
> +		return -EINVAL;
> +
> +	frame_headroom = ALIGN(frame_headroom, 64);
> +
> +	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
> +	if (size_chk < 0)
> +		return -EINVAL;
> +
> +	umem->pid = get_task_pid(current, PIDTYPE_PID);
> +	umem->size = (size_t)size;
> +	umem->address = (unsigned long)addr;
> +	umem->props.frame_size = frame_size;
> +	umem->props.nframes = nframes;
> +	umem->frame_headroom = frame_headroom;
> +	umem->npgs = size / PAGE_SIZE;
> +	umem->pgs = NULL;
> +	umem->user = NULL;
> +
> +	umem->frame_size_log2 = ilog2(frame_size);
> +	umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
> +	umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
> +	atomic_set(&umem->users, 1);
> +
> +	err = xdp_umem_account_pages(umem);
> +	if (err)
> +		goto out;
> +
> +	err = xdp_umem_pin_pages(umem);
> +	if (err)
> +		goto out;
> +	return 0;
> +
> +out:
> +	put_pid(umem->pid);
> +	return err;
> +}
> +
> +int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +	int err;
> +
> +	if (!umem)
> +		return -EINVAL;
> +
> +	down_write(&current->mm->mmap_sem);
> +
> +	err = __xdp_umem_reg(umem, mr);
> +
> +	up_write(&current->mm->mmap_sem);
> +	return err;
> +}
> +
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> new file mode 100644
> index 000000000000..58714f4f7f25
> --- /dev/null
> +++ b/net/xdp/xdp_umem.h
> @@ -0,0 +1,42 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space packet buffer
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef XDP_UMEM_H_
> +#define XDP_UMEM_H_
> +
> +#include <linux/mm.h>
> +#include <linux/if_xdp.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xdp_umem {
> +	struct page **pgs;
> +	struct xdp_umem_props props;
> +	u32 npgs;
> +	u32 frame_headroom;
> +	u32 nfpp_mask;
> +	u32 nfpplog2;
> +	u32 frame_size_log2;
> +	struct user_struct *user;
> +	struct pid *pid;
> +	unsigned long address;
> +	size_t size;
> +	atomic_t users;
> +};
> +
> +int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
> +void xdp_put_umem(struct xdp_umem *umem);
> +int xdp_umem_create(struct xdp_umem **umem);
> +
> +#endif /* XDP_UMEM_H_ */
> diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
> new file mode 100644
> index 000000000000..77fb5daf29f3
> --- /dev/null
> +++ b/net/xdp/xdp_umem_props.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space packet buffer
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef XDP_UMEM_PROPS_H_
> +#define XDP_UMEM_PROPS_H_
> +
> +struct xdp_umem_props {
> +	u32 frame_size;
> +	u32 nframes;
> +};
> +
> +#endif /* XDP_UMEM_PROPS_H_ */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> new file mode 100644
> index 000000000000..19fc719cbe0d
> --- /dev/null
> +++ b/net/xdp/xsk.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP sockets
> + *
> + * AF_XDP sockets allows a channel between XDP programs and userspace
> + * applications.
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * Author(s): Björn Töpel <bjorn.topel@intel.com>
> + *	      Magnus Karlsson <magnus.karlsson@intel.com>
> + */
> +
> +#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
> +
> +#include <linux/if_xdp.h>
> +#include <linux/init.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/task.h>
> +#include <linux/socket.h>
> +#include <linux/file.h>
> +#include <linux/uaccess.h>
> +#include <linux/net.h>
> +#include <linux/netdevice.h>
> +#include <net/sock.h>
> +
> +#include "xdp_umem.h"
> +
> +struct xdp_sock {
> +	/* struct sock must be the first member of struct xdp_sock */
> +	struct sock sk;
> +	struct xdp_umem *umem;
> +	/* Protects multiple processes in the control path */
> +	struct mutex mutex;
> +};
> +
> +static struct xdp_sock *xdp_sk(struct sock *sk)
> +{
> +	return (struct xdp_sock *)sk;
> +}
> +
> +static int xsk_release(struct socket *sock)
> +{
> +	struct sock *sk = sock->sk;
> +	struct net *net;
> +
> +	if (!sk)
> +		return 0;
> +
> +	net = sock_net(sk);
> +
> +	local_bh_disable();
> +	sock_prot_inuse_add(net, sk->sk_prot, -1);
> +	local_bh_enable();
> +
> +	sock_orphan(sk);
> +	sock->sk = NULL;
> +
> +	sk_refcnt_debug_release(sk);
> +	sock_put(sk);
> +
> +	return 0;
> +}
> +
> +static int xsk_setsockopt(struct socket *sock, int level, int optname,
> +			  char __user *optval, unsigned int optlen)
> +{
> +	struct sock *sk = sock->sk;
> +	struct xdp_sock *xs = xdp_sk(sk);
> +	int err;
> +
> +	if (level != SOL_XDP)
> +		return -ENOPROTOOPT;
> +
> +	switch (optname) {
> +	case XDP_UMEM_REG:
> +	{
> +		struct xdp_umem_reg mr;
> +		struct xdp_umem *umem;
> +
> +		if (xs->umem)
> +			return -EBUSY;
> +
> +		if (copy_from_user(&mr, optval, sizeof(mr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		err = xdp_umem_create(&umem);
> +
> +		err = xdp_umem_reg(umem, &mr);
> +		if (err) {
> +			kfree(umem);
> +			mutex_unlock(&xs->mutex);
> +			return err;
> +		}
> +
> +		/* Make sure umem is ready before it can be seen by others */
> +		smp_wmb();
> +
> +		xs->umem = umem;
> +		mutex_unlock(&xs->mutex);
> +		return 0;
> +	}
> +	default:
> +		break;
> +	}
> +
> +	return -ENOPROTOOPT;
> +}
> +
> +static struct proto xsk_proto = {
> +	.name =		"XDP",
> +	.owner =	THIS_MODULE,
> +	.obj_size =	sizeof(struct xdp_sock),
> +};
> +
> +static const struct proto_ops xsk_proto_ops = {
> +	.family =	PF_XDP,
> +	.owner =	THIS_MODULE,
> +	.release =	xsk_release,
> +	.bind =		sock_no_bind,
> +	.connect =	sock_no_connect,
> +	.socketpair =	sock_no_socketpair,
> +	.accept =	sock_no_accept,
> +	.getname =	sock_no_getname,
> +	.poll =		sock_no_poll,
> +	.ioctl =	sock_no_ioctl,
> +	.listen =	sock_no_listen,
> +	.shutdown =	sock_no_shutdown,
> +	.setsockopt =	xsk_setsockopt,
> +	.getsockopt =	sock_no_getsockopt,
> +	.sendmsg =	sock_no_sendmsg,
> +	.recvmsg =	sock_no_recvmsg,
> +	.mmap =		sock_no_mmap,
> +	.sendpage =	sock_no_sendpage,
> +};
> +
> +static void xsk_destruct(struct sock *sk)
> +{
> +	struct xdp_sock *xs = xdp_sk(sk);
> +
> +	if (!sock_flag(sk, SOCK_DEAD))
> +		return;
> +
> +	xdp_put_umem(xs->umem);
> +
> +	sk_refcnt_debug_dec(sk);
> +}
> +
> +static int xsk_create(struct net *net, struct socket *sock, int protocol,
> +		      int kern)
> +{
> +	struct sock *sk;
> +	struct xdp_sock *xs;
> +
> +	if (!ns_capable(net->user_ns, CAP_NET_RAW))
> +		return -EPERM;
> +	if (sock->type != SOCK_RAW)
> +		return -ESOCKTNOSUPPORT;
> +
> +	if (protocol)
> +		return -EPROTONOSUPPORT;
> +
> +	sock->state = SS_UNCONNECTED;
> +
> +	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
> +	if (!sk)
> +		return -ENOBUFS;
> +
> +	sock->ops = &xsk_proto_ops;
> +
> +	sock_init_data(sock, sk);
> +
> +	sk->sk_family = PF_XDP;
> +
> +	sk->sk_destruct = xsk_destruct;
> +	sk_refcnt_debug_inc(sk);
> +
> +	xs = xdp_sk(sk);
> +	mutex_init(&xs->mutex);
> +
> +	local_bh_disable();
> +	sock_prot_inuse_add(net, &xsk_proto, 1);
> +	local_bh_enable();
> +
> +	return 0;
> +}
> +
> +static const struct net_proto_family xsk_family_ops = {
> +	.family = PF_XDP,
> +	.create = xsk_create,
> +	.owner	= THIS_MODULE,
> +};
> +
> +static int __init xsk_init(void)
> +{
> +	int err;
> +
> +	err = proto_register(&xsk_proto, 0 /* no slab */);
> +	if (err)
> +		goto out;
> +
> +	err = sock_register(&xsk_family_ops);
> +	if (err)
> +		goto out_proto;
> +
> +	return 0;
> +
> +out_proto:
> +	proto_unregister(&xsk_proto);
> +out:
> +	return err;
> +}
> +
> +fs_initcall(xsk_init);
> -- 
> 2.14.1


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 16:18   ` Michael S. Tsirkin
@ 2018-04-23 20:00     ` Björn Töpel
  2018-04-23 20:11       ` Michael S. Tsirkin
  0 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:

[...]

>> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> +{
>> +     unsigned int i;
>> +
>> +     if (umem->pgs) {
>> +             for (i = 0; i < umem->npgs; i++)
>
> Since you pin them with FOLL_WRITE, I assume these pages
> are written to.
> Don't you need set_page_dirty_lock here?
>

Hmm, I actually *removed* it from the RFC V2, but after doing some
homework, I think you're right. Thanks for pointing this out!

Thinking more about this: this function is called from sk_destruct,
and in the Tx case the sk_destruct can be called from interrupt
context, where set_page_dirty_lock cannot be called.

Are there any preferred ways of solving this? Scheduling the whole
xsk_destruct call to a workqueue is one way (I think). Any
cleaner/better way?
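
Something like the below is what I have in mind, assuming a
work_struct member is added to struct xdp_umem (sketch only,
untested):

	static void xdp_umem_release_defer(struct work_struct *work)
	{
		struct xdp_umem *umem = container_of(work, struct xdp_umem,
						     work);

		xdp_umem_release(umem);
	}

and then have xdp_put_umem() do INIT_WORK() plus schedule_work()
instead of calling xdp_umem_release() directly.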

[...]

>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> +     u64 addr = mr->addr, size = mr->len;
>> +     unsigned int nframes;
>> +     int size_chk, err;
>> +
>> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +             /* Strictly speaking we could support this, if:
>> +              * - huge pages, or*
>
> what does "or*" here mean?
>

Oops, I'll change to just 'or' in the next revision.


Thanks!
Björn


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 20:00     ` Björn Töpel
@ 2018-04-23 20:11       ` Michael S. Tsirkin
  2018-04-23 20:15         ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> 
> [...]
> 
> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> +{
> >> +     unsigned int i;
> >> +
> >> +     if (umem->pgs) {
> >> +             for (i = 0; i < umem->npgs; i++)
> >
> > Since you pin them with FOLL_WRITE, I assume these pages
> > are written to.
> > Don't you need set_page_dirty_lock here?
> >
> 
> Hmm, I actually *removed* it from the RFC V2, but after doing some
> homework, I think you're right. Thanks for pointing this out!
> 
> Thinking more about this; This function is called from sk_destruct,
> and in the Tx case the sk_destruct can be called from interrupt
> context, where set_page_dirty_lock cannot be called.
> 
> Are there any preferred ways of solving this? Scheduling the whole
> xsk_destruct call to a workqueue is one way (I think). Any
> cleaner/better way?
> 
> [...]

Defer unpinning pages until the next tx call?


> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> +{
> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> +     u64 addr = mr->addr, size = mr->len;
> >> +     unsigned int nframes;
> >> +     int size_chk, err;
> >> +
> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> +             /* Strictly speaking we could support this, if:
> >> +              * - huge pages, or*
> >
> > what does "or*" here mean?
> >
> 
> Oops, I'll change to just 'or' in the next revision.
> 
> 
> Thanks!
> Björn


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 20:11       ` Michael S. Tsirkin
@ 2018-04-23 20:15         ` Björn Töpel
  2018-04-23 20:26           ` Michael S. Tsirkin
  0 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-23 20:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
>> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>>
>> [...]
>>
>> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> >> +{
>> >> +     unsigned int i;
>> >> +
>> >> +     if (umem->pgs) {
>> >> +             for (i = 0; i < umem->npgs; i++)
>> >
>> > Since you pin them with FOLL_WRITE, I assume these pages
>> > are written to.
>> > Don't you need set_page_dirty_lock here?
>> >
>>
>> Hmm, I actually *removed* it from the RFC V2, but after doing some
>> homework, I think you're right. Thanks for pointing this out!
>>
>> Thinking more about this; This function is called from sk_destruct,
>> and in the Tx case the sk_destruct can be called from interrupt
>> context, where set_page_dirty_lock cannot be called.
>>
>> Are there any preferred ways of solving this? Scheduling the whole
>> xsk_destruct call to a workqueue is one way (I think). Any
>> cleaner/better way?
>>
>> [...]
>
> Defer unpinning pages until the next tx call?
>

If the sock is released, there won't be another tx call. Or am I
missing something obvious?

>
>> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> >> +{
>> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> >> +     u64 addr = mr->addr, size = mr->len;
>> >> +     unsigned int nframes;
>> >> +     int size_chk, err;
>> >> +
>> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> >> +             /* Strictly speaking we could support this, if:
>> >> +              * - huge pages, or*
>> >
>> > what does "or*" here mean?
>> >
>>
>> Oops, I'll change to just 'or' in the next revision.
>>
>>
>> Thanks!
>> Björn


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 20:15         ` Björn Töpel
@ 2018-04-23 20:26           ` Michael S. Tsirkin
  2018-04-24  7:01             ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 20:26 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Mon, Apr 23, 2018 at 10:15:18PM +0200, Björn Töpel wrote:
> 2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> > On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> >> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> >>
> >> [...]
> >>
> >> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> >> +{
> >> >> +     unsigned int i;
> >> >> +
> >> >> +     if (umem->pgs) {
> >> >> +             for (i = 0; i < umem->npgs; i++)
> >> >
> >> > Since you pin them with FOLL_WRITE, I assume these pages
> >> > are written to.
> >> > Don't you need set_page_dirty_lock here?
> >> >
> >>
> >> Hmm, I actually *removed* it from the RFC V2, but after doing some
> >> homework, I think you're right. Thanks for pointing this out!
> >>
> >> Thinking more about this; This function is called from sk_destruct,
> >> and in the Tx case the sk_destruct can be called from interrupt
> >> context, where set_page_dirty_lock cannot be called.
> >>
> >> Are there any preferred ways of solving this? Scheduling the whole
> >> xsk_destruct call to a workqueue is one way (I think). Any
> >> cleaner/better way?
> >>
> >> [...]
> >
> > Defer unpinning pages until the next tx call?
> >
> 
> If the sock is released, there won't be another tx call.

unpin them on socket release too?

> Or am I
> missing something obvious?
> 
> >
> >> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> >> +{
> >> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> >> +     u64 addr = mr->addr, size = mr->len;
> >> >> +     unsigned int nframes;
> >> >> +     int size_chk, err;
> >> >> +
> >> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> >> +             /* Strictly speaking we could support this, if:
> >> >> +              * - huge pages, or*
> >> >
> >> > what does "or*" here mean?
> >> >
> >>
> >> Oops, I'll change to just 'or' in the next revision.
> >>
> >>
> >> Thanks!
> >> Björn


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 13:56 ` [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt Björn Töpel
  2018-04-23 16:18   ` Michael S. Tsirkin
@ 2018-04-23 23:04   ` Willem de Bruijn
  2018-04-24  7:30     ` Björn Töpel
  2018-04-24 14:27   ` kbuild test robot
  2 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-23 23:04 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> In this commit the base structure of the AF_XDP address family is set
> up. Further, we introduce the ability to register a window of user memory
> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
> window is viewed by an AF_XDP socket as a set of equally large
> frames. After a user memory registration all frames are "owned" by the
> user application, and not the kernel.
>
> Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

> +static void xdp_umem_release(struct xdp_umem *umem)
> +{
> +       struct task_struct *task;
> +       struct mm_struct *mm;
> +       unsigned long diff;
> +
> +       if (umem->pgs) {
> +               xdp_umem_unpin_pages(umem);
> +
> +               task = get_pid_task(umem->pid, PIDTYPE_PID);
> +               put_pid(umem->pid);
> +               if (!task)
> +                       goto out;
> +               mm = get_task_mm(task);
> +               put_task_struct(task);
> +               if (!mm)
> +                       goto out;
> +
> +               diff = umem->size >> PAGE_SHIFT;

Need to round up or size must always be a multiple of PAGE_SIZE.
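
For instance (sketch):

	diff = PAGE_ALIGN(umem->size) >> PAGE_SHIFT;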

> +
> +               down_write(&mm->mmap_sem);
> +               mm->pinned_vm -= diff;
> +               up_write(&mm->mmap_sem);

When using user->locked_vm for resource limit checks, no need
to also update mm->pinned_vm?

> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +       u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> +       u64 addr = mr->addr, size = mr->len;
> +       unsigned int nframes;
> +       int size_chk, err;
> +
> +       if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +               /* Strictly speaking we could support this, if:
> +                * - huge pages, or*
> +                * - using an IOMMU, or
> +                * - making sure the memory area is consecutive
> +                * but for now, we simply say "computer says no".
> +                */
> +               return -EINVAL;
> +       }

Ideally, AF_XDP subsumes all packet socket use cases. It does not
have packet v3's small packet optimizations of variable sized frames
and block signaling.

I don't suggest adding that now. But for the non-zerocopy case, it may
make sense to ensure that nothing is blocking a later addition of these
features. Especially for header-only (snaplen) workloads. So far, I don't
see any issues.

> +       if (!is_power_of_2(frame_size))
> +               return -EINVAL;
> +
> +       if (!PAGE_ALIGNED(addr)) {
> +               /* Memory area has to be page size aligned. For
> +                * simplicity, this might change.
> +                */
> +               return -EINVAL;
> +       }
> +
> +       if ((addr + size) < addr)
> +               return -EINVAL;
> +
> +       nframes = size / frame_size;
> +       if (nframes == 0 || nframes > UINT_MAX)
> +               return -EINVAL;

You may also want a check here that nframes * frame_size is at least
PAGE_SIZE and probably a multiple of that.
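
For example (untested sketch, reusing the function's existing locals;
the u64 arithmetic avoids a 32-bit overflow):

	u64 total = (u64)nframes * frame_size;

	if (total < PAGE_SIZE || total % PAGE_SIZE)
		return -EINVAL;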

> +       frame_headroom = ALIGN(frame_headroom, 64);
> +
> +       size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
> +       if (size_chk < 0)
> +               return -EINVAL;
> +
> +       umem->pid = get_task_pid(current, PIDTYPE_PID);
> +       umem->size = (size_t)size;
> +       umem->address = (unsigned long)addr;
> +       umem->props.frame_size = frame_size;
> +       umem->props.nframes = nframes;
> +       umem->frame_headroom = frame_headroom;
> +       umem->npgs = size / PAGE_SIZE;
> +       umem->pgs = NULL;
> +       umem->user = NULL;
> +
> +       umem->frame_size_log2 = ilog2(frame_size);
> +       umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
> +       umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
> +       atomic_set(&umem->users, 1);
> +
> +       err = xdp_umem_account_pages(umem);
> +       if (err)
> +               goto out;
> +
> +       err = xdp_umem_pin_pages(umem);
> +       if (err)

Need to call xdp_umem_unaccount_pages on error; see the sketch after
this hunk.
> +               goto out;
> +       return 0;
> +
> +out:
> +       put_pid(umem->pid);
> +       return err;
> +}
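
I.e. something like this (untested; assuming xdp_umem_unaccount_pages()
is the inverse of xdp_umem_account_pages()):

	err = xdp_umem_pin_pages(umem);
	if (err) {
		xdp_umem_unaccount_pages(umem);
		goto out;
	}
	return 0;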

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-23 13:56 ` [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap Björn Töpel
@ 2018-04-23 23:16   ` Michael S. Tsirkin
  2018-04-25 12:37     ` Björn Töpel
  2018-04-23 23:21   ` Michael S. Tsirkin
  1 sibling, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 23:16 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang

On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_RING. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_RING) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>  net/xdp/Makefile            |  2 +-
>  net/xdp/xdp_umem.c          |  5 ++++
>  net/xdp/xdp_umem.h          |  2 ++
>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>  7 files changed, 180 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG			3
> +#define XDP_UMEM_FILL_RING		4
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>  	__u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
> +
> +struct xdp_ring {
> +	__u32 producer __attribute__((aligned(64)));
> +	__u32 consumer __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> +	struct xdp_ring ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bff058f5a769..6fc233e03f30 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>  	struct mm_struct *mm;
>  	unsigned long diff;
>  
> +	if (umem->fq) {
> +		xskq_destroy(umem->fq);
> +		umem->fq = NULL;
> +	}
> +
>  	if (umem->pgs) {
>  		xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 58714f4f7f25..3086091aebdd 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -18,9 +18,11 @@
>  #include <linux/mm.h>
>  #include <linux/if_xdp.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> +	struct xsk_queue *fq;
>  	struct page **pgs;
>  	struct xdp_umem_props props;
>  	u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 19fc719cbe0d..bf6a1151df28 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include <linux/netdevice.h>
>  #include <net/sock.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  struct xdp_sock {
> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>  	return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> +	struct xsk_queue *q;
> +
> +	if (entries == 0 || *queue || !is_power_of_2(entries))
> +		return -EINVAL;
> +
> +	q = xskq_create(entries);
> +	if (!q)
> +		return -ENOMEM;
> +
> +	*queue = q;
> +	return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>  	struct sock *sk = sock->sk;
> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return 0;
>  	}
> +	case XDP_UMEM_FILL_RING:
> +	{
> +		struct xsk_queue **q;
> +		int entries;
> +
> +		if (!xs->umem)
> +			return -EINVAL;
> +
> +		if (copy_from_user(&entries, optval, sizeof(entries)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		q = &xs->umem->fq;
> +		err = xsk_init_queue(entries, q);
> +		mutex_unlock(&xs->mutex);
> +		return err;
> +	}
>  	default:
>  		break;
>  	}
> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  	return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, struct socket *sock,
> +		    struct vm_area_struct *vma)
> +{
> +	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	struct xdp_sock *xs = xdp_sk(sock->sk);
> +	struct xsk_queue *q;
> +	unsigned long pfn;
> +	struct page *qpg;
> +
> +	if (!xs->umem)
> +		return -EINVAL;
> +
> +	if (offset == XDP_UMEM_PGOFF_FILL_RING)
> +		q = xs->umem->fq;
> +	else
> +		return -EINVAL;
> +
> +	qpg = virt_to_head_page(q->ring);
> +	if (size > (PAGE_SIZE << compound_order(qpg)))
> +		return -EINVAL;
> +
> +	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
> +	return remap_pfn_range(vma, vma->vm_start, pfn,
> +			       size, vma->vm_page_prot);
> +}
> +
>  static struct proto xsk_proto = {
>  	.name =		"XDP",
>  	.owner =	THIS_MODULE,
> @@ -139,7 +199,7 @@ static const struct proto_ops xsk_proto_ops = {
>  	.getsockopt =	sock_no_getsockopt,
>  	.sendmsg =	sock_no_sendmsg,
>  	.recvmsg =	sock_no_recvmsg,
> -	.mmap =		sock_no_mmap,
> +	.mmap =		xsk_mmap,
>  	.sendpage =	sock_no_sendpage,
>  };
>  
> diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
> new file mode 100644
> index 000000000000..23da4f29d3fb
> --- /dev/null
> +++ b/net/xdp/xsk_queue.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/slab.h>
> +
> +#include "xsk_queue.h"
> +
> +static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
> +{
> +	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
> +}
> +
> +struct xsk_queue *xskq_create(u32 nentries)
> +{
> +	struct xsk_queue *q;
> +	gfp_t gfp_flags;
> +	size_t size;
> +
> +	q = kzalloc(sizeof(*q), GFP_KERNEL);
> +	if (!q)
> +		return NULL;
> +
> +	q->nentries = nentries;
> +	q->ring_mask = nentries - 1;
> +
> +	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
> +		    __GFP_COMP  | __GFP_NORETRY;
> +	size = xskq_umem_get_ring_size(q);
> +
> +	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
> +						      get_order(size));
> +	if (!q->ring) {
> +		kfree(q);
> +		return NULL;
> +	}
> +
> +	return q;
> +}
> +
> +void xskq_destroy(struct xsk_queue *q)
> +{
> +	if (!q)
> +		return;
> +
> +	page_frag_free(q->ring);
> +	kfree(q);
> +}
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> new file mode 100644
> index 000000000000..7eb556bf73be
> --- /dev/null
> +++ b/net/xdp/xsk_queue.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef _LINUX_XSK_QUEUE_H
> +#define _LINUX_XSK_QUEUE_H
> +
> +#include <linux/types.h>
> +#include <linux/if_xdp.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xsk_queue {
> +	struct xdp_umem_props umem_props;
> +	u32 ring_mask;
> +	u32 nentries;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_head;
> +	u32 cons_tail;
> +	struct xdp_ring *ring;
> +	u64 invalid_descs;
> +};

Any documentation on how e.g. the locking works here?


> +
> +struct xsk_queue *xskq_create(u32 nentries);
> +void xskq_destroy(struct xsk_queue *q);
> +
> +#endif /* _LINUX_XSK_QUEUE_H */
> -- 
> 2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-23 13:56 ` [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap Björn Töpel
  2018-04-23 23:16   ` Michael S. Tsirkin
@ 2018-04-23 23:21   ` Michael S. Tsirkin
  2018-04-23 23:59     ` Willem de Bruijn
  1 sibling, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 23:21 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang

On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_RING. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_RING) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>  net/xdp/Makefile            |  2 +-
>  net/xdp/xdp_umem.c          |  5 ++++
>  net/xdp/xdp_umem.h          |  2 ++
>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>  7 files changed, 180 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG			3
> +#define XDP_UMEM_FILL_RING		4
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>  	__u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
> +
> +struct xdp_ring {
> +	__u32 producer __attribute__((aligned(64)));
> +	__u32 consumer __attribute__((aligned(64)));
> +};

Why 64? And do you still need these guys in uapi?

> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> +	struct xdp_ring ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bff058f5a769..6fc233e03f30 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>  	struct mm_struct *mm;
>  	unsigned long diff;
>  
> +	if (umem->fq) {
> +		xskq_destroy(umem->fq);
> +		umem->fq = NULL;
> +	}
> +
>  	if (umem->pgs) {
>  		xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 58714f4f7f25..3086091aebdd 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -18,9 +18,11 @@
>  #include <linux/mm.h>
>  #include <linux/if_xdp.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> +	struct xsk_queue *fq;
>  	struct page **pgs;
>  	struct xdp_umem_props props;
>  	u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 19fc719cbe0d..bf6a1151df28 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include <linux/netdevice.h>
>  #include <net/sock.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  struct xdp_sock {
> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>  	return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> +	struct xsk_queue *q;
> +
> +	if (entries == 0 || *queue || !is_power_of_2(entries))
> +		return -EINVAL;
> +
> +	q = xskq_create(entries);
> +	if (!q)
> +		return -ENOMEM;
> +
> +	*queue = q;
> +	return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>  	struct sock *sk = sock->sk;
> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return 0;
>  	}
> +	case XDP_UMEM_FILL_RING:
> +	{
> +		struct xsk_queue **q;
> +		int entries;
> +
> +		if (!xs->umem)
> +			return -EINVAL;
> +
> +		if (copy_from_user(&entries, optval, sizeof(entries)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		q = &xs->umem->fq;
> +		err = xsk_init_queue(entries, q);
> +		mutex_unlock(&xs->mutex);
> +		return err;
> +	}
>  	default:
>  		break;
>  	}
> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  	return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, struct socket *sock,
> +		    struct vm_area_struct *vma)
> +{
> +	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	struct xdp_sock *xs = xdp_sk(sock->sk);
> +	struct xsk_queue *q;
> +	unsigned long pfn;
> +	struct page *qpg;
> +
> +	if (!xs->umem)
> +		return -EINVAL;
> +
> +	if (offset == XDP_UMEM_PGOFF_FILL_RING)
> +		q = xs->umem->fq;
> +	else
> +		return -EINVAL;
> +
> +	qpg = virt_to_head_page(q->ring);
> +	if (size > (PAGE_SIZE << compound_order(qpg)))
> +		return -EINVAL;
> +
> +	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
> +	return remap_pfn_range(vma, vma->vm_start, pfn,
> +			       size, vma->vm_page_prot);
> +}
> +
>  static struct proto xsk_proto = {
>  	.name =		"XDP",
>  	.owner =	THIS_MODULE,
> @@ -139,7 +199,7 @@ static const struct proto_ops xsk_proto_ops = {
>  	.getsockopt =	sock_no_getsockopt,
>  	.sendmsg =	sock_no_sendmsg,
>  	.recvmsg =	sock_no_recvmsg,
> -	.mmap =		sock_no_mmap,
> +	.mmap =		xsk_mmap,
>  	.sendpage =	sock_no_sendpage,
>  };
>  
> diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
> new file mode 100644
> index 000000000000..23da4f29d3fb
> --- /dev/null
> +++ b/net/xdp/xsk_queue.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/slab.h>
> +
> +#include "xsk_queue.h"
> +
> +static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
> +{
> +	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
> +}
> +
> +struct xsk_queue *xskq_create(u32 nentries)
> +{
> +	struct xsk_queue *q;
> +	gfp_t gfp_flags;
> +	size_t size;
> +
> +	q = kzalloc(sizeof(*q), GFP_KERNEL);
> +	if (!q)
> +		return NULL;
> +
> +	q->nentries = nentries;
> +	q->ring_mask = nentries - 1;
> +
> +	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
> +		    __GFP_COMP  | __GFP_NORETRY;
> +	size = xskq_umem_get_ring_size(q);
> +
> +	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
> +						      get_order(size));
> +	if (!q->ring) {
> +		kfree(q);
> +		return NULL;
> +	}
> +
> +	return q;
> +}
> +
> +void xskq_destroy(struct xsk_queue *q)
> +{
> +	if (!q)
> +		return;
> +
> +	page_frag_free(q->ring);
> +	kfree(q);
> +}
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> new file mode 100644
> index 000000000000..7eb556bf73be
> --- /dev/null
> +++ b/net/xdp/xsk_queue.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space ring structure
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef _LINUX_XSK_QUEUE_H
> +#define _LINUX_XSK_QUEUE_H
> +
> +#include <linux/types.h>
> +#include <linux/if_xdp.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xsk_queue {
> +	struct xdp_umem_props umem_props;
> +	u32 ring_mask;
> +	u32 nentries;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_head;
> +	u32 cons_tail;
> +	struct xdp_ring *ring;
> +	u64 invalid_descs;
> +};
> +
> +struct xsk_queue *xskq_create(u32 nentries);
> +void xskq_destroy(struct xsk_queue *q);
> +
> +#endif /* _LINUX_XSK_QUEUE_H */
> -- 
> 2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (14 preceding siblings ...)
  2018-04-23 13:56 ` [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets Björn Töpel
@ 2018-04-23 23:22 ` Michael S. Tsirkin
  2018-04-24  6:55   ` Björn Töpel
  2018-04-24  2:29 ` Jason Wang
  2018-04-24 17:03 ` Willem de Bruijn
  17 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 23:22 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang

On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
> 
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
> 
> This new dedicated packet buffer area is call a UMEM. It consists of a
> number of equally size frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
> 
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user-space.
> 
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
> 
> How are packets then distributed between these two XSKs? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
> 
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space, a fallback mode that works for any network device. On the other
> hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
> 
> There is a xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, which we will enable AF_XDP on. Here, we use ethtool
> for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments: one for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW at
> full 40 Gbit/s line rate.
> 
> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.9(3.0)   9.4(9.3)  
> txpush       2.5(2.2)   NA*
> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.1(2.2)   3.3(3.1)  
> l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
> 
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this RFC. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
> 
> XDP performance on our system as a baseline:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
> 
> Changes from RFC V2:
> 
> * Optimizations and simplifications to the ring structures inspired by
>   ptr_ring.h 
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>   consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
> 
> The structure of the patch set is as follows:
> 
> Patches 1-2: Basic socket and umem plumbing 
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
> 
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
> 
> Questions:
> 
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
> 
> To do:
> 
> * Optimize performance
> 
> * Kernel selftest
> 
> Post-series plan:
> 
> * Loadable kernel module support for AF_XDP would be nice. Unclear how to
>   achieve this though since our XDP code depends on net/core.
> 
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to the user space socket.
> 
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
> 
> * And of course getting to zero-copy support in small increments. 
> 
> Thanks: Björn and Magnus
> 
> Björn Töpel (8):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xdp: introduce xdp_return_buff API
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
> 
> Magnus Karlsson (7):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application for AF_XDP sockets
> 
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  46 ++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  34 +-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>  net/xdp/xdp_umem.h                  |  65 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  32 files changed, 2945 insertions(+), 35 deletions(-)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c

Is there a chance of Documentation/networking/af_xdp.txt?


> 
> -- 
> 2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
  2018-04-23 13:56 ` [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets Björn Töpel
@ 2018-04-23 23:31   ` Michael S. Tsirkin
  2018-04-24  8:22     ` Magnus Karlsson
  0 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2018-04-23 23:31 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, Björn Töpel

On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> This is a sample application for AF_XDP sockets. The application
> supports three different modes of operation: rxdrop, txonly and l2fwd.
> 
> To showcase simple round-robin load-balancing between a set of
> sockets in an xskmap, set the RR_LB compile time define option to 1 in
> "xdpsock.h".
> 
> Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  samples/bpf/Makefile       |   4 +
>  samples/bpf/xdpsock.h      |  11 +
>  samples/bpf/xdpsock_kern.c |  56 +++
>  samples/bpf/xdpsock_user.c | 947 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 1018 insertions(+)
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index aa8c392e2e52..d0ddc1abf20d 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
>  hostprogs-y += syscall_tp
>  hostprogs-y += cpustat
>  hostprogs-y += xdp_adjust_tail
> +hostprogs-y += xdpsock
>  
>  # Libbpf dependencies
>  LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
> @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
>  syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>  cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
>  xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
> +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
>  always += syscall_tp_kern.o
>  always += cpustat_kern.o
>  always += xdp_adjust_tail_kern.o
> +always += xdpsock_kern.o
>  
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
>  HOSTLOADLIBES_syscall_tp += -lelf
>  HOSTLOADLIBES_cpustat += -lelf
>  HOSTLOADLIBES_xdp_adjust_tail += -lelf
> +HOSTLOADLIBES_xdpsock += -lelf -pthread
>  
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
> new file mode 100644
> index 000000000000..533ab81adfa1
> --- /dev/null
> +++ b/samples/bpf/xdpsock.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef XDPSOCK_H_
> +#define XDPSOCK_H_
> +
> +/* Power-of-2 number of sockets */
> +#define MAX_SOCKS 4
> +
> +/* Round-robin receive */
> +#define RR_LB 0
> +
> +#endif /* XDPSOCK_H_ */
> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
> new file mode 100644
> index 000000000000..d8806c41362e
> --- /dev/null
> +++ b/samples/bpf/xdpsock_kern.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define KBUILD_MODNAME "foo"
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +#include "xdpsock.h"
> +
> +struct bpf_map_def SEC("maps") qidconf_map = {
> +	.type		= BPF_MAP_TYPE_ARRAY,
> +	.key_size	= sizeof(int),
> +	.value_size	= sizeof(int),
> +	.max_entries	= 1,
> +};
> +
> +struct bpf_map_def SEC("maps") xsks_map = {
> +	.type = BPF_MAP_TYPE_XSKMAP,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(int),
> +	.max_entries = 4,
> +};
> +
> +struct bpf_map_def SEC("maps") rr_map = {
> +	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(unsigned int),
> +	.max_entries = 1,
> +};
> +
> +SEC("xdp_sock")
> +int xdp_sock_prog(struct xdp_md *ctx)
> +{
> +	int *qidconf, key = 0, idx;
> +	unsigned int *rr;
> +
> +	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
> +	if (!qidconf)
> +		return XDP_ABORTED;
> +
> +	if (*qidconf != ctx->rx_queue_index)
> +		return XDP_PASS;
> +
> +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
> +	rr = bpf_map_lookup_elem(&rr_map, &key);
> +	if (!rr)
> +		return XDP_ABORTED;
> +
> +	*rr = (*rr + 1) & (MAX_SOCKS - 1);
> +	idx = *rr;
> +#else
> +	idx = 0;
> +#endif
> +
> +	return bpf_redirect_map(&xsks_map, idx, 0);
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> new file mode 100644
> index 000000000000..690bac1a0ab7
> --- /dev/null
> +++ b/samples/bpf/xdpsock_user.c
> @@ -0,0 +1,947 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2017 - 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <assert.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <libgen.h>
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <linux/if_ether.h>
> +#include <net/if.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <net/ethernet.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/mman.h>
> +#include <time.h>
> +#include <unistd.h>
> +#include <pthread.h>
> +#include <locale.h>
> +#include <sys/types.h>
> +#include <poll.h>
> +
> +#include "bpf_load.h"
> +#include "bpf_util.h"
> +#include "libbpf.h"
> +
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +#ifndef AF_XDP
> +#define AF_XDP 44
> +#endif
> +
> +#ifndef PF_XDP
> +#define PF_XDP AF_XDP
> +#endif
> +
> +#define NUM_FRAMES 131072
> +#define FRAME_HEADROOM 0
> +#define FRAME_SIZE 2048
> +#define NUM_DESCS 1024
> +#define BATCH_SIZE 16
> +
> +#define FQ_NUM_DESCS 1024
> +#define CQ_NUM_DESCS 1024
> +
> +#define DEBUG_HEXDUMP 0
> +
> +typedef __u32 u32;
> +
> +static unsigned long prev_time;
> +
> +enum benchmark_type {
> +	BENCH_RXDROP = 0,
> +	BENCH_TXONLY = 1,
> +	BENCH_L2FWD = 2,
> +};
> +
> +static enum benchmark_type opt_bench = BENCH_RXDROP;
> +static u32 opt_xdp_flags;
> +static const char *opt_if = "";
> +static int opt_ifindex;
> +static int opt_queue;
> +static int opt_poll;
> +static int opt_shared_packet_buffer;
> +static int opt_interval = 1;
> +
> +struct xdp_umem_uqueue {
> +	u32 cached_prod;
> +	u32 cached_cons;
> +	u32 mask;
> +	u32 size;
> +	struct xdp_umem_ring *ring;
> +};
> +
> +struct xdp_umem {
> +	char (*frames)[FRAME_SIZE];
> +	struct xdp_umem_uqueue fq;
> +	struct xdp_umem_uqueue cq;
> +	int fd;
> +};
> +
> +struct xdp_uqueue {
> +	u32 cached_prod;
> +	u32 cached_cons;
> +	u32 mask;
> +	u32 size;
> +	struct xdp_rxtx_ring *ring;
> +};
> +
> +struct xdpsock {
> +	struct xdp_uqueue rx;
> +	struct xdp_uqueue tx;
> +	int sfd;
> +	struct xdp_umem *umem;
> +	u32 outstanding_tx;
> +	unsigned long rx_npkts;
> +	unsigned long tx_npkts;
> +	unsigned long prev_rx_npkts;
> +	unsigned long prev_tx_npkts;
> +};
> +
> +#define MAX_SOCKS 4
> +static int num_socks;
> +struct xdpsock *xsks[MAX_SOCKS];
> +
> +static unsigned long get_nsecs(void)
> +{
> +	struct timespec ts;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &ts);
> +	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
> +}
> +
> +static void dump_stats(void);
> +
> +#define lassert(expr)							\
> +	do {								\
> +		if (!(expr)) {						\
> +			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
> +				#expr ": errno: %d/\"%s\"\n",		\
> +				__FILE__, __func__, __LINE__,		\
> +				errno, strerror(errno));		\
> +			dump_stats();					\
> +			exit(EXIT_FAILURE);				\
> +		}							\
> +	} while (0)
> +
> +#define barrier() __asm__ __volatile__("": : :"memory")
> +#define u_smp_rmb() barrier()
> +#define u_smp_wmb() barrier()
> +#define likely(x) __builtin_expect(!!(x), 1)
> +#define unlikely(x) __builtin_expect(!!(x), 0)
> +
> +static const char pkt_data[] =
> +	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
> +	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
> +	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
> +	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
> +
> +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
> +{
> +	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
> +
> +	if (free_entries >= nb)
> +		return free_entries;
> +
> +	/* Refresh the local tail pointer */
> +	q->cached_cons = q->ring->ptrs.consumer;
> +
> +	return q->size - (q->cached_prod - q->cached_cons);
> +}
> +
> +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
> +{
> +	u32 free_entries = q->cached_cons - q->cached_prod;
> +
> +	if (free_entries >= ndescs)
> +		return free_entries;
> +
> +	/* Refresh the local tail pointer */
> +	q->cached_cons = q->ring->ptrs.consumer + q->size;
> +	return q->cached_cons - q->cached_prod;
> +}
> +
> +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
> +{
> +	u32 entries = q->cached_prod - q->cached_cons;
> +
> +	if (entries == 0)
> +		q->cached_prod = q->ring->ptrs.producer;
> +
> +	entries = q->cached_prod - q->cached_cons;
> +
> +	return (entries > nb) ? nb : entries;
> +}
> +
> +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
> +{
> +	u32 entries = q->cached_prod - q->cached_cons;
> +
> +	if (entries == 0)
> +		q->cached_prod = q->ring->ptrs.producer;
> +
> +	entries = q->cached_prod - q->cached_cons;
> +	return (entries > ndescs) ? ndescs : entries;
> +}
> +
> +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
> +					 struct xdp_desc *d,
> +					 size_t nb)
> +{
> +	u32 i;
> +
> +	if (umem_nb_free(fq, nb) < nb)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nb; i++) {
> +		u32 idx = fq->cached_prod++ & fq->mask;
> +
> +		fq->ring->desc[idx] = d[i].idx;
> +	}
> +
> +	u_smp_wmb();
> +
> +	fq->ring->ptrs.producer = fq->cached_prod;
> +
> +	return 0;
> +}
> +
> +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
> +				      size_t nb)
> +{
> +	u32 i;
> +
> +	if (umem_nb_free(fq, nb) < nb)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nb; i++) {
> +		u32 idx = fq->cached_prod++ & fq->mask;
> +
> +		fq->ring->desc[idx] = d[i];
> +	}
> +
> +	u_smp_wmb();
> +
> +	fq->ring->ptrs.producer = fq->cached_prod;
> +
> +	return 0;
> +}
> +
> +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
> +					       u32 *d, size_t nb)
> +{
> +	u32 idx, i, entries = umem_nb_avail(cq, nb);
> +
> +	u_smp_rmb();
> +
> +	for (i = 0; i < entries; i++) {
> +		idx = cq->cached_cons++ & cq->mask;
> +		d[i] = cq->ring->desc[idx];
> +	}
> +
> +	if (entries > 0) {
> +		u_smp_wmb();
> +
> +		cq->ring->ptrs.consumer = cq->cached_cons;
> +	}
> +
> +	return entries;
> +}
> +
> +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
> +{
> +	lassert(idx < NUM_FRAMES);
> +	return &xsk->umem->frames[idx][off];
> +}
> +
> +static inline int xq_enq(struct xdp_uqueue *uq,
> +			 const struct xdp_desc *descs,
> +			 unsigned int ndescs)
> +{
> +	struct xdp_rxtx_ring *r = uq->ring;
> +	unsigned int i;
> +
> +	if (xq_nb_free(uq, ndescs) < ndescs)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndescs; i++) {
> +		u32 idx = uq->cached_prod++ & uq->mask;
> +
> +		r->desc[idx].idx = descs[i].idx;
> +		r->desc[idx].len = descs[i].len;
> +		r->desc[idx].offset = descs[i].offset;
> +	}
> +
> +	u_smp_wmb();
> +
> +	r->ptrs.producer = uq->cached_prod;
> +	return 0;
> +}
> +
> +static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
> +				 __u32 idx, unsigned int ndescs)
> +{
> +	struct xdp_rxtx_ring *q = uq->ring;
> +	unsigned int i;
> +
> +	if (xq_nb_free(uq, ndescs) < ndescs)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndescs; i++) {
> +		u32 idx = uq->cached_prod++ & uq->mask;
> +
> +		q->desc[idx].idx	= idx + i;
> +		q->desc[idx].len	= sizeof(pkt_data) - 1;
> +		q->desc[idx].offset	= 0;
> +	}
> +
> +	u_smp_wmb();
> +
> +	q->ptrs.producer = uq->cached_prod;
> +	return 0;
> +}
> +
> +static inline int xq_deq(struct xdp_uqueue *uq,
> +			 struct xdp_desc *descs,
> +			 int ndescs)
> +{
> +	struct xdp_rxtx_ring *r = uq->ring;
> +	unsigned int idx;
> +	int i, entries;
> +
> +	entries = xq_nb_avail(uq, ndescs);
> +
> +	u_smp_rmb();
> +
> +	for (i = 0; i < entries; i++) {
> +		idx = uq->cached_cons++ & uq->mask;
> +		descs[i] = r->desc[idx];
> +	}
> +
> +	if (entries > 0) {
> +		u_smp_wmb();
> +
> +		r->ptrs.consumer = uq->cached_cons;
> +	}
> +
> +	return entries;
> +}

Interesting, I was under the impression that you were
planning to get rid of consumer/producer counters
and validate the descriptors instead.

That's the ptr_ring design.

You can then drop all the code around synchronising
counter caches, as well as smp_rmb barriers.
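
For reference, a ptr_ring-style consume might look like the sketch
below. Note that the "flags" word and the DESC_VALID bit are invented
for illustration; the descriptors in this patch set carry no such
field:

	static inline int xq_deq_one(struct xdp_uqueue *uq, struct xdp_desc *d)
	{
		u32 idx = uq->cached_cons & uq->mask;

		/* an invalid descriptor marks an empty slot, so the
		 * consumer never needs to read the producer counter
		 */
		if (!(uq->ring->desc[idx].flags & DESC_VALID))
			return -EAGAIN;

		*d = uq->ring->desc[idx];
		uq->ring->desc[idx].flags = 0;	/* hand the slot back */
		uq->cached_cons++;
		return 0;
	}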


> +
> +static void swap_mac_addresses(void *data)
> +{
> +	struct ether_header *eth = (struct ether_header *)data;
> +	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
> +	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
> +	struct ether_addr tmp;
> +
> +	tmp = *src_addr;
> +	*src_addr = *dst_addr;
> +	*dst_addr = tmp;
> +}
> +
> +#if DEBUG_HEXDUMP
> +static void hex_dump(void *pkt, size_t length, const char *prefix)
> +{
> +	int i = 0;
> +	const unsigned char *address = (unsigned char *)pkt;
> +	const unsigned char *line = address;
> +	size_t line_size = 32;
> +	unsigned char c;
> +
> +	printf("length = %zu\n", length);
> +	printf("%s | ", prefix);
> +	while (length-- > 0) {
> +		printf("%02X ", *address++);
> +		if (!(++i % line_size) || (length == 0 && i % line_size)) {
> +			if (length == 0) {
> +				while (i++ % line_size)
> +					printf("__ ");
> +			}
> +			printf(" | ");	/* right close */
> +			while (line < address) {
> +				c = *line++;
> +				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
> +			}
> +			printf("\n");
> +			if (length > 0)
> +				printf("%s | ", prefix);
> +		}
> +	}
> +	printf("\n");
> +}
> +#endif
> +
> +static size_t gen_eth_frame(char *frame)
> +{
> +	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
> +	return sizeof(pkt_data) - 1;
> +}
> +
> +static struct xdp_umem *xdp_umem_configure(int sfd)
> +{
> +	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
> +	struct xdp_umem_reg mr;
> +	struct xdp_umem *umem;
> +	void *bufs;
> +
> +	umem = calloc(1, sizeof(*umem));
> +	lassert(umem);
> +
> +	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
> +			       NUM_FRAMES * FRAME_SIZE) == 0);
> +
> +	mr.addr = (__u64)bufs;
> +	mr.len = NUM_FRAMES * FRAME_SIZE;
> +	mr.frame_size = FRAME_SIZE;
> +	mr.frame_headroom = FRAME_HEADROOM;
> +
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
> +			   sizeof(int)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
> +			   sizeof(int)) == 0);
> +
> +	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
> +			     FQ_NUM_DESCS * sizeof(u32),
> +			     PROT_READ | PROT_WRITE,
> +			     MAP_SHARED | MAP_POPULATE, sfd,
> +			     XDP_UMEM_PGOFF_FILL_RING);
> +	lassert(umem->fq.ring != MAP_FAILED);
> +
> +	umem->fq.mask = FQ_NUM_DESCS - 1;
> +	umem->fq.size = FQ_NUM_DESCS;
> +
> +	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
> +			     CQ_NUM_DESCS * sizeof(u32),
> +			     PROT_READ | PROT_WRITE,
> +			     MAP_SHARED | MAP_POPULATE, sfd,
> +			     XDP_UMEM_PGOFF_COMPLETION_RING);
> +	lassert(umem->cq.ring != MAP_FAILED);
> +
> +	umem->cq.mask = CQ_NUM_DESCS - 1;
> +	umem->cq.size = CQ_NUM_DESCS;
> +
> +	umem->frames = (char (*)[FRAME_SIZE])bufs;
> +	umem->fd = sfd;
> +
> +	if (opt_bench == BENCH_TXONLY) {
> +		int i;
> +
> +		for (i = 0; i < NUM_FRAMES; i++)
> +			(void)gen_eth_frame(&umem->frames[i][0]);
> +	}
> +
> +	return umem;
> +}
> +
> +static struct xdpsock *xsk_configure(struct xdp_umem *umem)
> +{
> +	struct sockaddr_xdp sxdp = {};
> +	int sfd, ndescs = NUM_DESCS;
> +	struct xdpsock *xsk;
> +	bool shared = true;
> +	u32 i;
> +
> +	sfd = socket(PF_XDP, SOCK_RAW, 0);
> +	lassert(sfd >= 0);
> +
> +	xsk = calloc(1, sizeof(*xsk));
> +	lassert(xsk);
> +
> +	xsk->sfd = sfd;
> +	xsk->outstanding_tx = 0;
> +
> +	if (!umem) {
> +		shared = false;
> +		xsk->umem = xdp_umem_configure(sfd);
> +	} else {
> +		xsk->umem = umem;
> +	}
> +
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
> +			   &ndescs, sizeof(int)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
> +			   &ndescs, sizeof(int)) == 0);
> +
> +	/* Rx */
> +	xsk->rx.ring = mmap(NULL,
> +			    sizeof(struct xdp_ring) +
> +			    NUM_DESCS * sizeof(struct xdp_desc),
> +			    PROT_READ | PROT_WRITE,
> +			    MAP_SHARED | MAP_POPULATE, sfd,
> +			    XDP_PGOFF_RX_RING);
> +	lassert(xsk->rx.ring != MAP_FAILED);
> +
> +	if (!shared) {
> +		for (i = 0; i < NUM_DESCS / 2; i++)
> +			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
> +				== 0);
> +	}
> +
> +	/* Tx */
> +	xsk->tx.ring = mmap(NULL,
> +			 sizeof(struct xdp_ring) +
> +			 NUM_DESCS * sizeof(struct xdp_desc),
> +			 PROT_READ | PROT_WRITE,
> +			 MAP_SHARED | MAP_POPULATE, sfd,
> +			 XDP_PGOFF_TX_RING);
> +	lassert(xsk->tx.ring != MAP_FAILED);
> +
> +	xsk->rx.mask = NUM_DESCS - 1;
> +	xsk->rx.size = NUM_DESCS;
> +
> +	xsk->tx.mask = NUM_DESCS - 1;
> +	xsk->tx.size = NUM_DESCS;
> +
> +	sxdp.sxdp_family = PF_XDP;
> +	sxdp.sxdp_ifindex = opt_ifindex;
> +	sxdp.sxdp_queue_id = opt_queue;
> +	if (shared) {
> +		sxdp.sxdp_flags = XDP_SHARED_UMEM;
> +		sxdp.sxdp_shared_umem_fd = umem->fd;
> +	}
> +
> +	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
> +
> +	return xsk;
> +}
> +
> +static void print_benchmark(bool running)
> +{
> +	const char *bench_str = "INVALID";
> +
> +	if (opt_bench == BENCH_RXDROP)
> +		bench_str = "rxdrop";
> +	else if (opt_bench == BENCH_TXONLY)
> +		bench_str = "txonly";
> +	else if (opt_bench == BENCH_L2FWD)
> +		bench_str = "l2fwd";
> +
> +	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
> +	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
> +		printf("xdp-skb ");
> +	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
> +		printf("xdp-drv ");
> +	else
> +		printf("	");
> +
> +	if (opt_poll)
> +		printf("poll() ");
> +
> +	if (running) {
> +		printf("running...");
> +		fflush(stdout);
> +	}
> +}
> +
> +static void dump_stats(void)
> +{
> +	unsigned long now = get_nsecs();
> +	long dt = now - prev_time;
> +	int i;
> +
> +	prev_time = now;
> +
> +	for (i = 0; i < num_socks; i++) {
> +		char *fmt = "%-15s %'-11.0f %'-11lu\n";
> +		double rx_pps, tx_pps;
> +
> +		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
> +			 1000000000. / dt;
> +		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
> +			 1000000000. / dt;
> +
> +		printf("\n sock%d@", i);
> +		print_benchmark(false);
> +		printf("\n");
> +
> +		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> +		       dt / 1000000000.);
> +		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> +		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> +
> +		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
> +		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
> +	}
> +}
> +
> +static void *poller(void *arg)
> +{
> +	(void)arg;
> +	for (;;) {
> +		sleep(opt_interval);
> +		dump_stats();
> +	}
> +
> +	return NULL;
> +}
> +
> +static void int_exit(int sig)
> +{
> +	(void)sig;
> +	dump_stats();
> +	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
> +	exit(EXIT_SUCCESS);
> +}
> +
> +static struct option long_options[] = {
> +	{"rxdrop", no_argument, 0, 'r'},
> +	{"txonly", no_argument, 0, 't'},
> +	{"l2fwd", no_argument, 0, 'l'},
> +	{"interface", required_argument, 0, 'i'},
> +	{"queue", required_argument, 0, 'q'},
> +	{"poll", no_argument, 0, 'p'},
> +	{"shared-buffer", no_argument, 0, 's'},
> +	{"xdp-skb", no_argument, 0, 'S'},
> +	{"xdp-native", no_argument, 0, 'N'},
> +	{"interval", required_argument, 0, 'n'},
> +	{0, 0, 0, 0}
> +};
> +
> +static void usage(const char *prog)
> +{
> +	const char *str =
> +		"  Usage: %s [OPTIONS]\n"
> +		"  Options:\n"
> +		"  -r, --rxdrop		Discard all incoming packets (default)\n"
> +		"  -t, --txonly		Only send packets\n"
> +		"  -l, --l2fwd		MAC swap L2 forwarding\n"
> +		"  -i, --interface=n	Run on interface n\n"
> +		"  -q, --queue=n	Use queue n (default 0)\n"
> +		"  -p, --poll		Use poll syscall\n"
> +		"  -s, --shared-buffer	Use shared packet buffer\n"
> +		"  -S, --xdp-skb=n	Use XDP skb-mode\n"
> +		"  -N, --xdp-native=n	Enforce XDP native mode\n"
> +		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
> +		"\n";
> +	fprintf(stderr, str, prog);
> +	exit(EXIT_FAILURE);
> +}
> +
> +static void parse_command_line(int argc, char **argv)
> +{
> +	int option_index, c;
> +
> +	opterr = 0;
> +
> +	for (;;) {
> +		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
> +				&option_index);
> +		if (c == -1)
> +			break;
> +
> +		switch (c) {
> +		case 'r':
> +			opt_bench = BENCH_RXDROP;
> +			break;
> +		case 't':
> +			opt_bench = BENCH_TXONLY;
> +			break;
> +		case 'l':
> +			opt_bench = BENCH_L2FWD;
> +			break;
> +		case 'i':
> +			opt_if = optarg;
> +			break;
> +		case 'q':
> +			opt_queue = atoi(optarg);
> +			break;
> +		case 's':
> +			opt_shared_packet_buffer = 1;
> +			break;
> +		case 'p':
> +			opt_poll = 1;
> +			break;
> +		case 'S':
> +			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
> +			break;
> +		case 'N':
> +			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
> +			break;
> +		case 'n':
> +			opt_interval = atoi(optarg);
> +			break;
> +		default:
> +			usage(basename(argv[0]));
> +		}
> +	}
> +
> +	opt_ifindex = if_nametoindex(opt_if);
> +	if (!opt_ifindex) {
> +		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
> +			opt_if);
> +		usage(basename(argv[0]));
> +	}
> +}
> +
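> +/* Kick the kernel Tx path with a zero-length, non-blocking sendto();
> + * ENOBUFS/EAGAIN just mean the kernel is busy, so try again later.
> + */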
> +static void kick_tx(int fd)
> +{
> +	int ret;
> +
> +	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
> +		return;
> +	lassert(0);
> +}
> +
> +static inline void complete_tx_l2fwd(struct xdpsock *xsk)
> +{
> +	u32 descs[BATCH_SIZE];
> +	unsigned int rcvd;
> +	size_t ndescs;
> +
> +	if (!xsk->outstanding_tx)
> +		return;
> +
> +	kick_tx(xsk->sfd);
> +	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
> +		 xsk->outstanding_tx;
> +
> +	/* re-add completed Tx buffers */
> +	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
> +	if (rcvd > 0) {
> +		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
> +		xsk->outstanding_tx -= rcvd;
> +		xsk->tx_npkts += rcvd;
> +	}
> +}
> +
> +static inline void complete_tx_only(struct xdpsock *xsk)
> +{
> +	u32 descs[BATCH_SIZE];
> +	unsigned int rcvd;
> +
> +	if (!xsk->outstanding_tx)
> +		return;
> +
> +	kick_tx(xsk->sfd);
> +
> +	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
> +	if (rcvd > 0) {
> +		xsk->outstanding_tx -= rcvd;
> +		xsk->tx_npkts += rcvd;
> +	}
> +}
> +
> +static void rx_drop(struct xdpsock *xsk)
> +{
> +	struct xdp_desc descs[BATCH_SIZE];
> +	unsigned int rcvd, i;
> +
> +	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +	if (!rcvd)
> +		return;
> +
> +	for (i = 0; i < rcvd; i++) {
> +		u32 idx = descs[i].idx;
> +
> +		lassert(idx < NUM_FRAMES);
> +#if DEBUG_HEXDUMP
> +		char *pkt;
> +		char buf[32];
> +
> +		pkt = xq_get_data(xsk, idx, descs[i].offset);
> +		sprintf(buf, "idx=%d", idx);
> +		hex_dump(pkt, descs[i].len, buf);
> +#endif
> +	}
> +
> +	xsk->rx_npkts += rcvd;
> +
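> +	/* Recycle the frames back to the fill ring so the kernel can
> +	 * reuse them for future Rx; the packet data itself is dropped.
> +	 */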
> +	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
> +}
> +
> +static void rx_drop_all(void)
> +{
> +	struct pollfd fds[MAX_SOCKS + 1];
> +	int i, ret, timeout, nfds;
> +
> +	memset(fds, 0, sizeof(fds));
> +
> +	for (i = 0; i < num_socks; i++) {
> +		fds[i].fd = xsks[i]->sfd;
> +		fds[i].events = POLLIN;
> +	}
> +	nfds = num_socks;
> +	timeout = 1000; /* 1 second */
> +
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, nfds, timeout);
> +			if (ret <= 0)
> +				continue;
> +		}
> +
> +		for (i = 0; i < num_socks; i++)
> +			rx_drop(xsks[i]);
> +	}
> +}
> +
> +static void tx_only(struct xdpsock *xsk)
> +{
> +	int timeout, ret, nfds = 1;
> +	struct pollfd fds[nfds + 1];
> +	unsigned int idx = 0;
> +
> +	memset(fds, 0, sizeof(fds));
> +	fds[0].fd = xsk->sfd;
> +	fds[0].events = POLLOUT;
> +	timeout = 1000; /* 1 second */
> +
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, nfds, timeout);
> +			if (ret <= 0)
> +				continue;
> +
> +			if (fds[0].fd != xsk->sfd ||
> +			    !(fds[0].revents & POLLOUT))
> +				continue;
> +		}
> +
> +		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
> +			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
> +
> +			xsk->outstanding_tx += BATCH_SIZE;
> +			idx += BATCH_SIZE;
> +			idx %= NUM_FRAMES;
> +		}
> +
> +		complete_tx_only(xsk);
> +	}
> +}
> +
> +static void l2fwd(struct xdpsock *xsk)
> +{
> +	for (;;) {
> +		struct xdp_desc descs[BATCH_SIZE];
> +		unsigned int rcvd, i;
> +		int ret;
> +
> +		for (;;) {
> +			complete_tx_l2fwd(xsk);
> +
> +			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +			if (rcvd > 0)
> +				break;
> +		}
> +
> +		for (i = 0; i < rcvd; i++) {
> +			char *pkt = xq_get_data(xsk, descs[i].idx,
> +						descs[i].offset);
> +
> +			swap_mac_addresses(pkt);
> +#if DEBUG_HEXDUMP
> +			char buf[32];
> +			u32 idx = descs[i].idx;
> +
> +			sprintf(buf, "idx=%d", idx);
> +			hex_dump(pkt, descs[i].len, buf);
> +#endif
> +		}
> +
> +		xsk->rx_npkts += rcvd;
> +
> +		ret = xq_enq(&xsk->tx, descs, rcvd);
> +		lassert(ret == 0);
> +		xsk->outstanding_tx += rcvd;
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +	char xdp_filename[256];
> +	int i, ret, key = 0;
> +	pthread_t pt;
> +
> +	parse_command_line(argc, argv);
> +
> +	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
> +			strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
> +
> +	if (load_bpf_file(xdp_filename)) {
> +		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (!prog_fd[0]) {
> +		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
> +			strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
> +		fprintf(stderr, "ERROR: link set xdp fd failed\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
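> +	/* map_fd[0] is the qidconf map: tell the XDP program which
> +	 * queue id to redirect from; other queues will get XDP_PASS.
> +	 */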
> +	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
> +	if (ret) {
> +		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Create sockets... */
> +	xsks[num_socks++] = xsk_configure(NULL);
> +
> +#if RR_LB
> +	for (i = 0; i < MAX_SOCKS - 1; i++)
> +		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
> +#endif
> +
> +	/* ...and insert them into the map. */
> +	for (i = 0; i < num_socks; i++) {
> +		key = i;
> +		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
> +		if (ret) {
> +			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	signal(SIGINT, int_exit);
> +	signal(SIGTERM, int_exit);
> +	signal(SIGABRT, int_exit);
> +
> +	setlocale(LC_ALL, "");
> +
> +	ret = pthread_create(&pt, NULL, poller, NULL);
> +	lassert(ret == 0);
> +
> +	prev_time = get_nsecs();
> +
> +	if (opt_bench == BENCH_RXDROP)
> +		rx_drop_all();
> +	else if (opt_bench == BENCH_TXONLY)
> +		tx_only(xsks[0]);
> +	else
> +		l2fwd(xsks[0]);
> +
> +	return 0;
> +}
> -- 
> 2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-23 23:21   ` Michael S. Tsirkin
@ 2018-04-23 23:59     ` Willem de Bruijn
  2018-04-24  8:08       ` Magnus Karlsson
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-23 23:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, we add another setsockopt for registered user memory (umem)
>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>
>> The queue is used to explicitly pass ownership of umem frames from the
>> user process to the kernel. These frames will in a later patch be
>> filled in with Rx packet data by the kernel.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>>  net/xdp/Makefile            |  2 +-
>>  net/xdp/xdp_umem.c          |  5 ++++
>>  net/xdp/xdp_umem.h          |  2 ++
>>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index 41252135a0fe..975661e1baca 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -23,6 +23,7 @@
>>
>>  /* XDP socket options */
>>  #define XDP_UMEM_REG                 3
>> +#define XDP_UMEM_FILL_RING           4
>>
>>  struct xdp_umem_reg {
>>       __u64 addr; /* Start of packet data area */
>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>       __u32 frame_headroom; /* Frame head room */
>>  };
>>
>> +/* Pgoff for mmaping the rings */
>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>> +
>> +struct xdp_ring {
>> +     __u32 producer __attribute__((aligned(64)));
>> +     __u32 consumer __attribute__((aligned(64)));
>> +};
>
> Why 64? And do you still need these guys in uapi?

I was just about to ask the same. You mean cacheline_aligned?

>> +static int xsk_mmap(struct file *file, struct socket *sock,
>> +                 struct vm_area_struct *vma)
>> +{
>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>> +     unsigned long size = vma->vm_end - vma->vm_start;
>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>> +     struct xsk_queue *q;
>> +     unsigned long pfn;
>> +     struct page *qpg;
>> +
>> +     if (!xs->umem)
>> +             return -EINVAL;
>> +
>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>> +             q = xs->umem->fq;
>> +     else
>> +             return -EINVAL;
>> +
>> +     qpg = virt_to_head_page(q->ring);

Is it assured that q is initialized with a call to setsockopt
XDP_UMEM_FILL_RING before the call to mmap?

In general, with such an extensive new API, it might be worthwhile to
run syzkaller locally on a kernel with these patches. It is pretty
easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
though it also needs to be taught about any new APIs.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (15 preceding siblings ...)
  2018-04-23 23:22 ` [PATCH bpf-next 00/15] Introducing AF_XDP support Michael S. Tsirkin
@ 2018-04-24  2:29 ` Jason Wang
  2018-04-24  8:44   ` Magnus Karlsson
  2018-04-24 17:03 ` Willem de Bruijn
  17 siblings, 1 reply; 54+ messages in thread
From: Jason Wang @ 2018-04-24  2:29 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang



On 2018-04-23 21:56, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> [...]
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.9(3.0)   9.4(9.3)
> txpush       2.5(2.2)   NA*
> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)

This number does not look very exciting. I can get ~3 Mpps when using
testpmd in a guest with xdp_redirect.sh on the host, between ixgbe and
TAP/vhost. I believe we can get even better performance without virt.
It would be interesting to compare this performance with e.g. testpmd +
virtio_user (vhost_kernel) + XDP.

>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.1(2.2)   3.3(3.1)
> l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
>    in this RFC. This is for a future patch set since it involves
>    changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>    Dangaard Brouer.
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
>
> Changes from RFC V2:
>
> * Optimizations and simplifications to the ring structures inspired by
>    ptr_ring.h
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>    consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
>
> The structure of the patch set is as follows:
>
> Patches 1-2: Basic socket and umem plumbing
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
>
> Questions:
>
> * How to deal with cache alignment for uapi when different
>    architectures can have different cache line sizes? We have just
>    aligned it to 64 bytes for now, which works for many popular
>    architectures, but not all. Please advise.
>
> To do:
>
> * Optimize performance
>
> * Kernel selftest
>
> Post-series plan:
>
> * Kernel load module support of AF_XDP would be nice. Unclear how to
>    achieve this though since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>    case all the traffic on a queue should go up to the user space socket.

I think we probably need this in the case of TUN XDP for virt guest too.

Thanks

> [...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-23 23:22 ` [PATCH bpf-next 00/15] Introducing AF_XDP support Michael S. Tsirkin
@ 2018-04-24  6:55   ` Björn Töpel
  2018-04-24  7:27     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 54+ messages in thread
From: Björn Töpel @ 2018-04-24  6:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

2018-04-24 1:22 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
>> [...]
>
> Is there a chance of Documentation/networking/af_xdp.txt ?
>

Yes. :-) We'll add that to the next spin!

>
>>
>> --
>> 2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 20:26           ` Michael S. Tsirkin
@ 2018-04-24  7:01             ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-24  7:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

2018-04-23 22:26 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 10:15:18PM +0200, Björn Töpel wrote:
>> 2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>> > On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
>> >> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>> >>
>> >> [...]
>> >>
>> >> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> >> >> +{
>> >> >> +     unsigned int i;
>> >> >> +
>> >> >> +     if (umem->pgs) {
>> >> >> +             for (i = 0; i < umem->npgs; i++)
>> >> >
>> >> > Since you pin them with FOLL_WRITE, I assume these pages
>> >> > are written to.
>> >> > Don't you need set_page_dirty_lock here?
>> >> >
>> >>
>> >> Hmm, I actually *removed* it from the RFC V2, but after doing some
>> >> homework, I think you're right. Thanks for pointing this out!
>> >>
>> >> Thinking more about this; This function is called from sk_destruct,
>> >> and in the Tx case the sk_destruct can be called from interrupt
>> >> context, where set_page_dirty_lock cannot be called.
>> >>
>> >> Are there any preferred ways of solving this? Scheduling the whole
>> >> xsk_destruct call to a workqueue is one way (I think). Any
>> >> cleaner/better way?
>> >>
>> >> [...]
>> >
>> > Defer unpinning pages until the next tx call?
>> >
>>
>> If the sock is released, there wont be another tx call.
>
> unpin them on socket release too?
>

AF_XDP pins all memory up front, and unpins it when the socket is
released (final sock_put), which in this case is in the skb
destructor. So there's no later point from a sock lifetime
perspective.

I'll make a stab at deferring the umem cleanup to a workqueue.
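Roughly along these lines (a sketch only; the work member and the
helper names are made up, not from this patch set):

static void xdp_umem_release_deferred(struct work_struct *work)
{
	struct xdp_umem *umem = container_of(work, struct xdp_umem, work);

	/* Process context, so set_page_dirty_lock() is safe here. */
	xdp_umem_unpin_pages(umem);
	kfree(umem);
}

static void xdp_umem_put(struct xdp_umem *umem)
{
	if (atomic_dec_and_test(&umem->users)) {
		INIT_WORK(&umem->work, xdp_umem_release_deferred);
		schedule_work(&umem->work);
	}
}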

>> Or am I
>> missing something obvious?
>>
>> >
>> >> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> >> >> +{
>> >> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> >> >> +     u64 addr = mr->addr, size = mr->len;
>> >> >> +     unsigned int nframes;
>> >> >> +     int size_chk, err;
>> >> >> +
>> >> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> >> >> +             /* Strictly speaking we could support this, if:
>> >> >> +              * - huge pages, or*
>> >> >
>> >> > what does "or*" here mean?
>> >> >
>> >>
>> >> Oops, I'll change to just 'or' in the next revision.
>> >>
>> >>
>> >> Thanks!
>> >> Björn

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-24  6:55   ` Björn Töpel
@ 2018-04-24  7:27     ` Jesper Dangaard Brouer
  2018-04-24  7:33       ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-24  7:27 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Michael S. Tsirkin, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z, brouer

On Tue, 24 Apr 2018 08:55:33 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> > Is there a chance of Documentation/networking/af_xdp.txt ?
> >  
> 
> Yes. :-) We'll add that to the next spin!

Could we please create it using RST format (ReStructuredText) from the
start?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 23:04   ` Willem de Bruijn
@ 2018-04-24  7:30     ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-24  7:30 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

2018-04-24 1:04 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> In this commit the base structure of the AF_XDP address family is set
>> up. Further, we introduce the ability to register a window of user memory
>> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
>> window is viewed by an AF_XDP socket as a set of equally large
>> frames. After a user memory registration all frames are "owned" by the
>> user application, and not the kernel.
>>
>> Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>
>> +static void xdp_umem_release(struct xdp_umem *umem)
>> +{
>> +       struct task_struct *task;
>> +       struct mm_struct *mm;
>> +       unsigned long diff;
>> +
>> +       if (umem->pgs) {
>> +               xdp_umem_unpin_pages(umem);
>> +
>> +               task = get_pid_task(umem->pid, PIDTYPE_PID);
>> +               put_pid(umem->pid);
>> +               if (!task)
>> +                       goto out;
>> +               mm = get_task_mm(task);
>> +               put_task_struct(task);
>> +               if (!mm)
>> +                       goto out;
>> +
>> +               diff = umem->size >> PAGE_SHIFT;
>
> Need to round up or size must always be a multiple of PAGE_SIZE.
>

Yes, you're right! I'll add constraints to the umem setup. See further
down in the reply.

>> +
>> +               down_write(&mm->mmap_sem);
>> +               mm->pinned_vm -= diff;
>> +               up_write(&mm->mmap_sem);
>
> When using user->locked_vm for resource limit checks, no need
> to also update mm->pinned_vm?
>

Hmm, dug around in the code, and it looks like you're correct -- i.e.
if user->locked_vm is used, we shouldn't update the mm->pinned_vm.
I'll need to check a bit more, so that I'm certain, but if so, I'll
remove it in the next revision.

>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> +       u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> +       u64 addr = mr->addr, size = mr->len;
>> +       unsigned int nframes;
>> +       int size_chk, err;
>> +
>> +       if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +               /* Strictly speaking we could support this, if:
>> +                * - huge pages, or*
>> +                * - using an IOMMU, or
>> +                * - making sure the memory area is consecutive
>> +                * but for now, we simply say "computer says no".
>> +                */
>> +               return -EINVAL;
>> +       }
>
> Ideally, AF_XDP subsumes all packet socket use cases. It does not
> have packet v3's small packet optimizations of variable sized frames
> and block signaling.
>
> I don't suggest adding that now. But for the non-zerocopy case, it may
> make sense to ensure that nothing is blocking a later addition of these
> features. Especially for header-only (snaplen) workloads. So far, I don't
> see any issues.
>

Ok. Block signaling is sort of ring batching, so I think we're good
for that case. As for variable sized frames *within* a umem, that's
trickier. To support different sizes, one would need multiple umems
(and multiple queues) -- if that makes sense?

>> +       if (!is_power_of_2(frame_size))
>> +               return -EINVAL;
>> +
>> +       if (!PAGE_ALIGNED(addr)) {
>> +               /* Memory area has to be page size aligned. For
>> +                * simplicity, this might change.
>> +                */
>> +               return -EINVAL;
>> +       }
>> +
>> +       if ((addr + size) < addr)
>> +               return -EINVAL;
>> +
>> +       nframes = size / frame_size;
>> +       if (nframes == 0 || nframes > UINT_MAX)
>> +               return -EINVAL;
>
> You may also want a check here that nframes * frame_size is at least
> PAGE_SIZE and probably a multiple of that.
>

Yup! I'll add those checks. This will make the "diff shift" in the
release code safe as well. Thanks!
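Probably something like this in __xdp_umem_reg() (sketch only, exact
placement may differ):

	if (!PAGE_ALIGNED(size))
		return -EINVAL;	/* whole pages only, so size >> PAGE_SHIFT is exact */

	nframes = size / frame_size;
	if (nframes == 0 || (u64)nframes * frame_size != size)
		return -EINVAL;	/* no partial trailing frame */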

>> +       frame_headroom = ALIGN(frame_headroom, 64);
>> +
>> +       size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
>> +       if (size_chk < 0)
>> +               return -EINVAL;
>> +
>> +       umem->pid = get_task_pid(current, PIDTYPE_PID);
>> +       umem->size = (size_t)size;
>> +       umem->address = (unsigned long)addr;
>> +       umem->props.frame_size = frame_size;
>> +       umem->props.nframes = nframes;
>> +       umem->frame_headroom = frame_headroom;
>> +       umem->npgs = size / PAGE_SIZE;
>> +       umem->pgs = NULL;
>> +       umem->user = NULL;
>> +
>> +       umem->frame_size_log2 = ilog2(frame_size);
>> +       umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
>> +       umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
>> +       atomic_set(&umem->users, 1);
>> +
>> +       err = xdp_umem_account_pages(umem);
>> +       if (err)
>> +               goto out;
>> +
>> +       err = xdp_umem_pin_pages(umem);
>> +       if (err)
>
> need to call xdp_umem_unaccount_pages on error

Indeed! I'll fix that!
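I.e., roughly (sketch):

	err = xdp_umem_pin_pages(umem);
	if (err)
		goto out_account;

	return 0;

out_account:
	xdp_umem_unaccount_pages(umem);
out:
	put_pid(umem->pid);
	return err;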

>> +               goto out;
>> +       return 0;
>> +
>> +out:
>> +       put_pid(umem->pid);
>> +       return err;
>> +}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-24  7:27     ` Jesper Dangaard Brouer
@ 2018-04-24  7:33       ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-24  7:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Michael S. Tsirkin, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

2018-04-24 9:27 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Tue, 24 Apr 2018 08:55:33 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> > Is there a chance of Documentation/networking/af_xdp.txt ?
>> >
>>
>> Yes. :-) We'll add that to the next spin!
>
> Could we please create it using RST format (ReStructuredText) from the
> start?
>

Good point! We'll do a Documentation/net/af_xdp.rst instead of a plain text file!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-23 23:59     ` Willem de Bruijn
@ 2018-04-24  8:08       ` Magnus Karlsson
  2018-04-24 16:55         ` Willem de Bruijn
  0 siblings, 1 reply; 54+ messages in thread
From: Magnus Karlsson @ 2018-04-24  8:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Michael S. Tsirkin, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Tue, Apr 24, 2018 at 1:59 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>>
>>> Here, we add another setsockopt for registered user memory (umem)
>>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>>
>>> The queue is used to explicitly pass ownership of umem frames from the
>>> user process to the kernel. These frames will in a later patch be
>>> filled in with Rx packet data by the kernel.
>>>
>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>>> ---
>>>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>>>  net/xdp/Makefile            |  2 +-
>>>  net/xdp/xdp_umem.c          |  5 ++++
>>>  net/xdp/xdp_umem.h          |  2 ++
>>>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>>>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>>  create mode 100644 net/xdp/xsk_queue.c
>>>  create mode 100644 net/xdp/xsk_queue.h
>>>
>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>> index 41252135a0fe..975661e1baca 100644
>>> --- a/include/uapi/linux/if_xdp.h
>>> +++ b/include/uapi/linux/if_xdp.h
>>> @@ -23,6 +23,7 @@
>>>
>>>  /* XDP socket options */
>>>  #define XDP_UMEM_REG                 3
>>> +#define XDP_UMEM_FILL_RING           4
>>>
>>>  struct xdp_umem_reg {
>>>       __u64 addr; /* Start of packet data area */
>>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>>       __u32 frame_headroom; /* Frame head room */
>>>  };
>>>
>>> +/* Pgoff for mmaping the rings */
>>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>>> +
>>> +struct xdp_ring {
>>> +     __u32 producer __attribute__((aligned(64)));
>>> +     __u32 consumer __attribute__((aligned(64)));
>>> +};
>>
>> Why 64? And do you still need these guys in uapi?
>
> I was just about to ask the same. You mean cacheline_aligned?

Yes, I would like to have these cache aligned. How can I accomplish
this in a uapi?
I put a note around this in the cover letter:

* How to deal with cache alignment for uapi when different
  architectures can have different cache line sizes? We have just
  aligned it to 64 bytes for now, which works for many popular
  architectures, but not all. Please advise.
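One idea, just a sketch that names the assumption rather than solving
it, would be to make the alignment an explicit uapi constant instead of
scattering bare 64s:

/* Assumed safe upper bound for the cache line size of the
 * architectures we care about; not auto-detected per arch.
 */
#define XDP_RING_ALIGN	64

struct xdp_ring {
	__u32 producer __attribute__((aligned(XDP_RING_ALIGN)));
	__u32 consumer __attribute__((aligned(XDP_RING_ALIGN)));
};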

>
>>> +static int xsk_mmap(struct file *file, struct socket *sock,
>>> +                 struct vm_area_struct *vma)
>>> +{
>>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>>> +     unsigned long size = vma->vm_end - vma->vm_start;
>>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>>> +     struct xsk_queue *q;
>>> +     unsigned long pfn;
>>> +     struct page *qpg;
>>> +
>>> +     if (!xs->umem)
>>> +             return -EINVAL;
>>> +
>>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>>> +             q = xs->umem->fq;
>>> +     else
>>> +             return -EINVAL;
>>> +
>>> +     qpg = virt_to_head_page(q->ring);
>
> Is it assured that q is initialized with a call to setsockopt
> XDP_UMEM_FILL_RING before the call to mmap?

Unfortunately not, so this is a bug. Definitely a case in point for
running syzkaller, as you suggest below.
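The minimal guard would look something like this (sketch; whether more
synchronization against the setsockopt path is needed still has to be
checked):

	if (offset == XDP_UMEM_PGOFF_FILL_RING)
		q = READ_ONCE(xs->umem->fq);
	else
		return -EINVAL;

	/* The ring only exists after the XDP_UMEM_FILL_RING setsockopt. */
	if (!q)
		return -EINVAL;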

> In general, with such an extensive new API, it might be worthwhile to
> run syzkaller locally on a kernel with these patches. It is pretty
> easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
> though it also needs to be taught about any new APIs.

Good idea. Will set this up and have it torture the API.

Thanks: Magnus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
  2018-04-23 23:31   ` Michael S. Tsirkin
@ 2018-04-24  8:22     ` Magnus Karlsson
  0 siblings, 0 replies; 54+ messages in thread
From: Magnus Karlsson @ 2018-04-24  8:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z, Björn Töpel

On Tue, Apr 24, 2018 at 1:31 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> This is a sample application for AF_XDP sockets. The application
>> supports three different modes of operation: rxdrop, txonly and l2fwd.
>>
>> To showcase a simple round-robin load-balancing between a set of
>> sockets in an xskmap, set the RR_LB compile time define option to 1 in
>> "xdpsock.h".
>>
>> Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>>  samples/bpf/Makefile       |   4 +
>>  samples/bpf/xdpsock.h      |  11 +
>>  samples/bpf/xdpsock_kern.c |  56 +++
>>  samples/bpf/xdpsock_user.c | 947 +++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 1018 insertions(+)
>>  create mode 100644 samples/bpf/xdpsock.h
>>  create mode 100644 samples/bpf/xdpsock_kern.c
>>  create mode 100644 samples/bpf/xdpsock_user.c
>>
>> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
>> index aa8c392e2e52..d0ddc1abf20d 100644
>> --- a/samples/bpf/Makefile
>> +++ b/samples/bpf/Makefile
>> @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
>>  hostprogs-y += syscall_tp
>>  hostprogs-y += cpustat
>>  hostprogs-y += xdp_adjust_tail
>> +hostprogs-y += xdpsock
>>
>>  # Libbpf dependencies
>>  LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
>> @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
>>  syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>>  cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
>>  xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
>> +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
>>
>>  # Tell kbuild to always build the programs
>>  always := $(hostprogs-y)
>> @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
>>  always += syscall_tp_kern.o
>>  always += cpustat_kern.o
>>  always += xdp_adjust_tail_kern.o
>> +always += xdpsock_kern.o
>>
>>  HOSTCFLAGS += -I$(objtree)/usr/include
>>  HOSTCFLAGS += -I$(srctree)/tools/lib/
>> @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
>>  HOSTLOADLIBES_syscall_tp += -lelf
>>  HOSTLOADLIBES_cpustat += -lelf
>>  HOSTLOADLIBES_xdp_adjust_tail += -lelf
>> +HOSTLOADLIBES_xdpsock += -lelf -pthread
>>
>>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
>> diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
>> new file mode 100644
>> index 000000000000..533ab81adfa1
>> --- /dev/null
>> +++ b/samples/bpf/xdpsock.h
>> @@ -0,0 +1,11 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef XDPSOCK_H_
>> +#define XDPSOCK_H_
>> +
>> +/* Power-of-2 number of sockets */
>> +#define MAX_SOCKS 4
>> +
>> +/* Round-robin receive */
>> +#define RR_LB 0
>> +
>> +#endif /* XDPSOCK_H_ */
>> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
>> new file mode 100644
>> index 000000000000..d8806c41362e
>> --- /dev/null
>> +++ b/samples/bpf/xdpsock_kern.c
>> @@ -0,0 +1,56 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#define KBUILD_MODNAME "foo"
>> +#include <uapi/linux/bpf.h>
>> +#include "bpf_helpers.h"
>> +
>> +#include "xdpsock.h"
>> +
>> +struct bpf_map_def SEC("maps") qidconf_map = {
>> +     .type           = BPF_MAP_TYPE_ARRAY,
>> +     .key_size       = sizeof(int),
>> +     .value_size     = sizeof(int),
>> +     .max_entries    = 1,
>> +};
>> +
>> +struct bpf_map_def SEC("maps") xsks_map = {
>> +     .type = BPF_MAP_TYPE_XSKMAP,
>> +     .key_size = sizeof(int),
>> +     .value_size = sizeof(int),
>> +     .max_entries = 4,
>> +};
>> +
>> +struct bpf_map_def SEC("maps") rr_map = {
>> +     .type = BPF_MAP_TYPE_PERCPU_ARRAY,
>> +     .key_size = sizeof(int),
>> +     .value_size = sizeof(unsigned int),
>> +     .max_entries = 1,
>> +};
>> +
>> +SEC("xdp_sock")
>> +int xdp_sock_prog(struct xdp_md *ctx)
>> +{
>> +     int *qidconf, key = 0, idx;
>> +     unsigned int *rr;
>> +
>> +     qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
>> +     if (!qidconf)
>> +             return XDP_ABORTED;
>> +
>> +     if (*qidconf != ctx->rx_queue_index)
>> +             return XDP_PASS;
>> +
>> +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
>> +     rr = bpf_map_lookup_elem(&rr_map, &key);
>> +     if (!rr)
>> +             return XDP_ABORTED;
>> +
>> +     *rr = (*rr + 1) & (MAX_SOCKS - 1);
>> +     idx = *rr;
>> +#else
>> +     idx = 0;
>> +#endif
>> +
>> +     return bpf_redirect_map(&xsks_map, idx, 0);
>> +}
>> +
>> +char _license[] SEC("license") = "GPL";
>> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
>> new file mode 100644
>> index 000000000000..690bac1a0ab7
>> --- /dev/null
>> +++ b/samples/bpf/xdpsock_user.c
>> @@ -0,0 +1,947 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2017 - 2018 Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <assert.h>
>> +#include <errno.h>
>> +#include <getopt.h>
>> +#include <libgen.h>
>> +#include <linux/bpf.h>
>> +#include <linux/if_link.h>
>> +#include <linux/if_xdp.h>
>> +#include <linux/if_ether.h>
>> +#include <net/if.h>
>> +#include <signal.h>
>> +#include <stdbool.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <net/ethernet.h>
>> +#include <sys/resource.h>
>> +#include <sys/socket.h>
>> +#include <sys/mman.h>
>> +#include <time.h>
>> +#include <unistd.h>
>> +#include <pthread.h>
>> +#include <locale.h>
>> +#include <sys/types.h>
>> +#include <poll.h>
>> +
>> +#include "bpf_load.h"
>> +#include "bpf_util.h"
>> +#include "libbpf.h"
>> +
>> +#include "xdpsock.h"
>> +
>> +#ifndef SOL_XDP
>> +#define SOL_XDP 283
>> +#endif
>> +
>> +#ifndef AF_XDP
>> +#define AF_XDP 44
>> +#endif
>> +
>> +#ifndef PF_XDP
>> +#define PF_XDP AF_XDP
>> +#endif
>> +
>> +#define NUM_FRAMES 131072
>> +#define FRAME_HEADROOM 0
>> +#define FRAME_SIZE 2048
>> +#define NUM_DESCS 1024
>> +#define BATCH_SIZE 16
>> +
>> +#define FQ_NUM_DESCS 1024
>> +#define CQ_NUM_DESCS 1024
>> +
>> +#define DEBUG_HEXDUMP 0
>> +
>> +typedef __u32 u32;
>> +
>> +static unsigned long prev_time;
>> +
>> +enum benchmark_type {
>> +     BENCH_RXDROP = 0,
>> +     BENCH_TXONLY = 1,
>> +     BENCH_L2FWD = 2,
>> +};
>> +
>> +static enum benchmark_type opt_bench = BENCH_RXDROP;
>> +static u32 opt_xdp_flags;
>> +static const char *opt_if = "";
>> +static int opt_ifindex;
>> +static int opt_queue;
>> +static int opt_poll;
>> +static int opt_shared_packet_buffer;
>> +static int opt_interval = 1;
>> +
>> +struct xdp_umem_uqueue {
>> +     u32 cached_prod;
>> +     u32 cached_cons;
>> +     u32 mask;
>> +     u32 size;
>> +     struct xdp_umem_ring *ring;
>> +};
>> +
>> +struct xdp_umem {
>> +     char (*frames)[FRAME_SIZE];
>> +     struct xdp_umem_uqueue fq;
>> +     struct xdp_umem_uqueue cq;
>> +     int fd;
>> +};
>> +
>> +struct xdp_uqueue {
>> +     u32 cached_prod;
>> +     u32 cached_cons;
>> +     u32 mask;
>> +     u32 size;
>> +     struct xdp_rxtx_ring *ring;
>> +};
>> +
>> +struct xdpsock {
>> +     struct xdp_uqueue rx;
>> +     struct xdp_uqueue tx;
>> +     int sfd;
>> +     struct xdp_umem *umem;
>> +     u32 outstanding_tx;
>> +     unsigned long rx_npkts;
>> +     unsigned long tx_npkts;
>> +     unsigned long prev_rx_npkts;
>> +     unsigned long prev_tx_npkts;
>> +};
>> +
>> +#define MAX_SOCKS 4
>> +static int num_socks;
>> +struct xdpsock *xsks[MAX_SOCKS];
>> +
>> +static unsigned long get_nsecs(void)
>> +{
>> +     struct timespec ts;
>> +
>> +     clock_gettime(CLOCK_MONOTONIC, &ts);
>> +     return ts.tv_sec * 1000000000UL + ts.tv_nsec;
>> +}
>> +
>> +static void dump_stats(void);
>> +
>> +#define lassert(expr)                                                        \
>> +     do {                                                            \
>> +             if (!(expr)) {                                          \
>> +                     fprintf(stderr, "%s:%s:%i: Assertion failed: "  \
>> +                             #expr ": errno: %d/\"%s\"\n",           \
>> +                             __FILE__, __func__, __LINE__,           \
>> +                             errno, strerror(errno));                \
>> +                     dump_stats();                                   \
>> +                     exit(EXIT_FAILURE);                             \
>> +             }                                                       \
>> +     } while (0)
>> +
>> +#define barrier() __asm__ __volatile__("": : :"memory")
>> +#define u_smp_rmb() barrier()
>> +#define u_smp_wmb() barrier()
>> +#define likely(x) __builtin_expect(!!(x), 1)
>> +#define unlikely(x) __builtin_expect(!!(x), 0)
>> +
>> +static const char pkt_data[] =
>> +     "\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
>> +     "\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
>> +     "\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
>> +     "\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
>> +
>> +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
>> +{
>> +     u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
>> +
>> +     if (free_entries >= nb)
>> +             return free_entries;
>> +
>> +     /* Refresh the local tail pointer */
>> +     q->cached_cons = q->ring->ptrs.consumer;
>> +
>> +     return q->size - (q->cached_prod - q->cached_cons);
>> +}
>> +
>> +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
>> +{
>> +     u32 free_entries = q->cached_cons - q->cached_prod;
>> +
>> +     if (free_entries >= ndescs)
>> +             return free_entries;
>> +
>> +     /* Refresh the local tail pointer */
>> +     q->cached_cons = q->ring->ptrs.consumer + q->size;
>> +     return q->cached_cons - q->cached_prod;
>> +}
>> +
>> +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
>> +{
>> +     u32 entries = q->cached_prod - q->cached_cons;
>> +
>> +     if (entries == 0)
>> +             q->cached_prod = q->ring->ptrs.producer;
>> +
>> +     entries = q->cached_prod - q->cached_cons;
>> +
>> +     return (entries > nb) ? nb : entries;
>> +}
>> +
>> +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
>> +{
>> +     u32 entries = q->cached_prod - q->cached_cons;
>> +
>> +     if (entries == 0)
>> +             q->cached_prod = q->ring->ptrs.producer;
>> +
>> +     entries = q->cached_prod - q->cached_cons;
>> +     return (entries > ndescs) ? ndescs : entries;
>> +}
>> +
>> +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
>> +                                      struct xdp_desc *d,
>> +                                      size_t nb)
>> +{
>> +     u32 i;
>> +
>> +     if (umem_nb_free(fq, nb) < nb)
>> +             return -ENOSPC;
>> +
>> +     for (i = 0; i < nb; i++) {
>> +             u32 idx = fq->cached_prod++ & fq->mask;
>> +
>> +             fq->ring->desc[idx] = d[i].idx;
>> +     }
>> +
>> +     u_smp_wmb();
>> +
>> +     fq->ring->ptrs.producer = fq->cached_prod;
>> +
>> +     return 0;
>> +}
>> +
>> +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
>> +                                   size_t nb)
>> +{
>> +     u32 i;
>> +
>> +     if (umem_nb_free(fq, nb) < nb)
>> +             return -ENOSPC;
>> +
>> +     for (i = 0; i < nb; i++) {
>> +             u32 idx = fq->cached_prod++ & fq->mask;
>> +
>> +             fq->ring->desc[idx] = d[i];
>> +     }
>> +
>> +     u_smp_wmb();
>> +
>> +     fq->ring->ptrs.producer = fq->cached_prod;
>> +
>> +     return 0;
>> +}
>> +
>> +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
>> +                                            u32 *d, size_t nb)
>> +{
>> +     u32 idx, i, entries = umem_nb_avail(cq, nb);
>> +
>> +     u_smp_rmb();
>> +
>> +     for (i = 0; i < entries; i++) {
>> +             idx = cq->cached_cons++ & cq->mask;
>> +             d[i] = cq->ring->desc[idx];
>> +     }
>> +
>> +     if (entries > 0) {
>> +             u_smp_wmb();
>> +
>> +             cq->ring->ptrs.consumer = cq->cached_cons;
>> +     }
>> +
>> +     return entries;
>> +}
>> +
>> +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
>> +{
>> +     lassert(idx < NUM_FRAMES);
>> +     return &xsk->umem->frames[idx][off];
>> +}
>> +
>> +static inline int xq_enq(struct xdp_uqueue *uq,
>> +                      const struct xdp_desc *descs,
>> +                      unsigned int ndescs)
>> +{
>> +     struct xdp_rxtx_ring *r = uq->ring;
>> +     unsigned int i;
>> +
>> +     if (xq_nb_free(uq, ndescs) < ndescs)
>> +             return -ENOSPC;
>> +
>> +     for (i = 0; i < ndescs; i++) {
>> +             u32 idx = uq->cached_prod++ & uq->mask;
>> +
>> +             r->desc[idx].idx = descs[i].idx;
>> +             r->desc[idx].len = descs[i].len;
>> +             r->desc[idx].offset = descs[i].offset;
>> +     }
>> +
>> +     u_smp_wmb();
>> +
>> +     r->ptrs.producer = uq->cached_prod;
>> +     return 0;
>> +}
>> +
>> +static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
>> +                              __u32 idx, unsigned int ndescs)
>> +{
>> +     struct xdp_rxtx_ring *q = uq->ring;
>> +     unsigned int i;
>> +
>> +     if (xq_nb_free(uq, ndescs) < ndescs)
>> +             return -ENOSPC;
>> +
>> +     for (i = 0; i < ndescs; i++) {
>> +             u32 slot = uq->cached_prod++ & uq->mask;
>> +
>> +             q->desc[slot].idx       = idx + i;      /* frame id, not ring slot */
>> +             q->desc[slot].len       = sizeof(pkt_data) - 1;
>> +             q->desc[slot].offset    = 0;
>> +     }
>> +
>> +     u_smp_wmb();
>> +
>> +     q->ptrs.producer = uq->cached_prod;
>> +     return 0;
>> +}
>> +
>> +static inline int xq_deq(struct xdp_uqueue *uq,
>> +                      struct xdp_desc *descs,
>> +                      int ndescs)
>> +{
>> +     struct xdp_rxtx_ring *r = uq->ring;
>> +     unsigned int idx;
>> +     int i, entries;
>> +
>> +     entries = xq_nb_avail(uq, ndescs);
>> +
>> +     u_smp_rmb();
>> +
>> +     for (i = 0; i < entries; i++) {
>> +             idx = uq->cached_cons++ & uq->mask;
>> +             descs[i] = r->desc[idx];
>> +     }
>> +
>> +     if (entries > 0) {
>> +             u_smp_wmb();
>> +
>> +             r->ptrs.consumer = uq->cached_cons;
>> +     }
>> +
>> +     return entries;
>> +}
>
> Interesting, I was under the impression that you were
> planning to get rid of consumer/producer counters
> and validate the descriptors instead.
>
> That's the ptr_ring design.
>
> You can then drop all the code around synchronising
> counter caches, as well as smp_rmb barriers.

We evaluated the current producer/consumer ring vs a
version of the ptr_ring modified for our purposes in a previous
mail thread (https://patchwork.ozlabs.org/patch/891713/)
and came to the conclusion that adopting everything in ptr_ring
was not better. That is why we have kept the prod/cons ring.

Note that we did adopt a number of things from your design, but
not the approach of validating a descriptor by checking for a zero
in a specific field. It did not provide a performance benefit for our
balanced test cases and performed worse in the contended
corner cases.
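
For reference, here is roughly what the two designs look like side by
side (illustrative pseudocode only, not the actual kernel or sample
code; the validation variant assumes a value of zero can be reserved
to mean "empty slot"):

/* A: shared producer/consumer indices (the design we kept). The
 * consumer re-reads the shared producer index only when its cached
 * copy runs dry, and must order that read against the data reads.
 */
static int ring_pop_counters(struct ring *r, u32 *val)
{
        if (r->cached_cons == r->cached_prod)
                r->cached_prod = READ_ONCE(r->shared->producer);
        if (r->cached_cons == r->cached_prod)
                return -EAGAIN;
        smp_rmb(); /* read the index before reading the data */
        *val = r->slots[r->cached_cons++ & r->mask];
        WRITE_ONCE(r->shared->consumer, r->cached_cons);
        return 0;
}

/* B: ptr_ring-style validation. A zero slot means "empty", so each
 * side touches only the slots and its own private index; there are
 * no shared counters and no counter caches to refresh.
 */
static int ring_pop_validate(struct ring *r, u32 *val)
{
        u32 idx = r->cons & r->mask;
        u32 entry = READ_ONCE(r->slots[idx]);

        if (!entry)
                return -EAGAIN;
        *val = entry;
        WRITE_ONCE(r->slots[idx], 0); /* hand the slot back to the producer */
        r->cons++;
        return 0;
}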

>
>> +
>> +static void swap_mac_addresses(void *data)
>> +{
>> +     struct ether_header *eth = (struct ether_header *)data;
>> +     struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
>> +     struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
>> +     struct ether_addr tmp;
>> +
>> +     tmp = *src_addr;
>> +     *src_addr = *dst_addr;
>> +     *dst_addr = tmp;
>> +}
>> +
>> +#if DEBUG_HEXDUMP
>> +static void hex_dump(void *pkt, size_t length, const char *prefix)
>> +{
>> +     int i = 0;
>> +     const unsigned char *address = (unsigned char *)pkt;
>> +     const unsigned char *line = address;
>> +     size_t line_size = 32;
>> +     unsigned char c;
>> +
>> +     printf("length = %zu\n", length);
>> +     printf("%s | ", prefix);
>> +     while (length-- > 0) {
>> +             printf("%02X ", *address++);
>> +             if (!(++i % line_size) || (length == 0 && i % line_size)) {
>> +                     if (length == 0) {
>> +                             while (i++ % line_size)
>> +                                     printf("__ ");
>> +                     }
>> +                     printf(" | ");  /* right close */
>> +                     while (line < address) {
>> +                             c = *line++;
>> +                             printf("%c", (c < 33 || c == 255) ? 0x2E : c);
>> +                     }
>> +                     printf("\n");
>> +                     if (length > 0)
>> +                             printf("%s | ", prefix);
>> +             }
>> +     }
>> +     printf("\n");
>> +}
>> +#endif
>> +
>> +static size_t gen_eth_frame(char *frame)
>> +{
>> +     memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
>> +     return sizeof(pkt_data) - 1;
>> +}
>> +
>> +static struct xdp_umem *xdp_umem_configure(int sfd)
>> +{
>> +     int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
>> +     struct xdp_umem_reg mr;
>> +     struct xdp_umem *umem;
>> +     void *bufs;
>> +
>> +     umem = calloc(1, sizeof(*umem));
>> +     lassert(umem);
>> +
>> +     lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
>> +                            NUM_FRAMES * FRAME_SIZE) == 0);
>> +
>> +     mr.addr = (__u64)bufs;
>> +     mr.len = NUM_FRAMES * FRAME_SIZE;
>> +     mr.frame_size = FRAME_SIZE;
>> +     mr.frame_headroom = FRAME_HEADROOM;
>> +
>> +     lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
>> +     lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
>> +                        sizeof(int)) == 0);
>> +     lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
>> +                        sizeof(int)) == 0);
>> +
>> +     umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
>> +                          FQ_NUM_DESCS * sizeof(u32),
>> +                          PROT_READ | PROT_WRITE,
>> +                          MAP_SHARED | MAP_POPULATE, sfd,
>> +                          XDP_UMEM_PGOFF_FILL_RING);
>> +     lassert(umem->fq.ring != MAP_FAILED);
>> +
>> +     umem->fq.mask = FQ_NUM_DESCS - 1;
>> +     umem->fq.size = FQ_NUM_DESCS;
>> +
>> +     umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
>> +                          CQ_NUM_DESCS * sizeof(u32),
>> +                          PROT_READ | PROT_WRITE,
>> +                          MAP_SHARED | MAP_POPULATE, sfd,
>> +                          XDP_UMEM_PGOFF_COMPLETION_RING);
>> +     lassert(umem->cq.ring != MAP_FAILED);
>> +
>> +     umem->cq.mask = CQ_NUM_DESCS - 1;
>> +     umem->cq.size = CQ_NUM_DESCS;
>> +
>> +     umem->frames = (char (*)[FRAME_SIZE])bufs;
>> +     umem->fd = sfd;
>> +
>> +     if (opt_bench == BENCH_TXONLY) {
>> +             int i;
>> +
>> +             for (i = 0; i < NUM_FRAMES; i++)
>> +                     (void)gen_eth_frame(&umem->frames[i][0]);
>> +     }
>> +
>> +     return umem;
>> +}
>> +
>> +static struct xdpsock *xsk_configure(struct xdp_umem *umem)
>> +{
>> +     struct sockaddr_xdp sxdp = {};
>> +     int sfd, ndescs = NUM_DESCS;
>> +     struct xdpsock *xsk;
>> +     bool shared = true;
>> +     u32 i;
>> +
>> +     sfd = socket(PF_XDP, SOCK_RAW, 0);
>> +     lassert(sfd >= 0);
>> +
>> +     xsk = calloc(1, sizeof(*xsk));
>> +     lassert(xsk);
>> +
>> +     xsk->sfd = sfd;
>> +     xsk->outstanding_tx = 0;
>> +
>> +     if (!umem) {
>> +             shared = false;
>> +             xsk->umem = xdp_umem_configure(sfd);
>> +     } else {
>> +             xsk->umem = umem;
>> +     }
>> +
>> +     lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
>> +                        &ndescs, sizeof(int)) == 0);
>> +     lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
>> +                        &ndescs, sizeof(int)) == 0);
>> +
>> +     /* Rx */
>> +     xsk->rx.ring = mmap(NULL,
>> +                         sizeof(struct xdp_ring) +
>> +                         NUM_DESCS * sizeof(struct xdp_desc),
>> +                         PROT_READ | PROT_WRITE,
>> +                         MAP_SHARED | MAP_POPULATE, sfd,
>> +                         XDP_PGOFF_RX_RING);
>> +     lassert(xsk->rx.ring != MAP_FAILED);
>> +
>> +     if (!shared) {
>> +             for (i = 0; i < NUM_DESCS / 2; i++)
>> +                     lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
>> +                             == 0);
>> +     }
>> +
>> +     /* Tx */
>> +     xsk->tx.ring = mmap(NULL,
>> +                      sizeof(struct xdp_ring) +
>> +                      NUM_DESCS * sizeof(struct xdp_desc),
>> +                      PROT_READ | PROT_WRITE,
>> +                      MAP_SHARED | MAP_POPULATE, sfd,
>> +                      XDP_PGOFF_TX_RING);
>> +     lassert(xsk->tx.ring != MAP_FAILED);
>> +
>> +     xsk->rx.mask = NUM_DESCS - 1;
>> +     xsk->rx.size = NUM_DESCS;
>> +
>> +     xsk->tx.mask = NUM_DESCS - 1;
>> +     xsk->tx.size = NUM_DESCS;
>> +
>> +     sxdp.sxdp_family = PF_XDP;
>> +     sxdp.sxdp_ifindex = opt_ifindex;
>> +     sxdp.sxdp_queue_id = opt_queue;
>> +     if (shared) {
>> +             sxdp.sxdp_flags = XDP_SHARED_UMEM;
>> +             sxdp.sxdp_shared_umem_fd = umem->fd;
>> +     }
>> +
>> +     lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
>> +
>> +     return xsk;
>> +}
>> +
>> +static void print_benchmark(bool running)
>> +{
>> +     const char *bench_str = "INVALID";
>> +
>> +     if (opt_bench == BENCH_RXDROP)
>> +             bench_str = "rxdrop";
>> +     else if (opt_bench == BENCH_TXONLY)
>> +             bench_str = "txonly";
>> +     else if (opt_bench == BENCH_L2FWD)
>> +             bench_str = "l2fwd";
>> +
>> +     printf("%s:%d %s ", opt_if, opt_queue, bench_str);
>> +     if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
>> +             printf("xdp-skb ");
>> +     else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
>> +             printf("xdp-drv ");
>> +     else
>> +             printf("        ");
>> +
>> +     if (opt_poll)
>> +             printf("poll() ");
>> +
>> +     if (running) {
>> +             printf("running...");
>> +             fflush(stdout);
>> +     }
>> +}
>> +
>> +static void dump_stats(void)
>> +{
>> +     unsigned long now = get_nsecs();
>> +     long dt = now - prev_time;
>> +     int i;
>> +
>> +     prev_time = now;
>> +
>> +     for (i = 0; i < num_socks; i++) {
>> +             char *fmt = "%-15s %'-11.0f %'-11lu\n";
>> +             double rx_pps, tx_pps;
>> +
>> +             rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
>> +                      1000000000. / dt;
>> +             tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
>> +                      1000000000. / dt;
>> +
>> +             printf("\n sock%d@", i);
>> +             print_benchmark(false);
>> +             printf("\n");
>> +
>> +             printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
>> +                    dt / 1000000000.);
>> +             printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
>> +             printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
>> +
>> +             xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
>> +             xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
>> +     }
>> +}
>> +
>> +static void *poller(void *arg)
>> +{
>> +     (void)arg;
>> +     for (;;) {
>> +             sleep(opt_interval);
>> +             dump_stats();
>> +     }
>> +
>> +     return NULL;
>> +}
>> +
>> +static void int_exit(int sig)
>> +{
>> +     (void)sig;
>> +     dump_stats();
>> +     bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
>> +     exit(EXIT_SUCCESS);
>> +}
>> +
>> +static struct option long_options[] = {
>> +     {"rxdrop", no_argument, 0, 'r'},
>> +     {"txonly", no_argument, 0, 't'},
>> +     {"l2fwd", no_argument, 0, 'l'},
>> +     {"interface", required_argument, 0, 'i'},
>> +     {"queue", required_argument, 0, 'q'},
>> +     {"poll", no_argument, 0, 'p'},
>> +     {"shared-buffer", no_argument, 0, 's'},
>> +     {"xdp-skb", no_argument, 0, 'S'},
>> +     {"xdp-native", no_argument, 0, 'N'},
>> +     {"interval", required_argument, 0, 'n'},
>> +     {0, 0, 0, 0}
>> +};
>> +
>> +static void usage(const char *prog)
>> +{
>> +     const char *str =
>> +             "  Usage: %s [OPTIONS]\n"
>> +             "  Options:\n"
>> +             "  -r, --rxdrop         Discard all incoming packets (default)\n"
>> +             "  -t, --txonly         Only send packets\n"
>> +             "  -l, --l2fwd          MAC swap L2 forwarding\n"
>> +             "  -i, --interface=n    Run on interface n\n"
>> +             "  -q, --queue=n        Use queue n (default 0)\n"
>> +             "  -p, --poll           Use poll syscall\n"
>> +             "  -s, --shared-buffer  Use shared packet buffer\n"
>> +             "  -S, --xdp-skb=n      Use XDP skb-mod\n"
>> +             "  -N, --xdp-native=n   Enfore XDP native mode\n"
>> +             "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
>> +             "\n";
>> +     fprintf(stderr, str, prog);
>> +     exit(EXIT_FAILURE);
>> +}
>> +
>> +static void parse_command_line(int argc, char **argv)
>> +{
>> +     int option_index, c;
>> +
>> +     opterr = 0;
>> +
>> +     for (;;) {
>> +             c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
>> +                             &option_index);
>> +             if (c == -1)
>> +                     break;
>> +
>> +             switch (c) {
>> +             case 'r':
>> +                     opt_bench = BENCH_RXDROP;
>> +                     break;
>> +             case 't':
>> +                     opt_bench = BENCH_TXONLY;
>> +                     break;
>> +             case 'l':
>> +                     opt_bench = BENCH_L2FWD;
>> +                     break;
>> +             case 'i':
>> +                     opt_if = optarg;
>> +                     break;
>> +             case 'q':
>> +                     opt_queue = atoi(optarg);
>> +                     break;
>> +             case 's':
>> +                     opt_shared_packet_buffer = 1;
>> +                     break;
>> +             case 'p':
>> +                     opt_poll = 1;
>> +                     break;
>> +             case 'S':
>> +                     opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
>> +                     break;
>> +             case 'N':
>> +                     opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
>> +                     break;
>> +             case 'n':
>> +                     opt_interval = atoi(optarg);
>> +                     break;
>> +             default:
>> +                     usage(basename(argv[0]));
>> +             }
>> +     }
>> +
>> +     opt_ifindex = if_nametoindex(opt_if);
>> +     if (!opt_ifindex) {
>> +             fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
>> +                     opt_if);
>> +             usage(basename(argv[0]));
>> +     }
>> +}
>> +
>> +static void kick_tx(int fd)
>> +{
>> +     int ret;
>> +
>> +     ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
>> +     if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
>> +             return;
>> +     lassert(0);
>> +}
>> +
>> +static inline void complete_tx_l2fwd(struct xdpsock *xsk)
>> +{
>> +     u32 descs[BATCH_SIZE];
>> +     unsigned int rcvd;
>> +     size_t ndescs;
>> +
>> +     if (!xsk->outstanding_tx)
>> +             return;
>> +
>> +     kick_tx(xsk->sfd);
>> +     ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
>> +              xsk->outstanding_tx;
>> +
>> +     /* re-add completed Tx buffers */
>> +     rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
>> +     if (rcvd > 0) {
>> +             umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
>> +             xsk->outstanding_tx -= rcvd;
>> +             xsk->tx_npkts += rcvd;
>> +     }
>> +}
>> +
>> +static inline void complete_tx_only(struct xdpsock *xsk)
>> +{
>> +     u32 descs[BATCH_SIZE];
>> +     unsigned int rcvd;
>> +
>> +     if (!xsk->outstanding_tx)
>> +             return;
>> +
>> +     kick_tx(xsk->sfd);
>> +
>> +     rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
>> +     if (rcvd > 0) {
>> +             xsk->outstanding_tx -= rcvd;
>> +             xsk->tx_npkts += rcvd;
>> +     }
>> +}
>> +
>> +static void rx_drop(struct xdpsock *xsk)
>> +{
>> +     struct xdp_desc descs[BATCH_SIZE];
>> +     unsigned int rcvd, i;
>> +
>> +     rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
>> +     if (!rcvd)
>> +             return;
>> +
>> +     for (i = 0; i < rcvd; i++) {
>> +             u32 idx = descs[i].idx;
>> +
>> +             lassert(idx < NUM_FRAMES);
>> +#if DEBUG_HEXDUMP
>> +             char *pkt;
>> +             char buf[32];
>> +
>> +             pkt = xq_get_data(xsk, idx, descs[i].offset);
>> +             sprintf(buf, "idx=%d", idx);
>> +             hex_dump(pkt, descs[i].len, buf);
>> +#endif
>> +     }
>> +
>> +     xsk->rx_npkts += rcvd;
>> +
>> +     umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
>> +}
>> +
>> +static void rx_drop_all(void)
>> +{
>> +     struct pollfd fds[MAX_SOCKS + 1];
>> +     int i, ret, timeout, nfds;
>> +
>> +     memset(fds, 0, sizeof(fds));
>> +
>> +     for (i = 0; i < num_socks; i++) {
>> +             fds[i].fd = xsks[i]->sfd;
>> +             fds[i].events = POLLIN;
>> +     }
>> +     nfds = num_socks; /* poll all sockets, not just the first */
>> +     timeout = 1000; /* 1 sec */
>> +
>> +     for (;;) {
>> +             if (opt_poll) {
>> +                     ret = poll(fds, nfds, timeout);
>> +                     if (ret <= 0)
>> +                             continue;
>> +             }
>> +
>> +             for (i = 0; i < num_socks; i++)
>> +                     rx_drop(xsks[i]);
>> +     }
>> +}
>> +
>> +static void tx_only(struct xdpsock *xsk)
>> +{
>> +     int timeout, ret, nfds = 1;
>> +     struct pollfd fds[nfds + 1];
>> +     unsigned int idx = 0;
>> +
>> +     memset(fds, 0, sizeof(fds));
>> +     fds[0].fd = xsk->sfd;
>> +     fds[0].events = POLLOUT;
>> +     timeout = 1000; /* 1 sec */
>> +
>> +     for (;;) {
>> +             if (opt_poll) {
>> +                     ret = poll(fds, nfds, timeout);
>> +                     if (ret <= 0)
>> +                             continue;
>> +
>> +                     if (fds[0].fd != xsk->sfd ||
>> +                         !(fds[0].revents & POLLOUT))
>> +                             continue;
>> +             }
>> +
>> +             if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
>> +                     lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
>> +
>> +                     xsk->outstanding_tx += BATCH_SIZE;
>> +                     idx += BATCH_SIZE;
>> +                     idx %= NUM_FRAMES;
>> +             }
>> +
>> +             complete_tx_only(xsk);
>> +     }
>> +}
>> +
>> +static void l2fwd(struct xdpsock *xsk)
>> +{
>> +     for (;;) {
>> +             struct xdp_desc descs[BATCH_SIZE];
>> +             unsigned int rcvd, i;
>> +             int ret;
>> +
>> +             for (;;) {
>> +                     complete_tx_l2fwd(xsk);
>> +
>> +                     rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
>> +                     if (rcvd > 0)
>> +                             break;
>> +             }
>> +
>> +             for (i = 0; i < rcvd; i++) {
>> +                     char *pkt = xq_get_data(xsk, descs[i].idx,
>> +                                             descs[i].offset);
>> +
>> +                     swap_mac_addresses(pkt);
>> +#if DEBUG_HEXDUMP
>> +                     char buf[32];
>> +                     u32 idx = descs[i].idx;
>> +
>> +                     sprintf(buf, "idx=%d", idx);
>> +                     hex_dump(pkt, descs[i].len, buf);
>> +#endif
>> +             }
>> +
>> +             xsk->rx_npkts += rcvd;
>> +
>> +             ret = xq_enq(&xsk->tx, descs, rcvd);
>> +             lassert(ret == 0);
>> +             xsk->outstanding_tx += rcvd;
>> +     }
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
>> +     char xdp_filename[256];
>> +     int i, ret, key = 0;
>> +     pthread_t pt;
>> +
>> +     parse_command_line(argc, argv);
>> +
>> +     if (setrlimit(RLIMIT_MEMLOCK, &r)) {
>> +             fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
>> +                     strerror(errno));
>> +             exit(EXIT_FAILURE);
>> +     }
>> +
>> +     snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
>> +
>> +     if (load_bpf_file(xdp_filename)) {
>> +             fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
>> +             exit(EXIT_FAILURE);
>> +     }
>> +
>> +     if (!prog_fd[0]) {
>> +             fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
>> +                     strerror(errno));
>> +             exit(EXIT_FAILURE);
>> +     }
>> +
>> +     if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
>> +             fprintf(stderr, "ERROR: link set xdp fd failed\n");
>> +             exit(EXIT_FAILURE);
>> +     }
>> +
>> +     ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
>> +     if (ret) {
>> +             fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
>> +             exit(EXIT_FAILURE);
>> +     }
>> +
>> +     /* Create sockets... */
>> +     xsks[num_socks++] = xsk_configure(NULL);
>> +
>> +#if RR_LB
>> +     for (i = 0; i < MAX_SOCKS - 1; i++)
>> +             xsks[num_socks++] = xsk_configure(xsks[0]->umem);
>> +#endif
>> +
>> +     /* ...and insert them into the map. */
>> +     for (i = 0; i < num_socks; i++) {
>> +             key = i;
>> +             ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
>> +             if (ret) {
>> +                     fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
>> +                     exit(EXIT_FAILURE);
>> +             }
>> +     }
>> +
>> +     signal(SIGINT, int_exit);
>> +     signal(SIGTERM, int_exit);
>> +     signal(SIGABRT, int_exit);
>> +
>> +     setlocale(LC_ALL, "");
>> +
>> +     ret = pthread_create(&pt, NULL, poller, NULL);
>> +     lassert(ret == 0);
>> +
>> +     prev_time = get_nsecs();
>> +
>> +     if (opt_bench == BENCH_RXDROP)
>> +             rx_drop_all();
>> +     else if (opt_bench == BENCH_TXONLY)
>> +             tx_only(xsks[0]);
>> +     else
>> +             l2fwd(xsks[0]);
>> +
>> +     return 0;
>> +}
>> --
>> 2.14.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-24  2:29 ` Jason Wang
@ 2018-04-24  8:44   ` Magnus Karlsson
  2018-04-24  9:10     ` Jason Wang
  0 siblings, 1 reply; 54+ messages in thread
From: Magnus Karlsson @ 2018-04-24  8:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

>> We have run some benchmarks on a dual socket system with two Broadwell
>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>> cores which gives a total of 28, but only two cores are used in these
>> experiments. One for TX/RX and one for the user space application. The
>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>> Intel I40E 40Gbit/s using the i40e driver.
>>
>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>> and 1500 byte packets, generated by commercial packet generator HW that is
>> generating packets at full 40 Gbit/s line rate.
>>
>> AF_XDP performance, 64 byte packets. Results from RFC V2 in parentheses.
>> Benchmark   XDP_SKB   XDP_DRV
>> rxdrop       2.9(3.0)   9.4(9.3)
>> txpush       2.5(2.2)   NA*
>> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
>
>
> This number does not look very exciting. I can get ~3 Mpps when using
> testpmd in a guest with xdp_redirect.sh on the host between ixgbe and
> TAP/vhost. I believe we can get even better performance without virt. It
> would be interesting to compare this performance with e.g. testpmd +
> virtio_user(vhost_kernel) + XDP.

Note that all the XDP_SKB numbers, plus the TX part of XDP_DRV for l2fwd,
use SKBs and the generic XDP path in the kernel. I am not surprised those
numbers are lower than what you are seeing with XDP_DRV support
(if that is what you are running; I am unsure about your setup). The
9.4 Mpps for RX is what you get with the XDP_DRV support and copies
out to user space. Or is it this number you think is low? Zerocopy will be added
in later patch sets.

With that said, both XDP_SKB and XDP_DRV can be optimized. We
have not spent that much time on optimizations at this point.

>
>>
>> AF_XDP performance, 1500 byte packets:
>> Benchmark   XDP_SKB   XDP_DRV
>> rxdrop       2.1(2.2)   3.3(3.1)
>> l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
>>
>> * NA since we have no support for TX using the XDP_DRV infrastructure
>>    in this RFC. This is for a future patch set since it involves
>>    changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>>    Dangaard Brouer.
>>
>> XDP performance on our system as a base line:
>>
>> 64 byte packets:
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      16      32,921,521  0
>>
>> 1500 byte packets:
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      16      3,289,491   0
>>
>> Changes from RFC V2:
>>
>> * Optimizations and simplifications to the ring structures inspired by
>>    ptr_ring.h
>> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>>    consistent with AF_PACKET
>> * Support for only having an RX queue or a TX queue defined
>> * Some bug fixes and code cleanup
>>
>> The structure of the patch set is as follows:
>>
>> Patches 1-2: Basic socket and umem plumbing
>> Patches 3-10: RX support together with the new XSKMAP
>> Patches 11-14: TX support
>> Patch 15: Sample application
>>
>> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
>> Clean up btf.h in uapi")
>>
>> Questions:
>>
>> * How to deal with cache alignment for uapi when different
>>    architectures can have different cache line sizes? We have just
>>    aligned it to 64 bytes for now, which works for many popular
>>    architectures, but not all. Please advise.
>>
>> To do:
>>
>> * Optimize performance
>>
>> * Kernel selftest
>>
>> Post-series plan:
>>
>> * Loadable kernel module support for AF_XDP would be nice. It is unclear
>>    how to achieve this though, since our XDP code depends on net/core.
>>
>> * Support for AF_XDP sockets without an XDP program loaded. In this
>>    case all the traffic on a queue should go up to the user space socket.
>
>
> I think we probably need this in the case of TUN XDP for virt guest too.

Yes.

Thanks: Magnus

> Thanks
>
>
>>
>> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>>    XDP_PASS" for a tcpdump-like functionality.
>>
>> * And of course getting to zero-copy support in small increments.
>>
>> Thanks: Björn and Magnus
>>
>> Björn Töpel (8):
>>    net: initial AF_XDP skeleton
>>    xsk: add user memory registration support sockopt
>>    xsk: add Rx queue setup and mmap support
>>    xdp: introduce xdp_return_buff API
>>    xsk: add Rx receive functions and poll support
>>    bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>>    xsk: wire up XDP_DRV side of AF_XDP
>>    xsk: wire up XDP_SKB side of AF_XDP
>>
>> Magnus Karlsson (7):
>>    xsk: add umem fill queue support and mmap
>>    xsk: add support for bind for Rx
>>    xsk: add umem completion queue support and mmap
>>    xsk: add Tx queue setup and mmap support
>>    xsk: support for Tx
>>    xsk: statistics support
>>    samples/bpf: sample application for AF_XDP sockets
>>
>>   MAINTAINERS                         |   8 +
>>   include/linux/bpf.h                 |  26 +
>>   include/linux/bpf_types.h           |   3 +
>>   include/linux/filter.h              |   2 +-
>>   include/linux/socket.h              |   5 +-
>>   include/net/xdp.h                   |   1 +
>>   include/net/xdp_sock.h              |  46 ++
>>   include/uapi/linux/bpf.h            |   1 +
>>   include/uapi/linux/if_xdp.h         |  87 ++++
>>   kernel/bpf/Makefile                 |   3 +
>>   kernel/bpf/verifier.c               |   8 +-
>>   kernel/bpf/xskmap.c                 | 286 +++++++++++
>>   net/Kconfig                         |   1 +
>>   net/Makefile                        |   1 +
>>   net/core/dev.c                      |  34 +-
>>   net/core/filter.c                   |  40 +-
>>   net/core/sock.c                     |  12 +-
>>   net/core/xdp.c                      |  15 +-
>>   net/xdp/Kconfig                     |   7 +
>>   net/xdp/Makefile                    |   2 +
>>   net/xdp/xdp_umem.c                  | 256 ++++++++++
>>   net/xdp/xdp_umem.h                  |  65 +++
>>   net/xdp/xdp_umem_props.h            |  23 +
>>   net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>>   net/xdp/xsk_queue.c                 |  73 +++
>>   net/xdp/xsk_queue.h                 | 245 ++++++++++
>>   samples/bpf/Makefile                |   4 +
>>   samples/bpf/xdpsock.h               |  11 +
>>   samples/bpf/xdpsock_kern.c          |  56 +++
>>   samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>>   security/selinux/hooks.c            |   4 +-
>>   security/selinux/include/classmap.h |   4 +-
>>   32 files changed, 2945 insertions(+), 35 deletions(-)
>>   create mode 100644 include/net/xdp_sock.h
>>   create mode 100644 include/uapi/linux/if_xdp.h
>>   create mode 100644 kernel/bpf/xskmap.c
>>   create mode 100644 net/xdp/Kconfig
>>   create mode 100644 net/xdp/Makefile
>>   create mode 100644 net/xdp/xdp_umem.c
>>   create mode 100644 net/xdp/xdp_umem.h
>>   create mode 100644 net/xdp/xdp_umem_props.h
>>   create mode 100644 net/xdp/xsk.c
>>   create mode 100644 net/xdp/xsk_queue.c
>>   create mode 100644 net/xdp/xsk_queue.h
>>   create mode 100644 samples/bpf/xdpsock.h
>>   create mode 100644 samples/bpf/xdpsock_kern.c
>>   create mode 100644 samples/bpf/xdpsock_user.c
>>
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-24  8:44   ` Magnus Karlsson
@ 2018-04-24  9:10     ` Jason Wang
  2018-04-24  9:14       ` Magnus Karlsson
  0 siblings, 1 reply; 54+ messages in thread
From: Jason Wang @ 2018-04-24  9:10 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z



On 2018-04-24 16:44, Magnus Karlsson wrote:
>>> We have run some benchmarks on a dual socket system with two Broadwell
>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>> cores which gives a total of 28, but only two cores are used in these
>>> experiments. One for TX/RX and one for the user space application. The
>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>> Intel I40E 40Gbit/s using the i40e driver.
>>>
>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>> and 1500 byte packets, generated by commercial packet generator HW that is
>>> generating packets at full 40 Gbit/s line rate.
>>>
>>> AF_XDP performance, 64 byte packets. Results from RFC V2 in parentheses.
>>> Benchmark   XDP_SKB   XDP_DRV
>>> rxdrop       2.9(3.0)   9.4(9.3)
>>> txpush       2.5(2.2)   NA*
>>> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
>> This number does not look very exciting. I can get ~3 Mpps when using
>> testpmd in a guest with xdp_redirect.sh on the host between ixgbe and
>> TAP/vhost. I believe we can get even better performance without virt. It
>> would be interesting to compare this performance with e.g. testpmd +
>> virtio_user(vhost_kernel) + XDP.
> Note that all the XDP_SKB numbers, plus the TX part of XDP_DRV for l2fwd,
> use SKBs and the generic XDP path in the kernel. I am not surprised those
> numbers are lower than what you are seeing with XDP_DRV support
> (if that is what you are running; I am unsure about your setup).

Yes, I'm using a Haswell E5-2630 v3 @ 2.40GHz and ixgbe.

>   The
> 9.4 Mpps for RX is what you get with the XDP_DRV support and copies
> out to user space. Or is it this number you think is low?

No, rxdrop looks OK. I meant l2fwd only.

>   Zerocopy will be added
> in later patch sets.
>
> With that said, both XDP_SKB and XDP_DRV can be optimized. We
> have not spent that much time on optimizations at this point.
>

Yes, and it would be interesting to compare the performance numbers between
AF_XDP and TAP XDP + vhost_net, since their functionality is almost equivalent.

Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-24  9:10     ` Jason Wang
@ 2018-04-24  9:14       ` Magnus Karlsson
  0 siblings, 0 replies; 54+ messages in thread
From: Magnus Karlsson @ 2018-04-24  9:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Tue, Apr 24, 2018 at 11:10 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
On 2018-04-24 16:44, Magnus Karlsson wrote:
>>>>
>>>> We have run some benchmarks on a dual socket system with two Broadwell
>>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>>> cores which gives a total of 28, but only two cores are used in these
>>>> experiments. One for TX/RX and one for the user space application. The
>>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>>> Intel I40E 40Gbit/s using the i40e driver.
>>>>
>>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>>> and 1500 byte packets, generated by commercial packet generator HW that
>>>> is
>>>> generating packets at full 40 Gbit/s line rate.
>>>>
>>>> AF_XDP performance, 64 byte packets. Results from RFC V2 in parentheses.
>>>> Benchmark   XDP_SKB   XDP_DRV
>>>> rxdrop       2.9(3.0)   9.4(9.3)
>>>> txpush       2.5(2.2)   NA*
>>>> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
>>>
>>> This number does not look very exciting. I can get ~3 Mpps when using
>>> testpmd in a guest with xdp_redirect.sh on the host between ixgbe and
>>> TAP/vhost. I believe we can get even better performance without virt.
>>> It would be interesting to compare this performance with e.g. testpmd +
>>> virtio_user(vhost_kernel) + XDP.
>>
>> Note that all the XDP_SKB numbers, plus the TX part of XDP_DRV for l2fwd,
>> use SKBs and the generic XDP path in the kernel. I am not surprised those
>> numbers are lower than what you are seeing with XDP_DRV support
>> (if that is what you are running; I am unsure about your setup).
>
>
> Yes, I'm using a Haswell E5-2630 v3 @ 2.40GHz and ixgbe.
>
>>   The
>> 9.4 Mpps for RX is what you get with the XDP_DRV support and copies
>> out to user space. Or is it this number you think is low?
>
>
> No, rxdrop looks OK. I meant l2fwd only.

OK, sounds good. l2fwd will get much better once we add XDP_DRV support for TX.

Thanks: Magnus

>>   Zerocopy will be added
>> in later patch sets.
>>
>> With that said, both XDP_SKB and XDP_DRV can be optimized. We
>> have not spent that much time on optimizations at this point.
>>
>
> Yes, and it would be interesting to compare the performance numbers between
> AF_XDP and TAP XDP + vhost_net, since their functionality is almost equivalent.
>
> Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
  2018-04-23 13:56 ` [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt Björn Töpel
  2018-04-23 16:18   ` Michael S. Tsirkin
  2018-04-23 23:04   ` Willem de Bruijn
@ 2018-04-24 14:27   ` kbuild test robot
  2 siblings, 0 replies; 54+ messages in thread
From: kbuild test robot @ 2018-04-24 14:27 UTC (permalink / raw)
  To: Björn Töpel
  Cc: kbuild-all, bjorn.topel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev,
	Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

Hi Björn,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introducing-AF_XDP-support/20180424-085240
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   net/xdp/xdp_umem.o: In function `xdp_umem_reg':
>> xdp_umem.c:(.text+0x200): undefined reference to `__udivdi3'
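
(Note: __udivdi3 is gcc's libgcc helper for 64-bit division, so this
usually means a plain u64 division on a 32-bit target such as m68k,
where the kernel does not link libgcc. The usual fix is div_u64() from
<linux/math64.h>; a hypothetical sketch, assuming xdp_umem_reg()
computes something like a frame count:

        /* before: nframes = size / frame_size;  (breaks 32-bit builds) */
        nframes = (u32)div_u64(size, frame_size);
)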

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45403 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-24  8:08       ` Magnus Karlsson
@ 2018-04-24 16:55         ` Willem de Bruijn
  0 siblings, 0 replies; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 16:55 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Michael S. Tsirkin, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

>>>> +/* Pgoff for mmaping the rings */
>>>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>>>> +
>>>> +struct xdp_ring {
>>>> +     __u32 producer __attribute__((aligned(64)));
>>>> +     __u32 consumer __attribute__((aligned(64)));
>>>> +};
>>>
>>> Why 64? And do you still need these guys in uapi?
>>
>> I was just about to ask the same. You mean cacheline_aligned?
>
> Yes, I would like to have these cache aligned. How can I accomplish
> this in a uapi?

Good point. This seems fine to me.
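
One way to at least make the choice explicit in the uapi header,
sketched below (the macro name is made up): since this is ABI shared
between kernel and user space, the value cannot follow the kernel's
per-arch L1_CACHE_BYTES anyway, so naming it as a fixed layout
constant documents that the 64 is deliberate:

/* Fixed ABI alignment for the ring pointers. Matches common 64-byte
 * cache lines, but deliberately not tied to the build-time cache line
 * size, since both sides must agree on the memory layout.
 */
#define XDP_RING_PTR_ALIGN 64

struct xdp_ring {
        __u32 producer __attribute__((aligned(XDP_RING_PTR_ALIGN)));
        __u32 consumer __attribute__((aligned(XDP_RING_PTR_ALIGN)));
};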

> I put a note around this in the cover letter:
>
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
>
>>
>>>> +static int xsk_mmap(struct file *file, struct socket *sock,
>>>> +                 struct vm_area_struct *vma)
>>>> +{
>>>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>>>> +     unsigned long size = vma->vm_end - vma->vm_start;
>>>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>>>> +     struct xsk_queue *q;
>>>> +     unsigned long pfn;
>>>> +     struct page *qpg;
>>>> +
>>>> +     if (!xs->umem)
>>>> +             return -EINVAL;
>>>> +
>>>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>>>> +             q = xs->umem->fq;
>>>> +     else
>>>> +             return -EINVAL;
>>>> +
>>>> +     qpg = virt_to_head_page(q->ring);
>>
>> Is it assured that q is initialized with a call to setsockopt
>> XDP_UMEM_FILL_RING before the call the mmap?
>
> Unfortunately not, so this is a bug. Case in point for running
> syzkaller below, definitely.
>
>> In general, with such an extensive new API, it might be worthwhile to
>> run syzkaller locally on a kernel with these patches. It is pretty
>> easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
>> though it also needs to be taught about any new APIs.
>
> Good idea. Will set this up and have it torture the API.
>
> Thanks: Magnus

Great, thanks. I forgot to mention how to encode the new APIs for syzkaller:

https://github.com/google/syzkaller/blob/master/docs/syscall_descriptions.md

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 05/15] xsk: add support for bind for Rx
  2018-04-23 13:56 ` [PATCH bpf-next 05/15] xsk: add support for bind for Rx Björn Töpel
@ 2018-04-24 16:55   ` Willem de Bruijn
  2018-04-24 18:43     ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 16:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, the bind syscall is added. Binding an AF_XDP socket means
> associating the socket with an umem, a netdev and a queue index. This
> can be done in two ways.
>
> The first way, creating a "socket from scratch". Create the umem using
> the XDP_UMEM_REG setsockopt and an associated fill queue with
> XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
> setsockopt. Call bind passing ifindex and queue index ("channel" in
> ethtool speak).
>
> The second way to bind a socket is to simply skip the
> umem/netdev/queue index and pass another, already set up, AF_XDP
> socket. The new socket will then have the same umem/netdev/queue index
> as the parent so it will share the same umem. You must also set the
> flags field in the socket address to XDP_SHARED_UMEM.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---

> +static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
> +{
> +       struct socket *sock;
> +
> +       *err = -ENOTSOCK;
> +       sock = sockfd_lookup(fd, err);
> +       if (!sock)
> +               return NULL;
> +
> +       if (sock->sk->sk_family != PF_XDP) {
> +               *err = -ENOPROTOOPT;
> +               sockfd_put(sock);
> +               return NULL;
> +       }
> +
> +       *err = 0;
> +       return sock;
> +}

In this and similar cases, can use ERR_PTR to avoid the extra argument.
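
Something along these lines (untested sketch):

static struct socket *xsk_lookup_xsk_from_fd(int fd)
{
        struct socket *sock;
        int err;

        sock = sockfd_lookup(fd, &err);
        if (!sock)
                return ERR_PTR(-ENOTSOCK);

        if (sock->sk->sk_family != PF_XDP) {
                sockfd_put(sock);
                return ERR_PTR(-ENOPROTOOPT);
        }

        return sock;
}

with callers checking IS_ERR(sock) and returning PTR_ERR(sock).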

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-04-23 13:56 ` [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
@ 2018-04-24 16:56   ` Willem de Bruijn
  2018-04-24 18:58     ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 16:56 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The xskmap is yet another BPF map, very much inspired by
> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
>
> Note that a socket that is bound to a certain ifindex/queue index will
> *only* accept XDP frames from that netdev/queue index. If an XDP
> program tries to redirect from a netdev/queue index other than what
> the socket is bound to, the frame will not be received on the socket.
>
> A socket can reside in multiple maps.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

> +struct xsk_map_entry {
> +       struct xdp_sock *xs;
> +       struct rcu_head rcu;
> +};

> +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
> +{
> +       struct xsk_map *m = container_of(map, struct xsk_map, map);
> +       struct xsk_map_entry *entry;
> +
> +       if (key >= map->max_entries)
> +               return NULL;
> +
> +       entry = READ_ONCE(m->xsk_map[key]);
> +       return entry ? entry->xs : NULL;
> +}

This dynamically allocated structure adds an extra cacheline lookup. If
xdp_sock gets an rcu_head, it can be linked into the map directly.
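
I.e., roughly this (sketch, assuming xdp_sock grows an rcu_head and
the map array stores struct xdp_sock pointers directly):

struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
{
        struct xsk_map *m = container_of(map, struct xsk_map, map);

        if (key >= map->max_entries)
                return NULL;

        /* m->xsk_map would now be a struct xdp_sock *[] */
        return READ_ONCE(m->xsk_map[key]);
}

with deletion doing call_rcu() on the rcu_head embedded in the socket
rather than on a separately allocated entry.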

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support
  2018-04-23 13:56 ` [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support Björn Töpel
@ 2018-04-24 16:56   ` Willem de Bruijn
  2018-04-24 18:32     ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 16:56 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> Here the actual receive functions of AF_XDP are implemented. They will,
> in a later commit, be called from the XDP layers.
>
> There's one set of functions for the XDP_DRV side and another for
> XDP_SKB (generic).
>
> Support for the poll syscall is also implemented.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---

> +/* Common functions operating for both RXTX and umem queues */
> +
> +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
> +{
> +       u32 entries = q->prod_tail - q->cons_tail;
> +
> +       if (entries == 0) {
> +               /* Refresh the local pointer */
> +               q->prod_tail = READ_ONCE(q->ring->producer);
> +       }
> +
> +       entries = q->prod_tail - q->cons_tail;

Probably meant to be inside the branch? Though I see the same
pattern in the userspace example program.
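
I.e., presumably:

        u32 entries = q->prod_tail - q->cons_tail;

        if (entries == 0) {
                /* Refresh the local pointer */
                q->prod_tail = READ_ONCE(q->ring->producer);
                entries = q->prod_tail - q->cons_tail;
        }

which also avoids recomputing entries on the fast path.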

> +static inline u32 *xskq_validate_id(struct xsk_queue *q)
> +{
> +       while (q->cons_tail != q->cons_head) {
> +               struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
> +               unsigned int idx = q->cons_tail & q->ring_mask;
> +
> +               if (xskq_is_valid_id(q, ring->desc[idx]))
> +                       return &ring->desc[idx];

Missing a q->cons_tail increment in this loop?
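
I.e., presumably the intent was to skip past invalid ids, something
like:

        while (q->cons_tail != q->cons_head) {
                unsigned int idx = q->cons_tail & q->ring_mask;

                if (xskq_is_valid_id(q, ring->desc[idx]))
                        return &ring->desc[idx];

                q->cons_tail++; /* skip the invalid id, try the next slot */
        }

As written, an invalid id at the tail would spin forever.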

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH bpf-next 13/15] xsk: support for Tx
  2018-04-23 13:56 ` [PATCH bpf-next 13/15] xsk: support for Tx Björn Töpel
@ 2018-04-24 16:57   ` Willem de Bruijn
  2018-04-25  9:11     ` Magnus Karlsson
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 16:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Here, Tx support is added. The user fills the Tx queue with frames to
> be sent by the kernel, and lets the kernel know using the sendmsg
> syscall.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

> +static int xsk_xmit_skb(struct sk_buff *skb)

This is basically packet_direct_xmit. Might be better to just move that
to net/core/dev.c and use in both AF_PACKET and AF_XDP.

Also, AF_XDP may (eventually) want to support the regular path
through dev_queue_xmit, so that traffic shaping applies.

> +{
> +       struct net_device *dev = skb->dev;
> +       struct sk_buff *orig_skb = skb;
> +       struct netdev_queue *txq;
> +       int ret = NETDEV_TX_BUSY;
> +       bool again = false;
> +
> +       if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
> +               goto drop;
> +
> +       skb = validate_xmit_skb_list(skb, dev, &again);
> +       if (skb != orig_skb)
> +               return NET_XMIT_DROP;

Need to free generated segment list on error, see packet_direct_xmit.
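
As in packet_direct_xmit, the error path can free the whole list; a
sketch of that shape:

	skb = validate_xmit_skb_list(skb, dev, &again);
	if (skb != orig_skb)
		goto drop;
	/* ... */
drop:
	atomic_long_inc(&dev->tx_dropped);
	/* Frees any segment list validate_xmit_skb_list() generated. */
	kfree_skb_list(skb);
	return NET_XMIT_DROP;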

> +
> +       txq = skb_get_tx_queue(dev, skb);
> +
> +       local_bh_disable();
> +
> +       HARD_TX_LOCK(dev, txq, smp_processor_id());
> +       if (!netif_xmit_frozen_or_drv_stopped(txq))
> +               ret = netdev_start_xmit(skb, dev, txq, false);
> +       HARD_TX_UNLOCK(dev, txq);
> +
> +       local_bh_enable();
> +
> +       if (!dev_xmit_complete(ret))
> +               goto out_err;
> +
> +       return ret;
> +drop:
> +       atomic_long_inc(&dev->tx_dropped);
> +out_err:
> +       return NET_XMIT_DROP;
> +}

> +static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
> +                           size_t total_len)
> +{
> +       bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
> +       u32 max_batch = TX_BATCH_SIZE;
> +       struct xdp_sock *xs = xdp_sk(sk);
> +       bool sent_frame = false;
> +       struct xdp_desc desc;
> +       struct sk_buff *skb;
> +       int err = 0;
> +
> +       if (unlikely(!xs->tx))
> +               return -ENOBUFS;
> +       if (need_wait)
> +               return -EOPNOTSUPP;
> +
> +       mutex_lock(&xs->mutex);
> +
> +       while (xskq_peek_desc(xs->tx, &desc)) {

It is possible to pass a chain of skbs to validate_xmit_skb_list and
eventually pass this chain to xsk_xmit_skb, amortizing the cost of
taking the txq lock. Fine to ignore for this patch set.

> +               char *buffer;
> +               u32 id, len;
> +
> +               if (max_batch-- == 0) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               if (xskq_reserve_id(xs->umem->cq)) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               len = desc.len;
> +               if (unlikely(len > xs->dev->mtu)) {
> +                       err = -EMSGSIZE;
> +                       goto out;
> +               }
> +
> +               skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
> +               if (unlikely(!skb)) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               skb_put(skb, len);
> +               id = desc.idx;
> +               buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
> +               err = skb_store_bits(skb, 0, buffer, len);
> +               if (unlikely(err))
> +                       goto out_store;

As xsk_destruct_skb delays notification until consume_skb is called, this
copy can be avoided by linking the xdp buffer into the skb frags array,
analogous to tpacket_snd.

You probably don't care much about the copy slow path, and this can be
implemented later, so there is no need to do it in this patch set.
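
For reference, the tpacket_snd-style alternative links the umem page
into the skb instead of copying. A rough sketch (page refcounting and
the interaction with skb sizing are simplified assumptions here):

	struct page *page = virt_to_page(buffer);

	get_page(page);
	/* Attach the umem data as frag 0 instead of skb_store_bits(). */
	skb_fill_page_desc(skb, 0, page, offset_in_page(buffer), len);
	skb->len += len;
	skb->data_len += len;
	skb->truesize += len;
	/* Safe because xsk_destruct_skb() only signals completion once
	 * consume_skb() runs, so the frame stays owned by the kernel
	 * until then.
	 */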

static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
+                                             struct xdp_desc *desc)
+{
+       struct xdp_rxtx_ring *ring;
+
+       if (q->cons_tail == q->cons_head) {
+               WRITE_ONCE(q->ring->consumer, q->cons_tail);
+               q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+               /* Order consumer and data */
+               smp_rmb();
+
+               return xskq_validate_desc(q, desc);
+       }
+
+       ring = (struct xdp_rxtx_ring *)q->ring;
+       *desc = ring->desc[q->cons_tail & q->ring_mask];
+       return desc;

This only validates descriptors if taking the branch.

* Re: [PATCH bpf-next 14/15] xsk: statistics support
  2018-04-23 13:56 ` [PATCH bpf-next 14/15] xsk: statistics support Björn Töpel
@ 2018-04-24 16:58   ` Willem de Bruijn
  2018-04-25 10:50     ` Magnus Karlsson
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 16:58 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> In this commit, a new getsockopt is added: XDP_STATISTICS. This is
> used to obtain stats from the sockets.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

> +static int xsk_getsockopt(struct socket *sock, int level, int optname,
> +                         char __user *optval, int __user *optlen)
> +{
> +       struct sock *sk = sock->sk;
> +       struct xdp_sock *xs = xdp_sk(sk);
> +       int len;
> +
> +       if (level != SOL_XDP)
> +               return -ENOPROTOOPT;
> +
> +       if (get_user(len, optlen))
> +               return -EFAULT;
> +       if (len < 0)
> +               return -EINVAL;
> +
> +       switch (optname) {
> +       case XDP_STATISTICS:
> +       {
> +               struct xdp_statistics stats;
> +
> +               if (len != sizeof(stats))
> +                       return -EINVAL;
> +
> +               mutex_lock(&xs->mutex);
> +               stats.rx_dropped = xs->rx_dropped;
> +               stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
> +               stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
> +               mutex_unlock(&xs->mutex);
> +
> +               if (copy_to_user(optval, &stats, sizeof(stats)))
> +                       return -EFAULT;
> +               return 0;

For forward compatibility, could allow caller to pass a struct larger
than stats and return the number of bytes filled in.
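
In code, the relaxed check might look like this (a sketch of the
pattern only; names are from the quoted patch):

	case XDP_STATISTICS:
	{
		struct xdp_statistics stats;

		if (len < sizeof(stats))
			return -EINVAL;

		/* ... fill in stats under xs->mutex as in the patch ... */

		len = sizeof(stats);
		if (copy_to_user(optval, &stats, len))
			return -EFAULT;
		/* Report how many bytes were actually filled in. */
		if (put_user(len, optlen))
			return -EFAULT;
		return 0;
	}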

The lock can also be elided with something like gnet_stats, but it is probably
taken rarely enough that that is not worth the effort, at least right now.

* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
  2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
                   ` (16 preceding siblings ...)
  2018-04-24  2:29 ` Jason Wang
@ 2018-04-24 17:03 ` Willem de Bruijn
  17 siblings, 0 replies; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-24 17:03 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics.

Overall, this looks really nice!

> In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path.

Please remove references to RFC when resending to bpf-next.

> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of a
> number of equally sized frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device,

The setup involves a lot of system calls. You may want to require the
caller to issue these in a well-defined order, and the same for destruction.

Arbitrary order leads to a state explosion in paths through the code.

With AF_PACKET we've had to fix quite a few bugs due to unexpected
states of the socket, e.g., on teardown, and it is too late now to restrict
the number of states.
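
One way to avoid that is to make bind() the point of no return and
refuse it unless everything it depends on already exists; a
hypothetical guard along those lines:

static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
{
	struct xdp_sock *xs = xdp_sk(sock->sk);

	/* Hypothetical ordering check: require a registered umem and
	 * at least one of the Rx/Tx rings before binding, so teardown
	 * never sees a half-constructed socket.
	 */
	if (!xs->umem || (!xs->rx && !xs->tx))
		return -EINVAL;

	/* ... associate with the netdev/queue id as in the patch ... */
	return 0;
}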

* Re: [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support
  2018-04-24 16:56   ` Willem de Bruijn
@ 2018-04-24 18:32     ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-24 18:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

2018-04-24 18:56 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> Here the actual receive functions of AF_XDP are implemented. In a
>> later commit, they will be called from the XDP layers.
>>
>> There's one set of functions for the XDP_DRV side and another for
>> XDP_SKB (generic).
>>
>> Support for the poll syscall is also implemented.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>
>> +/* Common functions operating for both RXTX and umem queues */
>> +
>> +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
>> +{
>> +       u32 entries = q->prod_tail - q->cons_tail;
>> +
>> +       if (entries == 0) {
>> +               /* Refresh the local pointer */
>> +               q->prod_tail = READ_ONCE(q->ring->producer);
>> +       }
>> +
>> +       entries = q->prod_tail - q->cons_tail;
>
> Probably meant to be inside the branch? Though I see the same
> pattern in the userspace example program.
>

Yes! Nasty C&P going on here... :-(

>> +static inline u32 *xskq_validate_id(struct xsk_queue *q)
>> +{
>> +       while (q->cons_tail != q->cons_head) {
>> +               struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>> +               unsigned int idx = q->cons_tail & q->ring_mask;
>> +
>> +               if (xskq_is_valid_id(q, ring->desc[idx]))
>> +                       return &ring->desc[idx];
>
> Missing a q->cons_tail increment in this loop?

Indeed! Good catch! Thanks!


Björn

* Re: [PATCH bpf-next 05/15] xsk: add support for bind for Rx
  2018-04-24 16:55   ` Willem de Bruijn
@ 2018-04-24 18:43     ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-24 18:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

2018-04-24 18:55 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, the bind syscall is added. Binding an AF_XDP socket means
>> associating the socket with an umem, a netdev and a queue index. This
>> can be done in two ways.
>>
>> The first way is creating a "socket from scratch". Create the umem using
>> the XDP_UMEM_REG setsockopt and an associated fill queue with
>> XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
>> setsockopt. Call bind passing ifindex and queue index ("channel" in
>> ethtool speak).
>>
>> The second way to bind a socket is to simply skip the
>> umem/netdev/queue index and pass another, already set up, AF_XDP
>> socket. The new socket will then have the same umem/netdev/queue index
>> as the parent so it will share the same umem. You must also set the
>> flags field in the socket address to XDP_SHARED_UMEM.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>
>> +static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
>> +{
>> +       struct socket *sock;
>> +
>> +       *err = -ENOTSOCK;
>> +       sock = sockfd_lookup(fd, err);
>> +       if (!sock)
>> +               return NULL;
>> +
>> +       if (sock->sk->sk_family != PF_XDP) {
>> +               *err = -ENOPROTOOPT;
>> +               sockfd_put(sock);
>> +               return NULL;
>> +       }
>> +
>> +       *err = 0;
>> +       return sock;
>> +}
>
> In this and similar cases, can use ERR_PTR to avoid the extra argument.
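
In code, the ERR_PTR variant might read (a sketch, not the merged
version):

static struct socket *xsk_lookup_xsk_from_fd(int fd)
{
	struct socket *sock;
	int err;

	sock = sockfd_lookup(fd, &err);
	if (!sock)
		return ERR_PTR(err);

	if (sock->sk->sk_family != PF_XDP) {
		sockfd_put(sock);
		return ERR_PTR(-ENOPROTOOPT);
	}

	return sock;
}

A caller then does sock = xsk_lookup_xsk_from_fd(fd); followed by
if (IS_ERR(sock)) return PTR_ERR(sock);, dropping the extra
out-parameter.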

Noted. Thanks!

* Re: [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-04-24 16:56   ` Willem de Bruijn
@ 2018-04-24 18:58     ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-24 18:58 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

2018-04-24 18:56 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The xskmap is yet another BPF map, very much inspired by
>> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
>> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
>> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
>>
>> Note that a socket that is bound to a certain ifindex/queue index will
>> *only* accept XDP frames from that netdev/queue index. If an XDP
>> program tries to redirect from a netdev/queue index other than what
>> the socket is bound to, the frame will not be received on the socket.
>>
>> A socket can reside in multiple maps.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>
>> +struct xsk_map_entry {
>> +       struct xdp_sock *xs;
>> +       struct rcu_head rcu;
>> +};
>
>> +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
>> +{
>> +       struct xsk_map *m = container_of(map, struct xsk_map, map);
>> +       struct xsk_map_entry *entry;
>> +
>> +       if (key >= map->max_entries)
>> +               return NULL;
>> +
>> +       entry = READ_ONCE(m->xsk_map[key]);
>> +       return entry ? entry->xs : NULL;
>> +}
>
> This dynamically allocated structure adds an extra cacheline lookup. If
> xdp_sock gets an rcu_head, it can be linked into the map directly.

Nice one! I'll try this out!

* Re: [PATCH bpf-next 13/15] xsk: support for Tx
  2018-04-24 16:57   ` Willem de Bruijn
@ 2018-04-25  9:11     ` Magnus Karlsson
  2018-04-25 19:00       ` Willem de Bruijn
  0 siblings, 1 reply; 54+ messages in thread
From: Magnus Karlsson @ 2018-04-25  9:11 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Tue, Apr 24, 2018 at 6:57 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, Tx support is added. The user fills the Tx queue with frames to
>> be sent by the kernel, and lets the kernel know using the sendmsg
>> syscall.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
>> +static int xsk_xmit_skb(struct sk_buff *skb)
>
> This is basically packet_direct_xmit. Might be better to just move that
> to net/core/dev.c and use in both AF_PACKET and AF_XDP.

It is packet_direct_xmit with some unused code removed :-),
so your suggestion makes a lot of sense. Will implement in this patch
set.

> Also, (eventually) AF_XDP may also want to support the regular path
> through dev_queue_xmit to go through traffic shaping.

Agreed. Will put this on the todo list for a later patch.

>> +{
>> +       struct net_device *dev = skb->dev;
>> +       struct sk_buff *orig_skb = skb;
>> +       struct netdev_queue *txq;
>> +       int ret = NETDEV_TX_BUSY;
>> +       bool again = false;
>> +
>> +       if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
>> +               goto drop;
>> +
>> +       skb = validate_xmit_skb_list(skb, dev, &again);
>> +       if (skb != orig_skb)
>> +               return NET_XMIT_DROP;
>
> Need to free generated segment list on error, see packet_direct_xmit.

I do not use segments in the TX code for reasons of simplicity and the
free is in the calling function. But as I will create a common
packet_direct_xmit according to your suggestion, it will have a
kfree_skb_list() there as in af_packet.c.

>> +
>> +       txq = skb_get_tx_queue(dev, skb);
>> +
>> +       local_bh_disable();
>> +
>> +       HARD_TX_LOCK(dev, txq, smp_processor_id());
>> +       if (!netif_xmit_frozen_or_drv_stopped(txq))
>> +               ret = netdev_start_xmit(skb, dev, txq, false);
>> +       HARD_TX_UNLOCK(dev, txq);
>> +
>> +       local_bh_enable();
>> +
>> +       if (!dev_xmit_complete(ret))
>> +               goto out_err;
>> +
>> +       return ret;
>> +drop:
>> +       atomic_long_inc(&dev->tx_dropped);
>> +out_err:
>> +       return NET_XMIT_DROP;
>> +}
>
>> +static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
>> +                           size_t total_len)
>> +{
>> +       bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
>> +       u32 max_batch = TX_BATCH_SIZE;
>> +       struct xdp_sock *xs = xdp_sk(sk);
>> +       bool sent_frame = false;
>> +       struct xdp_desc desc;
>> +       struct sk_buff *skb;
>> +       int err = 0;
>> +
>> +       if (unlikely(!xs->tx))
>> +               return -ENOBUFS;
>> +       if (need_wait)
>> +               return -EOPNOTSUPP;
>> +
>> +       mutex_lock(&xs->mutex);
>> +
>> +       while (xskq_peek_desc(xs->tx, &desc)) {
>
> It is possible to pass a chain of skbs to validate_xmit_skb_list and
> eventually pass this chain to xsk_xmit_skb, amortizing the cost of
> taking the txq lock. Fine to ignore for this patch set.

Good suggestion. Will put it down on the todo list for a later patch set.

>> +               char *buffer;
>> +               u32 id, len;
>> +
>> +               if (max_batch-- == 0) {
>> +                       err = -EAGAIN;
>> +                       goto out;
>> +               }
>> +
>> +               if (xskq_reserve_id(xs->umem->cq)) {
>> +                       err = -EAGAIN;
>> +                       goto out;
>> +               }
>> +
>> +               len = desc.len;
>> +               if (unlikely(len > xs->dev->mtu)) {
>> +                       err = -EMSGSIZE;
>> +                       goto out;
>> +               }
>> +
>> +               skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
>> +               if (unlikely(!skb)) {
>> +                       err = -EAGAIN;
>> +                       goto out;
>> +               }
>> +
>> +               skb_put(skb, len);
>> +               id = desc.idx;
>> +               buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
>> +               err = skb_store_bits(skb, 0, buffer, len);
>> +               if (unlikely(err))
>> +                       goto out_store;
>
> As xsk_destruct_skb delays notification until consume_skb is called, this
> copy can be avoided by linking the xdp buffer into the skb frags array,
> analogous to tpacket_snd.
>
> You probably don't care much about the copy slow path, and this can be
> implemented later, so there is no need to do it in this patch set.

Agreed. I will also put this in the todo list for a later patch set.

> static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> +                                             struct xdp_desc *desc)
> +{
> +       struct xdp_rxtx_ring *ring;
> +
> +       if (q->cons_tail == q->cons_head) {
> +               WRITE_ONCE(q->ring->consumer, q->cons_tail);
> +               q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
> +
> +               /* Order consumer and data */
> +               smp_rmb();
> +
> +               return xskq_validate_desc(q, desc);
> +       }
> +
> +       ring = (struct xdp_rxtx_ring *)q->ring;
> +       *desc = ring->desc[q->cons_tail & q->ring_mask];
> +       return desc;
>
> This only validates descriptors if taking the branch.

Yes, that is because we only want to validate the descriptors once
even if we call this function multiple times for the same entry.

Thanks, Will. Your comments are highly appreciated.

/Magnus

* Re: [PATCH bpf-next 14/15] xsk: statistics support
  2018-04-24 16:58   ` Willem de Bruijn
@ 2018-04-25 10:50     ` Magnus Karlsson
  0 siblings, 0 replies; 54+ messages in thread
From: Magnus Karlsson @ 2018-04-25 10:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Tue, Apr 24, 2018 at 6:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> In this commit, a new getsockopt is added: XDP_STATISTICS. This is
>> used to obtain stats from the sockets.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
>> +static int xsk_getsockopt(struct socket *sock, int level, int optname,
>> +                         char __user *optval, int __user *optlen)
>> +{
>> +       struct sock *sk = sock->sk;
>> +       struct xdp_sock *xs = xdp_sk(sk);
>> +       int len;
>> +
>> +       if (level != SOL_XDP)
>> +               return -ENOPROTOOPT;
>> +
>> +       if (get_user(len, optlen))
>> +               return -EFAULT;
>> +       if (len < 0)
>> +               return -EINVAL;
>> +
>> +       switch (optname) {
>> +       case XDP_STATISTICS:
>> +       {
>> +               struct xdp_statistics stats;
>> +
>> +               if (len != sizeof(stats))
>> +                       return -EINVAL;
>> +
>> +               mutex_lock(&xs->mutex);
>> +               stats.rx_dropped = xs->rx_dropped;
>> +               stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
>> +               stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
>> +               mutex_unlock(&xs->mutex);
>> +
>> +               if (copy_to_user(optval, &stats, sizeof(stats)))
>> +                       return -EFAULT;
>> +               return 0;
>
> For forward compatibility, could allow caller to pass a struct larger
> than stats and return the number of bytes filled in.

Yes definitely. Will fix right away.

> The lock can also be elided with something like gnet_stats, but it is probably
> taken rarely enough that that is not worth the effort, at least right now.

Will put this on the ever-expanding todo list for future patches ;-).

Thanks: Magnus

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
  2018-04-23 23:16   ` Michael S. Tsirkin
@ 2018-04-25 12:37     ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-25 12:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, michael.lundkvist,
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z

2018-04-24 1:16 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, we add another setsockopt for registered user memory (umem)
>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>
>> The queue is used to explicitly pass ownership of umem frames from the
>> user process to the kernel. These frames will in a later patch be
>> filled in with Rx packet data by the kernel.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>>  net/xdp/Makefile            |  2 +-
>>  net/xdp/xdp_umem.c          |  5 ++++
>>  net/xdp/xdp_umem.h          |  2 ++
>>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index 41252135a0fe..975661e1baca 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -23,6 +23,7 @@
>>
>>  /* XDP socket options */
>>  #define XDP_UMEM_REG                 3
>> +#define XDP_UMEM_FILL_RING           4
>>
>>  struct xdp_umem_reg {
>>       __u64 addr; /* Start of packet data area */
>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>       __u32 frame_headroom; /* Frame head room */
>>  };
>>
>> +/* Pgoff for mmaping the rings */
>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>> +
>> +struct xdp_ring {
>> +     __u32 producer __attribute__((aligned(64)));
>> +     __u32 consumer __attribute__((aligned(64)));
>> +};
>> +
>> +/* Used for the fill and completion queues for buffers */
>> +struct xdp_umem_ring {
>> +     struct xdp_ring ptrs;
>> +     __u32 desc[0] __attribute__((aligned(64)));
>> +};
>> +
>>  #endif /* _LINUX_IF_XDP_H */
>> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
>> index a5d736640a0f..074fb2b2d51c 100644
>> --- a/net/xdp/Makefile
>> +++ b/net/xdp/Makefile
>> @@ -1,2 +1,2 @@
>> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
>> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>>
>> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
>> index bff058f5a769..6fc233e03f30 100644
>> --- a/net/xdp/xdp_umem.c
>> +++ b/net/xdp/xdp_umem.c
>> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>>       struct mm_struct *mm;
>>       unsigned long diff;
>>
>> +     if (umem->fq) {
>> +             xskq_destroy(umem->fq);
>> +             umem->fq = NULL;
>> +     }
>> +
>>       if (umem->pgs) {
>>               xdp_umem_unpin_pages(umem);
>>
>> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
>> index 58714f4f7f25..3086091aebdd 100644
>> --- a/net/xdp/xdp_umem.h
>> +++ b/net/xdp/xdp_umem.h
>> @@ -18,9 +18,11 @@
>>  #include <linux/mm.h>
>>  #include <linux/if_xdp.h>
>>
>> +#include "xsk_queue.h"
>>  #include "xdp_umem_props.h"
>>
>>  struct xdp_umem {
>> +     struct xsk_queue *fq;
>>       struct page **pgs;
>>       struct xdp_umem_props props;
>>       u32 npgs;
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 19fc719cbe0d..bf6a1151df28 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -32,6 +32,7 @@
>>  #include <linux/netdevice.h>
>>  #include <net/sock.h>
>>
>> +#include "xsk_queue.h"
>>  #include "xdp_umem.h"
>>
>>  struct xdp_sock {
>> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>>       return (struct xdp_sock *)sk;
>>  }
>>
>> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
>> +{
>> +     struct xsk_queue *q;
>> +
>> +     if (entries == 0 || *queue || !is_power_of_2(entries))
>> +             return -EINVAL;
>> +
>> +     q = xskq_create(entries);
>> +     if (!q)
>> +             return -ENOMEM;
>> +
>> +     *queue = q;
>> +     return 0;
>> +}
>> +
>>  static int xsk_release(struct socket *sock)
>>  {
>>       struct sock *sk = sock->sk;
>> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>>               mutex_unlock(&xs->mutex);
>>               return 0;
>>       }
>> +     case XDP_UMEM_FILL_RING:
>> +     {
>> +             struct xsk_queue **q;
>> +             int entries;
>> +
>> +             if (!xs->umem)
>> +                     return -EINVAL;
>> +
>> +             if (copy_from_user(&entries, optval, sizeof(entries)))
>> +                     return -EFAULT;
>> +
>> +             mutex_lock(&xs->mutex);
>> +             q = &xs->umem->fq;
>> +             err = xsk_init_queue(entries, q);
>> +             mutex_unlock(&xs->mutex);
>> +             return err;
>> +     }
>>       default:
>>               break;
>>       }
>> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>>       return -ENOPROTOOPT;
>>  }
>>
>> +static int xsk_mmap(struct file *file, struct socket *sock,
>> +                 struct vm_area_struct *vma)
>> +{
>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>> +     unsigned long size = vma->vm_end - vma->vm_start;
>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>> +     struct xsk_queue *q;
>> +     unsigned long pfn;
>> +     struct page *qpg;
>> +
>> +     if (!xs->umem)
>> +             return -EINVAL;
>> +
>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>> +             q = xs->umem->fq;
>> +     else
>> +             return -EINVAL;
>> +
>> +     qpg = virt_to_head_page(q->ring);
>> +     if (size > (PAGE_SIZE << compound_order(qpg)))
>> +             return -EINVAL;
>> +
>> +     pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
>> +     return remap_pfn_range(vma, vma->vm_start, pfn,
>> +                            size, vma->vm_page_prot);
>> +}
>> +
>>  static struct proto xsk_proto = {
>>       .name =         "XDP",
>>       .owner =        THIS_MODULE,
>> @@ -139,7 +199,7 @@ static const struct proto_ops xsk_proto_ops = {
>>       .getsockopt =   sock_no_getsockopt,
>>       .sendmsg =      sock_no_sendmsg,
>>       .recvmsg =      sock_no_recvmsg,
>> -     .mmap =         sock_no_mmap,
>> +     .mmap =         xsk_mmap,
>>       .sendpage =     sock_no_sendpage,
>>  };
>>
>> diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
>> new file mode 100644
>> index 000000000000..23da4f29d3fb
>> --- /dev/null
>> +++ b/net/xdp/xsk_queue.c
>> @@ -0,0 +1,58 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* XDP user-space ring structure
>> + * Copyright(c) 2018 Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/slab.h>
>> +
>> +#include "xsk_queue.h"
>> +
>> +static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
>> +{
>> +     return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
>> +}
>> +
>> +struct xsk_queue *xskq_create(u32 nentries)
>> +{
>> +     struct xsk_queue *q;
>> +     gfp_t gfp_flags;
>> +     size_t size;
>> +
>> +     q = kzalloc(sizeof(*q), GFP_KERNEL);
>> +     if (!q)
>> +             return NULL;
>> +
>> +     q->nentries = nentries;
>> +     q->ring_mask = nentries - 1;
>> +
>> +     gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
>> +                 __GFP_COMP  | __GFP_NORETRY;
>> +     size = xskq_umem_get_ring_size(q);
>> +
>> +     q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
>> +                                                   get_order(size));
>> +     if (!q->ring) {
>> +             kfree(q);
>> +             return NULL;
>> +     }
>> +
>> +     return q;
>> +}
>> +
>> +void xskq_destroy(struct xsk_queue *q)
>> +{
>> +     if (!q)
>> +             return;
>> +
>> +     page_frag_free(q->ring);
>> +     kfree(q);
>> +}
>> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
>> new file mode 100644
>> index 000000000000..7eb556bf73be
>> --- /dev/null
>> +++ b/net/xdp/xsk_queue.h
>> @@ -0,0 +1,38 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + * XDP user-space ring structure
>> + * Copyright(c) 2018 Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#ifndef _LINUX_XSK_QUEUE_H
>> +#define _LINUX_XSK_QUEUE_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/if_xdp.h>
>> +
>> +#include "xdp_umem_props.h"
>> +
>> +struct xsk_queue {
>> +     struct xdp_umem_props umem_props;
>> +     u32 ring_mask;
>> +     u32 nentries;
>> +     u32 prod_head;
>> +     u32 prod_tail;
>> +     u32 cons_head;
>> +     u32 cons_tail;
>> +     struct xdp_ring *ring;
>> +     u64 invalid_descs;
>> +};
>
> Any documentation on how e.g. the locking works here?
>

It's an SPSC (single-producer/single-consumer) queue. On the kernel
side, synchronization is guaranteed by the NAPI context. In user
space, it is the application's responsibility to do the
synchronization.

Even though the xsk_queue structure has both cons/prod members, for a
given queue only prod_ *or* cons_ will be used. For the kernel, this
means that it will consume from the fill and tx queues, and produce to
the completion and rx queues.

Note that prod_/cons_ are the *cached* local variables; the actual
producer/consumer pointers reside in the kernel/user shared xdp_ring.

I'll try to make it clearer in the documentation!
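
Spelled out, the pairing described above is (a summary comment, not
code from the patch set):

/*
 * Each ring has exactly one producer and one consumer:
 *
 *   FILL ring:        user space produces, kernel consumes
 *   RX ring:          kernel produces,     user space consumes
 *   TX ring:          user space produces, kernel consumes
 *   COMPLETION ring:  kernel produces,     user space consumes
 *
 * prod_head/prod_tail and cons_head/cons_tail in struct xsk_queue are
 * side-local caches; q->ring->producer and q->ring->consumer are the
 * pointers actually shared across the kernel/user boundary.
 */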

>
>> +
>> +struct xsk_queue *xskq_create(u32 nentries);
>> +void xskq_destroy(struct xsk_queue *q);
>> +
>> +#endif /* _LINUX_XSK_QUEUE_H */
>> --
>> 2.14.1

* Re: [PATCH bpf-next 13/15] xsk: support for Tx
  2018-04-25  9:11     ` Magnus Karlsson
@ 2018-04-25 19:00       ` Willem de Bruijn
  2018-04-26  4:02         ` Björn Töpel
  0 siblings, 1 reply; 54+ messages in thread
From: Willem de Bruijn @ 2018-04-25 19:00 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

>>> +{
>>> +       struct net_device *dev = skb->dev;
>>> +       struct sk_buff *orig_skb = skb;
>>> +       struct netdev_queue *txq;
>>> +       int ret = NETDEV_TX_BUSY;
>>> +       bool again = false;
>>> +
>>> +       if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
>>> +               goto drop;
>>> +
>>> +       skb = validate_xmit_skb_list(skb, dev, &again);
>>> +       if (skb != orig_skb)
>>> +               return NET_XMIT_DROP;
>>
>> Need to free generated segment list on error, see packet_direct_xmit.
>
> I do not use segments in the TX code for reasons of simplicity and the
> free is in the calling function. But as I will create a common
> packet_direct_xmit according to your suggestion, it will have a
> kfree_skb_list() there as in af_packet.c.

Ah yes. For these sockets it is guaranteed that skbs are not GSO skbs.
Of course, makes sense.

>> static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
>> +                                             struct xdp_desc *desc)
>> +{
>> +       struct xdp_rxtx_ring *ring;
>> +
>> +       if (q->cons_tail == q->cons_head) {
>> +               WRITE_ONCE(q->ring->consumer, q->cons_tail);
>> +               q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
>> +
>> +               /* Order consumer and data */
>> +               smp_rmb();
>> +
>> +               return xskq_validate_desc(q, desc);
>> +       }
>> +
>> +       ring = (struct xdp_rxtx_ring *)q->ring;
>> +       *desc = ring->desc[q->cons_tail & q->ring_mask];
>> +       return desc;
>>
>> This only validates descriptors if taking the branch.
>
> Yes, that is because we only want to validate the descriptors once
> even if we call this function multiple times for the same entry.

Then I am probably misreading this function. But isn't head increased
by up to RX_BATCH_SIZE frames at once? If so, then for many frames
the branch is not taken.

* Re: [PATCH bpf-next 13/15] xsk: support for Tx
  2018-04-25 19:00       ` Willem de Bruijn
@ 2018-04-26  4:02         ` Björn Töpel
  0 siblings, 0 replies; 54+ messages in thread
From: Björn Töpel @ 2018-04-26  4:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Magnus Karlsson, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Michael S. Tsirkin,
	Network Development, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

2018-04-25 21:00 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
[...]
>>> static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
>>> +                                             struct xdp_desc *desc)
>>> +{
>>> +       struct xdp_rxtx_ring *ring;
>>> +
>>> +       if (q->cons_tail == q->cons_head) {
>>> +               WRITE_ONCE(q->ring->consumer, q->cons_tail);
>>> +               q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
>>> +
>>> +               /* Order consumer and data */
>>> +               smp_rmb();
>>> +
>>> +               return xskq_validate_desc(q, desc);
>>> +       }
>>> +
>>> +       ring = (struct xdp_rxtx_ring *)q->ring;
>>> +       *desc = ring->desc[q->cons_tail & q->ring_mask];
>>> +       return desc;
>>>
>>> This only validates descriptors if taking the branch.
>>
>> Yes, that is because we only want to validate the descriptors once
>> even if we call this function multiple times for the same entry.
>
> Then I am probably misreading this function. But isn't head increased
> by up to RX_BATCH_SIZE frames at once. If so, then for many frames
> the branch is not taken.

You're not misreading it! :-) The head is indeed increased, but only
the tail descriptor is validated in that function. Later, in the
xskq_discard_desc function, when the tail is moved, the next
descriptor is validated. So, the peek function will always return a validated
descriptor, but the validation can be done in either peek or discard.
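
Written out as an invariant (with the discard-side placement inferred
from the description above):

/*
 * Invariant: the entry at cons_tail has been validated before any
 * caller can observe it.
 *
 *   xskq_peek_desc():    if cons_tail == cons_head, refresh cons_head
 *                        and validate the new tail entry; otherwise
 *                        return the already-validated tail entry.
 *   xskq_discard_desc(): advance cons_tail, then validate the entry
 *                        that becomes the new tail.
 *
 * Hence peek can run repeatedly on the same entry without
 * re-validating it.
 */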


Björn

end of thread

Thread overview: 54+ messages
2018-04-23 13:56 [PATCH bpf-next 00/15] Introducing AF_XDP support Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 01/15] net: initial AF_XDP skeleton Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt Björn Töpel
2018-04-23 16:18   ` Michael S. Tsirkin
2018-04-23 20:00     ` Björn Töpel
2018-04-23 20:11       ` Michael S. Tsirkin
2018-04-23 20:15         ` Björn Töpel
2018-04-23 20:26           ` Michael S. Tsirkin
2018-04-24  7:01             ` Björn Töpel
2018-04-23 23:04   ` Willem de Bruijn
2018-04-24  7:30     ` Björn Töpel
2018-04-24 14:27   ` kbuild test robot
2018-04-23 13:56 ` [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap Björn Töpel
2018-04-23 23:16   ` Michael S. Tsirkin
2018-04-25 12:37     ` Björn Töpel
2018-04-23 23:21   ` Michael S. Tsirkin
2018-04-23 23:59     ` Willem de Bruijn
2018-04-24  8:08       ` Magnus Karlsson
2018-04-24 16:55         ` Willem de Bruijn
2018-04-23 13:56 ` [PATCH bpf-next 04/15] xsk: add Rx queue setup and mmap support Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 05/15] xsk: add support for bind for Rx Björn Töpel
2018-04-24 16:55   ` Willem de Bruijn
2018-04-24 18:43     ` Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 06/15] xdp: introduce xdp_return_buff API Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support Björn Töpel
2018-04-24 16:56   ` Willem de Bruijn
2018-04-24 18:32     ` Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
2018-04-24 16:56   ` Willem de Bruijn
2018-04-24 18:58     ` Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 09/15] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 10/15] xsk: wire up XDP_SKB " Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 11/15] xsk: add umem completion queue support and mmap Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 12/15] xsk: add Tx queue setup and mmap support Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 13/15] xsk: support for Tx Björn Töpel
2018-04-24 16:57   ` Willem de Bruijn
2018-04-25  9:11     ` Magnus Karlsson
2018-04-25 19:00       ` Willem de Bruijn
2018-04-26  4:02         ` Björn Töpel
2018-04-23 13:56 ` [PATCH bpf-next 14/15] xsk: statistics support Björn Töpel
2018-04-24 16:58   ` Willem de Bruijn
2018-04-25 10:50     ` Magnus Karlsson
2018-04-23 13:56 ` [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets Björn Töpel
2018-04-23 23:31   ` Michael S. Tsirkin
2018-04-24  8:22     ` Magnus Karlsson
2018-04-23 23:22 ` [PATCH bpf-next 00/15] Introducing AF_XDP support Michael S. Tsirkin
2018-04-24  6:55   ` Björn Töpel
2018-04-24  7:27     ` Jesper Dangaard Brouer
2018-04-24  7:33       ` Björn Töpel
2018-04-24  2:29 ` Jason Wang
2018-04-24  8:44   ` Magnus Karlsson
2018-04-24  9:10     ` Jason Wang
2018-04-24  9:14       ` Magnus Karlsson
2018-04-24 17:03 ` Willem de Bruijn
