* [PATCH bpf-next v3 00/15] Introducing AF_XDP support
@ 2018-05-02 11:01 Björn Töpel
From: Björn Töpel @ 2018-05-02 11:01 UTC
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this patch set, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3, these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.
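
To make this concrete, here is a minimal user-space sketch (not part
of the patches themselves) that creates an XSK and sizes its two
descriptor rings. It assumes the if_xdp.h uapi header from this
series is installed; XDP_TX_RING is added later in the series, and
error handling is elided:

      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      #ifndef AF_XDP
      #define AF_XDP 44      /* address family value added in patch 1 */
      #endif
      #ifndef SOL_XDP
      #define SOL_XDP 283    /* socket option level added in patch 1 */
      #endif

      int xsk_socket_with_rings(int ndescs)
      {
              int fd = socket(AF_XDP, SOCK_RAW, 0);

              if (fd < 0)
                      return -1;

              /* Create and size the RX and TX descriptor rings. */
              setsockopt(fd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(ndescs));
              setsockopt(fd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(ndescs));
              return fd;
      }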

This new dedicated packet buffer area is called a UMEM. It consists of
a number of equally sized frames, and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
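
Continuing the sketch above (values are illustrative and errors
unchecked; the XDP_UMEM_COMPLETION_RING option is added later in the
series, in the TX patches 10-13):

      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      #define NUM_FRAMES 1024
      #define FRAME_SIZE 2048 /* power of two; at least 2048 in this set */

      int xsk_setup_umem(int fd)
      {
              struct xdp_umem_reg mr;
              int ndescs = NUM_FRAMES;
              void *bufs;

              /* Page-aligned packet buffer area, owned by user space. */
              if (posix_memalign(&bufs, getpagesize(),
                                 NUM_FRAMES * FRAME_SIZE))
                      return -1;

              mr.addr = (__u64)(unsigned long)bufs;
              mr.len = NUM_FRAMES * FRAME_SIZE;
              mr.frame_size = FRAME_SIZE;
              mr.frame_headroom = 0;
              if (setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)))
                      return -1;

              /* FILL: user -> kernel frame ids to receive into.
               * COMPLETION: kernel -> user frame ids done transmitting.
               */
              setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING,
                         &ndescs, sizeof(ndescs));
              setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                         &ndescs, sizeof(ndescs));
              return 0;
      }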

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user-space.
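
Continuing the sketch, binding might look like this, using the
sockaddr_xdp layout added in patch 5:

      #include <net/if.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      int xsk_bind_queue(int fd, const char *ifname, __u32 queue_id)
      {
              struct sockaddr_xdp sxdp;

              memset(&sxdp, 0, sizeof(sxdp));
              sxdp.sxdp_family = AF_XDP;
              sxdp.sxdp_ifindex = if_nametoindex(ifname);
              sxdp.sxdp_queue_id = queue_id;

              /* Traffic only starts to flow after a successful bind. */
              return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
      }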

A new feature in this patch set is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets
the XDP_SHARED_UMEM flag in the bind call, and submits the XSK of the
process it would like to share the UMEM with along with its own newly
created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
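
A sketch of the sharing flow (same assumptions as above): the second
socket skips XDP_UMEM_REG and the FILL/COMPLETION rings, and instead
passes the owning socket's file descriptor in bind():

      int xsk_bind_shared(int fd, int umem_owner_fd, __u32 ifindex,
                          __u32 queue_id)
      {
              struct sockaddr_xdp sxdp;

              memset(&sxdp, 0, sizeof(sxdp));
              sxdp.sxdp_family = AF_XDP;
              sxdp.sxdp_ifindex = ifindex;    /* must match the owner */
              sxdp.sxdp_queue_id = queue_id;  /* must match the owner */
              sxdp.sxdp_flags = XDP_SHARED_UMEM;
              sxdp.sxdp_shared_umem_fd = umem_owner_fd;

              /* The new socket still needs its own RX/TX rings. */
              return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
      }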

How are packets then distributed between these two XSKs? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
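
The XDP side can be sketched as a small BPF program, loosely modeled
on the xdpsock_kern.c sample in patch 15 (the real sample also
consults a queue-configuration map; this stripped-down version is a
minimal-usage assumption):

      #include <linux/bpf.h>
      #include "bpf_helpers.h" /* SEC() and helper stubs, samples/bpf */

      struct bpf_map_def SEC("maps") xsks_map = {
              .type = BPF_MAP_TYPE_XSKMAP,
              .key_size = sizeof(int),
              .value_size = sizeof(int),
              .max_entries = 4,
      };

      SEC("xdp_sock")
      int xdp_sock_prog(struct xdp_md *ctx)
      {
              /* Assume user space placed its XSK at index 0. If that
               * slot is empty, or the XSK there is bound to another
               * device or queue, the redirect fails and the packet is
               * dropped, as described above.
               */
              return bpf_redirect_map(&xsks_map, 0, 0);
      }

      char _license[] SEC("license") = "GPL";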

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or if XDP_SKB is explicitly
chosen when loading the XDP program, XDP_SKB mode is employed. It uses
SKBs together with the generic XDP support and copies the data out to
user space; this is a fallback mode that works for any network device.
On the other hand, if the driver has XDP support, it will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user-space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64-
and 1500-byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are
without retpoline so that we can compare against previous numbers.
With retpoline, the AF_XDP numbers drop by 10-15 percent.

AF_XDP performance 64-byte packets. Results from V2 in parentheses.
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.9(3.0)   9.6(9.5)  
txpush       2.6(2.5)   NA*
l2fwd        1.9(1.9)   2.5(2.5) (TX using XDP_SKB in both cases)

AF_XDP performance 1500-byte packets:
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.1(2.2)   3.3(3.3)  
l2fwd        1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)

* NA since we have no support for TX using the XDP_DRV infrastructure
  in this patch set. This is for a future patch set since it involves
  changes to the XDP NDOs. Some of this has been upstreamed by Jesper
  Dangaard Brouer.

XDP performance on our system as a baseline:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3(32.9)M  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3(3.3)M    0

Changes from V2:

* Fixed a race in the XSKMAP found by Will. The code has been
  completely rearchitected and is now simpler, faster, and hopefully
  also race free. Please review and check if it holds.

If you would like to diff V2 against V3, you can find them here:
https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next

The structure of the patch set is as follows:

Patches 1-3: Basic socket and umem plumbing 
Patches 4-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Statistics support with getsockopt()
Patch 15: Sample application

We based this patch set on bpf-next commit a3fe1f6f2ada ("tools:
bpftool: change time format for program 'loaded at:' information")

To do for this patch set:

* Syzkaller torture session being worked on

Post-series plan:

* Optimize performance

* Kernel selftest

* Loadable kernel module support for AF_XDP would be nice. It is
  unclear how to achieve this, though, since our XDP code depends on
  net/core.

* Support for AF_XDP sockets without an XDP program loaded. In this
  case, all the traffic on a queue should go up to the user-space
  socket.

* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
  XDP_PASS" for a tcpdump-like functionality.

* And of course getting to zero-copy support in small increments,
  starting with TX then adding RX. 

Thanks: Björn and Magnus

Björn Töpel (7):
  net: initial AF_XDP skeleton
  xsk: add user memory registration support sockopt
  xsk: add Rx queue setup and mmap support
  xsk: add Rx receive functions and poll support
  bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  xsk: wire up XDP_DRV side of AF_XDP
  xsk: wire up XDP_SKB side of AF_XDP

Magnus Karlsson (8):
  xsk: add umem fill queue support and mmap
  xsk: add support for bind for Rx
  xsk: add umem completion queue support and mmap
  xsk: add Tx queue setup and mmap support
  dev: packet: make packet_direct_xmit a common function
  xsk: support for Tx
  xsk: statistics support
  samples/bpf: sample application and documentation for AF_XDP sockets

 Documentation/networking/af_xdp.rst | 297 +++++++++++
 Documentation/networking/index.rst  |   1 +
 MAINTAINERS                         |   8 +
 include/linux/bpf.h                 |  25 +
 include/linux/bpf_types.h           |   3 +
 include/linux/filter.h              |   2 +-
 include/linux/netdevice.h           |   1 +
 include/linux/socket.h              |   5 +-
 include/net/xdp.h                   |   1 +
 include/net/xdp_sock.h              |  66 +++
 include/uapi/linux/bpf.h            |   1 +
 include/uapi/linux/if_xdp.h         |  87 ++++
 kernel/bpf/Makefile                 |   3 +
 kernel/bpf/verifier.c               |   8 +-
 kernel/bpf/xskmap.c                 | 239 +++++++++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/dev.c                      |  73 ++-
 net/core/filter.c                   |  40 +-
 net/core/sock.c                     |  12 +-
 net/core/xdp.c                      |  15 +-
 net/packet/af_packet.c              |  42 +-
 net/xdp/Kconfig                     |   7 +
 net/xdp/Makefile                    |   2 +
 net/xdp/xdp_umem.c                  | 260 ++++++++++
 net/xdp/xdp_umem.h                  |  67 +++
 net/xdp/xdp_umem_props.h            |  23 +
 net/xdp/xsk.c                       | 656 +++++++++++++++++++++++++
 net/xdp/xsk_queue.c                 |  73 +++
 net/xdp/xsk_queue.h                 | 247 ++++++++++
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 36 files changed, 3221 insertions(+), 72 deletions(-)
 create mode 100644 Documentation/networking/af_xdp.rst
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 kernel/bpf/xskmap.c
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

-- 
2.14.1

* [PATCH bpf-next v3 01/15] net: initial AF_XDP skeleton
@ 2018-05-02 11:01 Björn Töpel
From: Björn Töpel @ 2018-05-02 11:01 UTC
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 MAINTAINERS                         |  8 ++++++++
 include/linux/socket.h              |  5 ++++-
 net/Kconfig                         |  1 +
 net/core/sock.c                     | 12 ++++++++----
 net/xdp/Kconfig                     |  7 +++++++
 security/selinux/hooks.c            |  4 +++-
 security/selinux/include/classmap.h |  4 +++-
 7 files changed, 34 insertions(+), 7 deletions(-)
 create mode 100644 net/xdp/Kconfig

diff --git a/MAINTAINERS b/MAINTAINERS
index 537fd17a211b..52d246fd29c9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15424,6 +15424,14 @@ T:	git git://linuxtv.org/media_tree.git
 S:	Maintained
 F:	drivers/media/tuners/tuner-xc2028.*
 
+XDP SOCKETS (AF_XDP)
+M:	Björn Töpel <bjorn.topel@intel.com>
+M:	Magnus Karlsson <magnus.karlsson@intel.com>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	kernel/bpf/xskmap.c
+F:	net/xdp/
+
 XEN BLOCK SUBSYSTEM
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 M:	Roger Pau Monné <roger.pau@citrix.com>
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ea50f4a65816..7ed4713d5337 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -207,8 +207,9 @@ struct ucred {
 				 * PF_SMC protocol family that
 				 * reuses AF_INET address family
 				 */
+#define AF_XDP		44	/* XDP sockets			*/
 
-#define AF_MAX		44	/* For now.. */
+#define AF_MAX		45	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -257,6 +258,7 @@ struct ucred {
 #define PF_KCM		AF_KCM
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
+#define PF_XDP		AF_XDP
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -338,6 +340,7 @@ struct ucred {
 #define SOL_NFC		280
 #define SOL_KCM		281
 #define SOL_TLS		282
+#define SOL_XDP		283
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/net/Kconfig b/net/Kconfig
index 6fa1a4493b8c..86471a1c1ed4 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -59,6 +59,7 @@ source "net/tls/Kconfig"
 source "net/xfrm/Kconfig"
 source "net/iucv/Kconfig"
 source "net/smc/Kconfig"
+source "net/xdp/Kconfig"
 
 config INET
 	bool "TCP/IP networking"
diff --git a/net/core/sock.c b/net/core/sock.c
index b2c3db169ca1..e7d8b6c955c6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -226,7 +226,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX];
   x "AF_RXRPC" ,	x "AF_ISDN"     ,	x "AF_PHONET"   , \
   x "AF_IEEE802154",	x "AF_CAIF"	,	x "AF_ALG"      , \
   x "AF_NFC"   ,	x "AF_VSOCK"    ,	x "AF_KCM"      , \
-  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_MAX"
+  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_XDP"	, \
+  x "AF_MAX"
 
 static const char *const af_family_key_strings[AF_MAX+1] = {
 	_sock_locks("sk_lock-")
@@ -262,7 +263,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
   "rlock-AF_RXRPC" , "rlock-AF_ISDN"     , "rlock-AF_PHONET"   ,
   "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG"      ,
   "rlock-AF_NFC"   , "rlock-AF_VSOCK"    , "rlock-AF_KCM"      ,
-  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_MAX"
+  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_XDP"      ,
+  "rlock-AF_MAX"
 };
 static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_UNSPEC", "wlock-AF_UNIX"     , "wlock-AF_INET"     ,
@@ -279,7 +281,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_RXRPC" , "wlock-AF_ISDN"     , "wlock-AF_PHONET"   ,
   "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG"      ,
   "wlock-AF_NFC"   , "wlock-AF_VSOCK"    , "wlock-AF_KCM"      ,
-  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_MAX"
+  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_XDP"      ,
+  "wlock-AF_MAX"
 };
 static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_UNSPEC", "elock-AF_UNIX"     , "elock-AF_INET"     ,
@@ -296,7 +299,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_RXRPC" , "elock-AF_ISDN"     , "elock-AF_PHONET"   ,
   "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG"      ,
   "elock-AF_NFC"   , "elock-AF_VSOCK"    , "elock-AF_KCM"      ,
-  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_MAX"
+  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_XDP"      ,
+  "elock-AF_MAX"
 };
 
 /*
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
new file mode 100644
index 000000000000..90e4a7152854
--- /dev/null
+++ b/net/xdp/Kconfig
@@ -0,0 +1,7 @@
+config XDP_SOCKETS
+	bool "XDP sockets"
+	depends on BPF_SYSCALL
+	default n
+	help
+	  XDP sockets allow a channel between XDP programs and
+	  userspace applications.
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a19167..5c508d26b367 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1471,7 +1471,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 			return SECCLASS_QIPCRTR_SOCKET;
 		case PF_SMC:
 			return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+		case PF_XDP:
+			return SECCLASS_XDP_SOCKET;
+#if PF_MAX > 45
 #error New address family defined, please update this function.
 #endif
 		}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 7f0372426494..bd5fe0d3204a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -240,9 +240,11 @@ struct security_class_mapping secclass_map[] = {
 	  { "manage_subnet", NULL } },
 	{ "bpf",
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
+	{ "xdp_socket",
+	  { COMMON_SOCK_PERMS, NULL } },
 	{ NULL }
   };
 
-#if PF_MAX > 44
+#if PF_MAX > 45
 #error New address family defined, please update secclass_map.
 #endif
-- 
2.14.1

* [PATCH bpf-next v3 02/15] xsk: add user memory registration support sockopt
@ 2018-05-02 11:01 Björn Töpel
From: Björn Töpel @ 2018-05-02 11:01 UTC
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

In this commit the base structure of the AF_XDP address family is set
up. Further, we introduce the ability to register a window of user memory
to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
window is viewed by an AF_XDP socket as a set of equally large
frames. After a user memory registration all frames are "owned" by the
user application, and not the kernel.

v2: More robust checks on umem creation and unaccount on error.
    Call set_page_dirty_lock on cleanup.
    Simplified xdp_umem_reg.

Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h      |  31 ++++++
 include/uapi/linux/if_xdp.h |  34 ++++++
 net/Makefile                |   1 +
 net/xdp/Makefile            |   2 +
 net/xdp/xdp_umem.c          | 245 ++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h          |  45 ++++++++
 net/xdp/xdp_umem_props.h    |  23 +++++
 net/xdp/xsk.c               | 215 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 596 insertions(+)
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
new file mode 100644
index 000000000000..94785f5db13e
--- /dev/null
+++ b/include/net/xdp_sock.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * AF_XDP internal functions
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_SOCK_H
+#define _LINUX_XDP_SOCK_H
+
+#include <linux/mutex.h>
+#include <net/sock.h>
+
+struct xdp_umem;
+
+struct xdp_sock {
+	/* struct sock must be the first member of struct xdp_sock */
+	struct sock sk;
+	struct xdp_umem *umem;
+	/* Protects multiple processes in the control path */
+	struct mutex mutex;
+};
+
+#endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
new file mode 100644
index 000000000000..41252135a0fe
--- /dev/null
+++ b/include/uapi/linux/if_xdp.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * if_xdp: XDP socket user-space interface
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#ifndef _LINUX_IF_XDP_H
+#define _LINUX_IF_XDP_H
+
+#include <linux/types.h>
+
+/* XDP socket options */
+#define XDP_UMEM_REG			3
+
+struct xdp_umem_reg {
+	__u64 addr; /* Start of packet data area */
+	__u64 len; /* Length of packet data area */
+	__u32 frame_size; /* Frame size */
+	__u32 frame_headroom; /* Frame head room */
+};
+
+#endif /* _LINUX_IF_XDP_H */
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..77aaddedbd29 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -85,3 +85,4 @@ obj-y				+= l3mdev/
 endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
+obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
new file mode 100644
index 000000000000..a5d736640a0f
--- /dev/null
+++ b/net/xdp/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
new file mode 100644
index 000000000000..ec8b3552be44
--- /dev/null
+++ b/net/xdp/xdp_umem.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/mm.h>
+
+#include "xdp_umem.h"
+
+#define XDP_UMEM_MIN_FRAME_SIZE 2048
+
+int xdp_umem_create(struct xdp_umem **umem)
+{
+	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
+
+	if (!(*umem))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void xdp_umem_unpin_pages(struct xdp_umem *umem)
+{
+	unsigned int i;
+
+	if (umem->pgs) {
+		for (i = 0; i < umem->npgs; i++) {
+			struct page *page = umem->pgs[i];
+
+			set_page_dirty_lock(page);
+			put_page(page);
+		}
+
+		kfree(umem->pgs);
+		umem->pgs = NULL;
+	}
+}
+
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+{
+	if (umem->user) {
+		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
+		free_uid(umem->user);
+	}
+}
+
+static void xdp_umem_release(struct xdp_umem *umem)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	if (umem->pgs) {
+		xdp_umem_unpin_pages(umem);
+
+		task = get_pid_task(umem->pid, PIDTYPE_PID);
+		put_pid(umem->pid);
+		if (!task)
+			goto out;
+		mm = get_task_mm(task);
+		put_task_struct(task);
+		if (!mm)
+			goto out;
+
+		mmput(mm);
+		umem->pgs = NULL;
+	}
+
+	xdp_umem_unaccount_pages(umem);
+out:
+	kfree(umem);
+}
+
+static void xdp_umem_release_deferred(struct work_struct *work)
+{
+	struct xdp_umem *umem = container_of(work, struct xdp_umem, work);
+
+	xdp_umem_release(umem);
+}
+
+void xdp_get_umem(struct xdp_umem *umem)
+{
+	atomic_inc(&umem->users);
+}
+
+void xdp_put_umem(struct xdp_umem *umem)
+{
+	if (!umem)
+		return;
+
+	if (atomic_dec_and_test(&umem->users)) {
+		INIT_WORK(&umem->work, xdp_umem_release_deferred);
+		schedule_work(&umem->work);
+	}
+}
+
+static int xdp_umem_pin_pages(struct xdp_umem *umem)
+{
+	unsigned int gup_flags = FOLL_WRITE;
+	long npgs;
+	int err;
+
+	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
+	if (!umem->pgs)
+		return -ENOMEM;
+
+	down_write(&current->mm->mmap_sem);
+	npgs = get_user_pages(umem->address, umem->npgs,
+			      gup_flags, &umem->pgs[0], NULL);
+	up_write(&current->mm->mmap_sem);
+
+	if (npgs != umem->npgs) {
+		if (npgs >= 0) {
+			umem->npgs = npgs;
+			err = -ENOMEM;
+			goto out_pin;
+		}
+		err = npgs;
+		goto out_pgs;
+	}
+	return 0;
+
+out_pin:
+	xdp_umem_unpin_pages(umem);
+out_pgs:
+	kfree(umem->pgs);
+	umem->pgs = NULL;
+	return err;
+}
+
+static int xdp_umem_account_pages(struct xdp_umem *umem)
+{
+	unsigned long lock_limit, new_npgs, old_npgs;
+
+	if (capable(CAP_IPC_LOCK))
+		return 0;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	umem->user = get_uid(current_user());
+
+	do {
+		old_npgs = atomic_long_read(&umem->user->locked_vm);
+		new_npgs = old_npgs + umem->npgs;
+		if (new_npgs > lock_limit) {
+			free_uid(umem->user);
+			umem->user = NULL;
+			return -ENOBUFS;
+		}
+	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
+				     new_npgs) != old_npgs);
+	return 0;
+}
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
+{
+	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u64 addr = mr->addr, size = mr->len;
+	unsigned int nframes, nfpp;
+	int size_chk, err;
+
+	if (!umem)
+		return -EINVAL;
+
+	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+		/* Strictly speaking we could support this, if:
+		 * - huge pages, or
+		 * - using an IOMMU, or
+		 * - making sure the memory area is consecutive
+		 * but for now, we simply say "computer says no".
+		 */
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(frame_size))
+		return -EINVAL;
+
+	if (!PAGE_ALIGNED(addr)) {
+		/* Memory area has to be page size aligned. For
+		 * simplicity, this might change.
+		 */
+		return -EINVAL;
+	}
+
+	if ((addr + size) < addr)
+		return -EINVAL;
+
+	nframes = size / frame_size;
+	if (nframes == 0 || nframes > UINT_MAX)
+		return -EINVAL;
+
+	nfpp = PAGE_SIZE / frame_size;
+	if (nframes < nfpp || nframes % nfpp)
+		return -EINVAL;
+
+	frame_headroom = ALIGN(frame_headroom, 64);
+
+	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	if (size_chk < 0)
+		return -EINVAL;
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+	umem->size = (size_t)size;
+	umem->address = (unsigned long)addr;
+	umem->props.frame_size = frame_size;
+	umem->props.nframes = nframes;
+	umem->frame_headroom = frame_headroom;
+	umem->npgs = size / PAGE_SIZE;
+	umem->pgs = NULL;
+	umem->user = NULL;
+
+	umem->frame_size_log2 = ilog2(frame_size);
+	umem->nfpp_mask = nfpp - 1;
+	umem->nfpplog2 = ilog2(nfpp);
+	atomic_set(&umem->users, 1);
+
+	err = xdp_umem_account_pages(umem);
+	if (err)
+		goto out;
+
+	err = xdp_umem_pin_pages(umem);
+	if (err)
+		goto out_account;
+	return 0;
+
+out_account:
+	xdp_umem_unaccount_pages(umem);
+out:
+	put_pid(umem->pid);
+	return err;
+}
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
new file mode 100644
index 000000000000..4597ae81a221
--- /dev/null
+++ b/net/xdp/xdp_umem.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_H_
+#define XDP_UMEM_H_
+
+#include <linux/mm.h>
+#include <linux/if_xdp.h>
+#include <linux/workqueue.h>
+
+#include "xdp_umem_props.h"
+
+struct xdp_umem {
+	struct page **pgs;
+	struct xdp_umem_props props;
+	u32 npgs;
+	u32 frame_headroom;
+	u32 nfpp_mask;
+	u32 nfpplog2;
+	u32 frame_size_log2;
+	struct user_struct *user;
+	struct pid *pid;
+	unsigned long address;
+	size_t size;
+	atomic_t users;
+	struct work_struct work;
+};
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
+void xdp_get_umem(struct xdp_umem *umem);
+void xdp_put_umem(struct xdp_umem *umem);
+int xdp_umem_create(struct xdp_umem **umem);
+
+#endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
new file mode 100644
index 000000000000..77fb5daf29f3
--- /dev/null
+++ b/net/xdp/xdp_umem_props.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_PROPS_H_
+#define XDP_UMEM_PROPS_H_
+
+struct xdp_umem_props {
+	u32 frame_size;
+	u32 nframes;
+};
+
+#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
new file mode 100644
index 000000000000..84e0e867febb
--- /dev/null
+++ b/net/xdp/xsk.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP sockets
+ *
+ * AF_XDP sockets allow a channel between XDP programs and userspace
+ * applications.
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
+
+#include <linux/if_xdp.h>
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/socket.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
+#include <net/xdp_sock.h>
+
+#include "xdp_umem.h"
+
+static struct xdp_sock *xdp_sk(struct sock *sk)
+{
+	return (struct xdp_sock *)sk;
+}
+
+static int xsk_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct net *net;
+
+	if (!sk)
+		return 0;
+
+	net = sock_net(sk);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, sk->sk_prot, -1);
+	local_bh_enable();
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+
+	sk_refcnt_debug_release(sk);
+	sock_put(sk);
+
+	return 0;
+}
+
+static int xsk_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int err;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case XDP_UMEM_REG:
+	{
+		struct xdp_umem_reg mr;
+		struct xdp_umem *umem;
+
+		if (xs->umem)
+			return -EBUSY;
+
+		if (copy_from_user(&mr, optval, sizeof(mr)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		err = xdp_umem_create(&umem);
+
+		err = xdp_umem_reg(umem, &mr);
+		if (err) {
+			kfree(umem);
+			mutex_unlock(&xs->mutex);
+			return err;
+		}
+
+		/* Make sure umem is ready before it can be seen by others */
+		smp_wmb();
+
+		xs->umem = umem;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -ENOPROTOOPT;
+}
+
+static struct proto xsk_proto = {
+	.name =		"XDP",
+	.owner =	THIS_MODULE,
+	.obj_size =	sizeof(struct xdp_sock),
+};
+
+static const struct proto_ops xsk_proto_ops = {
+	.family =	PF_XDP,
+	.owner =	THIS_MODULE,
+	.release =	xsk_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		sock_no_poll,
+	.ioctl =	sock_no_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	xsk_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	sock_no_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static void xsk_destruct(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		return;
+
+	xdp_put_umem(xs->umem);
+
+	sk_refcnt_debug_dec(sk);
+}
+
+static int xsk_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	struct sock *sk;
+	struct xdp_sock *xs;
+
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
+		return -EPERM;
+	if (sock->type != SOCK_RAW)
+		return -ESOCKTNOSUPPORT;
+
+	if (protocol)
+		return -EPROTONOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+
+	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
+	if (!sk)
+		return -ENOBUFS;
+
+	sock->ops = &xsk_proto_ops;
+
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_XDP;
+
+	sk->sk_destruct = xsk_destruct;
+	sk_refcnt_debug_inc(sk);
+
+	xs = xdp_sk(sk);
+	mutex_init(&xs->mutex);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, &xsk_proto, 1);
+	local_bh_enable();
+
+	return 0;
+}
+
+static const struct net_proto_family xsk_family_ops = {
+	.family = PF_XDP,
+	.create = xsk_create,
+	.owner	= THIS_MODULE,
+};
+
+static int __init xsk_init(void)
+{
+	int err;
+
+	err = proto_register(&xsk_proto, 0 /* no slab */);
+	if (err)
+		goto out;
+
+	err = sock_register(&xsk_family_ops);
+	if (err)
+		goto out_proto;
+
+	return 0;
+
+out_proto:
+	proto_unregister(&xsk_proto);
+out:
+	return err;
+}
+
+fs_initcall(xsk_init);
-- 
2.14.1

* [PATCH bpf-next v3 03/15] xsk: add umem fill queue support and mmap
@ 2018-05-02 11:01 Björn Töpel
From: Björn Töpel @ 2018-05-02 11:01 UTC
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_FILL_RING. Using this socket option, the process can
ask the kernel to allocate a queue (ring buffer) and also mmap it
(using the XDP_UMEM_PGOFF_FILL_RING offset) into the process.

The queue is used to explicitly pass ownership of umem frames from the
user process to the kernel. These frames will in a later patch be
filled in with Rx packet data by the kernel.
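
For orientation (not part of this patch), user space would mmap the
ring and produce frame ids roughly as sketched below; the layout
follows struct xdp_umem_ring from this patch, and the barrier choice
is an illustrative assumption:

      #include <sys/mman.h>
      #include <linux/if_xdp.h>

      /* fd: XSK fd; ndescs: XDP_UMEM_FILL_RING size (power of two) */
      static struct xdp_umem_ring *fq_mmap(int fd, __u32 ndescs)
      {
              size_t len = sizeof(struct xdp_ring) +
                           ndescs * sizeof(__u32);

              return mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd,
                          XDP_UMEM_PGOFF_FILL_RING);
      }

      static void fq_produce(struct xdp_umem_ring *fq, __u32 ndescs,
                             __u32 frame_id)
      {
              /* Hand one frame id to the kernel to receive into. */
              fq->desc[fq->ptrs.producer & (ndescs - 1)] = frame_id;
              __sync_synchronize(); /* publish desc before producer bump */
              fq->ptrs.producer++;
      }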

v2: Fixed potential crash in xsk_mmap.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 15 +++++++++++
 net/xdp/Makefile            |  2 +-
 net/xdp/xdp_umem.c          |  5 ++++
 net/xdp/xdp_umem.h          |  2 ++
 net/xdp/xsk.c               | 65 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 38 ++++++++++++++++++++++++++
 7 files changed, 183 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 41252135a0fe..975661e1baca 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -23,6 +23,7 @@
 
 /* XDP socket options */
 #define XDP_UMEM_REG			3
+#define XDP_UMEM_FILL_RING		4
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -31,4 +32,18 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+/* Pgoff for mmaping the rings */
+#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
+
+struct xdp_ring {
+	__u32 producer __attribute__((aligned(64)));
+	__u32 consumer __attribute__((aligned(64)));
+};
+
+/* Used for the fill and completion queues for buffers */
+struct xdp_umem_ring {
+	struct xdp_ring ptrs;
+	__u32 desc[0] __attribute__((aligned(64)));
+};
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index a5d736640a0f..074fb2b2d51c 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1,2 +1,2 @@
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
 
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index ec8b3552be44..e1f627d0cc1c 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -65,6 +65,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct task_struct *task;
 	struct mm_struct *mm;
 
+	if (umem->fq) {
+		xskq_destroy(umem->fq);
+		umem->fq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 4597ae81a221..25634b8a5c6f 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -19,9 +19,11 @@
 #include <linux/if_xdp.h>
 #include <linux/workqueue.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem_props.h"
 
 struct xdp_umem {
+	struct xsk_queue *fq;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 84e0e867febb..da67a3c5c1c9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -32,6 +32,7 @@
 #include <linux/netdevice.h>
 #include <net/xdp_sock.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem.h"
 
 static struct xdp_sock *xdp_sk(struct sock *sk)
@@ -39,6 +40,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+{
+	struct xsk_queue *q;
+
+	if (entries == 0 || *queue || !is_power_of_2(entries))
+		return -EINVAL;
+
+	q = xskq_create(entries);
+	if (!q)
+		return -ENOMEM;
+
+	*queue = q;
+	return 0;
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -101,6 +117,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return 0;
 	}
+	case XDP_UMEM_FILL_RING:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (!xs->umem)
+			return -EINVAL;
+
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->umem->fq;
+		err = xsk_init_queue(entries, q);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	default:
 		break;
 	}
@@ -108,6 +141,36 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_mmap(struct file *file, struct socket *sock,
+		    struct vm_area_struct *vma)
+{
+	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct xdp_sock *xs = xdp_sk(sock->sk);
+	struct xsk_queue *q = NULL;
+	unsigned long pfn;
+	struct page *qpg;
+
+	if (!xs->umem)
+		return -EINVAL;
+
+	if (offset == XDP_UMEM_PGOFF_FILL_RING)
+		q = xs->umem->fq;
+	else
+		return -EINVAL;
+
+	if (!q)
+		return -EINVAL;
+
+	qpg = virt_to_head_page(q->ring);
+	if (size > (PAGE_SIZE << compound_order(qpg)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn,
+			       size, vma->vm_page_prot);
+}
+
 static struct proto xsk_proto = {
 	.name =		"XDP",
 	.owner =	THIS_MODULE,
@@ -131,7 +194,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	sock_no_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
-	.mmap =		sock_no_mmap,
+	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
 };
 
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
new file mode 100644
index 000000000000..23da4f29d3fb
--- /dev/null
+++ b/net/xdp/xsk_queue.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/slab.h>
+
+#include "xsk_queue.h"
+
+static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
+{
+	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
+}
+
+struct xsk_queue *xskq_create(u32 nentries)
+{
+	struct xsk_queue *q;
+	gfp_t gfp_flags;
+	size_t size;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return NULL;
+
+	q->nentries = nentries;
+	q->ring_mask = nentries - 1;
+
+	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+		    __GFP_COMP  | __GFP_NORETRY;
+	size = xskq_umem_get_ring_size(q);
+
+	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
+						      get_order(size));
+	if (!q->ring) {
+		kfree(q);
+		return NULL;
+	}
+
+	return q;
+}
+
+void xskq_destroy(struct xsk_queue *q)
+{
+	if (!q)
+		return;
+
+	page_frag_free(q->ring);
+	kfree(q);
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
new file mode 100644
index 000000000000..7eb556bf73be
--- /dev/null
+++ b/net/xdp/xsk_queue.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XSK_QUEUE_H
+#define _LINUX_XSK_QUEUE_H
+
+#include <linux/types.h>
+#include <linux/if_xdp.h>
+
+#include "xdp_umem_props.h"
+
+struct xsk_queue {
+	struct xdp_umem_props umem_props;
+	u32 ring_mask;
+	u32 nentries;
+	u32 prod_head;
+	u32 prod_tail;
+	u32 cons_head;
+	u32 cons_tail;
+	struct xdp_ring *ring;
+	u64 invalid_descs;
+};
+
+struct xsk_queue *xskq_create(u32 nentries);
+void xskq_destroy(struct xsk_queue *q);
+
+#endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

* [PATCH bpf-next v3 04/15] xsk: add Rx queue setup and mmap support
@ 2018-05-02 11:01 Björn Töpel
From: Björn Töpel @ 2018-05-02 11:01 UTC
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Another setsockopt (XDP_RX_RING) is added to let the process allocate
a queue, on which the kernel can pass completed Rx frames to the user
process.

The mmapping of the queue is done using the XDP_PGOFF_RX_RING offset.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h      |  4 ++++
 include/uapi/linux/if_xdp.h | 16 ++++++++++++++++
 net/xdp/xsk.c               | 41 ++++++++++++++++++++++++++++++++---------
 net/xdp/xsk_queue.c         | 11 +++++++++--
 net/xdp/xsk_queue.h         |  2 +-
 5 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 94785f5db13e..db9a321de087 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -18,11 +18,15 @@
 #include <linux/mutex.h>
 #include <net/sock.h>
 
+struct net_device;
+struct xsk_queue;
 struct xdp_umem;
 
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
+	struct xsk_queue *rx;
+	struct net_device *dev;
 	struct xdp_umem *umem;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 975661e1baca..65324558829d 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -22,6 +22,7 @@
 #include <linux/types.h>
 
 /* XDP socket options */
+#define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 
@@ -33,13 +34,28 @@ struct xdp_umem_reg {
 };
 
 /* Pgoff for mmaping the rings */
+#define XDP_PGOFF_RX_RING			  0
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
 
+struct xdp_desc {
+	__u32 idx;
+	__u32 len;
+	__u16 offset;
+	__u8 flags;
+	__u8 padding[5];
+};
+
 struct xdp_ring {
 	__u32 producer __attribute__((aligned(64)));
 	__u32 consumer __attribute__((aligned(64)));
 };
 
+/* Used for the RX and TX queues for packets */
+struct xdp_rxtx_ring {
+	struct xdp_ring ptrs;
+	struct xdp_desc desc[0] __attribute__((aligned(64)));
+};
+
 /* Used for the fill and completion queues for buffers */
 struct xdp_umem_ring {
 	struct xdp_ring ptrs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index da67a3c5c1c9..92bd9b7e548f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -31,6 +31,7 @@
 #include <linux/net.h>
 #include <linux/netdevice.h>
 #include <net/xdp_sock.h>
+#include <net/xdp.h>
 
 #include "xsk_queue.h"
 #include "xdp_umem.h"
@@ -40,14 +41,15 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
-static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
+			  bool umem_queue)
 {
 	struct xsk_queue *q;
 
 	if (entries == 0 || *queue || !is_power_of_2(entries))
 		return -EINVAL;
 
-	q = xskq_create(entries);
+	q = xskq_create(entries, umem_queue);
 	if (!q)
 		return -ENOMEM;
 
@@ -89,6 +91,22 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return -ENOPROTOOPT;
 
 	switch (optname) {
+	case XDP_RX_RING:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (optlen < sizeof(entries))
+			return -EINVAL;
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->rx;
+		err = xsk_init_queue(entries, q, false);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	case XDP_UMEM_REG:
 	{
 		struct xdp_umem_reg mr;
@@ -130,7 +148,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 		mutex_lock(&xs->mutex);
 		q = &xs->umem->fq;
-		err = xsk_init_queue(entries, q);
+		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
@@ -151,13 +169,17 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 	unsigned long pfn;
 	struct page *qpg;
 
-	if (!xs->umem)
-		return -EINVAL;
+	if (offset == XDP_PGOFF_RX_RING) {
+		q = xs->rx;
+	} else {
+		if (!xs->umem)
+			return -EINVAL;
 
-	if (offset == XDP_UMEM_PGOFF_FILL_RING)
-		q = xs->umem->fq;
-	else
-		return -EINVAL;
+		if (offset == XDP_UMEM_PGOFF_FILL_RING)
+			q = xs->umem->fq;
+		else
+			return -EINVAL;
+	}
 
 	if (!q)
 		return -EINVAL;
@@ -205,6 +227,7 @@ static void xsk_destruct(struct sock *sk)
 	if (!sock_flag(sk, SOCK_DEAD))
 		return;
 
+	xskq_destroy(xs->rx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 23da4f29d3fb..894f9f89afc7 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -21,7 +21,13 @@ static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
 }
 
-struct xsk_queue *xskq_create(u32 nentries)
+static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
+{
+	return (sizeof(struct xdp_ring) +
+		q->nentries * sizeof(struct xdp_desc));
+}
+
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
 {
 	struct xsk_queue *q;
 	gfp_t gfp_flags;
@@ -36,7 +42,8 @@ struct xsk_queue *xskq_create(u32 nentries)
 
 	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
 		    __GFP_COMP  | __GFP_NORETRY;
-	size = xskq_umem_get_ring_size(q);
+	size = umem_queue ? xskq_umem_get_ring_size(q) :
+	       xskq_rxtx_get_ring_size(q);
 
 	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
 						      get_order(size));
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 7eb556bf73be..5439fa381763 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -32,7 +32,7 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
-struct xsk_queue *xskq_create(u32 nentries);
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q);
 
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

* [PATCH bpf-next v3 05/15] xsk: add support for bind for Rx
@ 2018-05-02 11:01 Björn Töpel
From: Björn Töpel @ 2018-05-02 11:01 UTC
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the bind syscall is added. Binding an AF_XDP socket means
associating the socket with an umem, a netdev and a queue index. This
can be done in two ways.

The first way is creating a "socket from scratch". Create the umem
using the XDP_UMEM_REG setsockopt and an associated fill ring with
XDP_UMEM_FILL_RING. Create the Rx ring using the XDP_RX_RING
setsockopt. Call bind, passing the ifindex and queue index ("channel"
in ethtool speak).

The second way to bind a socket is to simply skip the
umem/netdev/queue index setup and instead pass another, already set
up, AF_XDP socket. The new socket will then have the same
umem/netdev/queue index as the parent, so it will share the same
umem. You must also set the flags field in the socket address to
XDP_SHARED_UMEM.

v2: Use PTR_ERR instead of passing error variable explicitly.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h      |   1 +
 include/uapi/linux/if_xdp.h |  11 ++++
 net/xdp/xdp_umem.c          |   5 ++
 net/xdp/xdp_umem.h          |   1 +
 net/xdp/xsk.c               | 124 +++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         |   8 +++
 net/xdp/xsk_queue.h         |   1 +
 7 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index db9a321de087..85d02512f59b 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -28,6 +28,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 };
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 65324558829d..e5091881f776 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -21,6 +21,17 @@
 
 #include <linux/types.h>
 
+/* Options for the sxdp_flags field */
+#define XDP_SHARED_UMEM 1
+
+struct sockaddr_xdp {
+	__u16 sxdp_family;
+	__u32 sxdp_ifindex;
+	__u32 sxdp_queue_id;
+	__u32 sxdp_shared_umem_fd;
+	__u16 sxdp_flags;
+};
+
 /* XDP socket options */
 #define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index e1f627d0cc1c..9bac1ad570fa 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -248,3 +248,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	put_pid(umem->pid);
 	return err;
 }
+
+bool xdp_umem_validate_queues(struct xdp_umem *umem)
+{
+	return umem->fq;
+}
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 25634b8a5c6f..b13133e9c501 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -39,6 +39,7 @@ struct xdp_umem {
 	struct work_struct work;
 };
 
+bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 92bd9b7e548f..bf2c97b87992 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -57,9 +57,18 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 	return 0;
 }
 
+static void __xsk_release(struct xdp_sock *xs)
+{
+	/* Wait for driver to stop using the xdp socket. */
+	synchronize_net();
+
+	dev_put(xs->dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
 	struct net *net;
 
 	if (!sk)
@@ -71,6 +80,11 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	if (xs->dev) {
+		__xsk_release(xs);
+		xs->dev = NULL;
+	}
+
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -80,6 +94,114 @@ static int xsk_release(struct socket *sock)
 	return 0;
 }
 
+static struct socket *xsk_lookup_xsk_from_fd(int fd)
+{
+	struct socket *sock;
+	int err;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return ERR_PTR(-ENOTSOCK);
+
+	if (sock->sk->sk_family != PF_XDP) {
+		sockfd_put(sock);
+		return ERR_PTR(-ENOPROTOOPT);
+	}
+
+	return sock;
+}
+
+static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
+	struct sock *sk = sock->sk;
+	struct net_device *dev, *dev_curr;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct xdp_umem *old_umem = NULL;
+	int err = 0;
+
+	if (addr_len < sizeof(struct sockaddr_xdp))
+		return -EINVAL;
+	if (sxdp->sxdp_family != AF_XDP)
+		return -EINVAL;
+
+	mutex_lock(&xs->mutex);
+	dev_curr = xs->dev;
+	dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex);
+	if (!dev) {
+		err = -ENODEV;
+		goto out_release;
+	}
+
+	if (!xs->rx) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_queue_id >= dev->num_rx_queues) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_flags & XDP_SHARED_UMEM) {
+		struct xdp_sock *umem_xs;
+		struct socket *sock;
+
+		if (xs->umem) {
+			/* We have already our own. */
+			err = -EINVAL;
+			goto out_unlock;
+		}
+
+		sock = xsk_lookup_xsk_from_fd(sxdp->sxdp_shared_umem_fd);
+		if (IS_ERR(sock)) {
+			err = PTR_ERR(sock);
+			goto out_unlock;
+		}
+
+		umem_xs = xdp_sk(sock->sk);
+		if (!umem_xs->umem) {
+			/* No umem to inherit. */
+			err = -EBADF;
+			sockfd_put(sock);
+			goto out_unlock;
+		} else if (umem_xs->dev != dev ||
+			   umem_xs->queue_id != sxdp->sxdp_queue_id) {
+			err = -EINVAL;
+			sockfd_put(sock);
+			goto out_unlock;
+		}
+
+		xdp_get_umem(umem_xs->umem);
+		old_umem = xs->umem;
+		xs->umem = umem_xs->umem;
+		sockfd_put(sock);
+	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Rebind? */
+	if (dev_curr && (dev_curr != dev ||
+			 xs->queue_id != sxdp->sxdp_queue_id)) {
+		__xsk_release(xs);
+		if (old_umem)
+			xdp_put_umem(old_umem);
+	}
+
+	xs->dev = dev;
+	xs->queue_id = sxdp->sxdp_queue_id;
+
+	xskq_set_umem(xs->rx, &xs->umem->props);
+
+out_unlock:
+	if (err)
+		dev_put(dev);
+out_release:
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
 static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, unsigned int optlen)
 {
@@ -203,7 +325,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.family =	PF_XDP,
 	.owner =	THIS_MODULE,
 	.release =	xsk_release,
-	.bind =		sock_no_bind,
+	.bind =		xsk_bind,
 	.connect =	sock_no_connect,
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 894f9f89afc7..d012e5e23591 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -16,6 +16,14 @@
 
 #include "xsk_queue.h"
 
+void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props)
+{
+	if (!q)
+		return;
+
+	q->umem_props = *umem_props;
+}
+
 static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 {
 	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 5439fa381763..9ddd2ee07a84 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -32,6 +32,7 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q);
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 06/15] xsk: add Rx receive functions and poll support
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (4 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 05/15] xsk: add support for bind for Rx Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-04 12:59   ` Daniel Borkmann
  2018-05-02 11:01 ` [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Here the actual receive functions of AF_XDP are implemented; a later
commit will call them from the XDP layers.

There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).

A new XDP API, xdp_return_buff, is also introduced. It is analogous
to xdp_return_frame, but acts upon a struct xdp_buff rather than a
struct xdp_frame. The API will be used by AF_XDP in future commits.

Support for the poll syscall is also implemented.
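
A user-space sketch of the resulting poll semantics (assuming an
already bound socket with an Rx ring):

#include <poll.h>

/* Block until the Rx ring has at least one filled descriptor. */
static int wait_for_rx(int xsk_fd)
{
	struct pollfd pfd = {
		.fd = xsk_fd,
		.events = POLLIN,
	};

	if (poll(&pfd, 1, -1) <= 0)
		return -1;

	return (pfd.revents & POLLIN) ? 0 : -1;
}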

v2: xskq_validate_id did not update cons_tail.
    The entries variable was calculated twice in xskq_nb_avail.
    Squashed xdp_return_buff commit.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h      |   1 +
 include/net/xdp_sock.h |  22 ++++++++++
 net/core/xdp.c         |  15 +++++--
 net/xdp/xdp_umem.h     |  18 ++++++++
 net/xdp/xsk.c          |  73 ++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h    | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 238 insertions(+), 5 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 137ad5f9f40f..0b689cf561c7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -104,6 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 }
 
 void xdp_return_frame(struct xdp_frame *xdpf);
+void xdp_return_buff(struct xdp_buff *xdp);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 85d02512f59b..a0342dff6a4d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -31,6 +31,28 @@ struct xdp_sock {
 	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
+	u64 rx_dropped;
 };
 
+struct xdp_buff;
+#ifdef CONFIG_XDP_SOCKETS
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+void xsk_flush(struct xdp_sock *xs);
+#else
+static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline void xsk_flush(struct xdp_sock *xs)
+{
+}
+#endif /* CONFIG_XDP_SOCKETS */
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 0c86b53a3a63..bf6758f74339 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -308,11 +308,9 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(struct xdp_frame *xdpf)
+static void xdp_return(void *data, struct xdp_mem_info *mem)
 {
-	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
-	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {
@@ -339,4 +337,15 @@ void xdp_return_frame(struct xdp_frame *xdpf)
 		break;
 	}
 }
+
+void xdp_return_frame(struct xdp_frame *xdpf)
+{
+	xdp_return(xdpf->data, &xdpf->mem);
+}
 EXPORT_SYMBOL_GPL(xdp_return_frame);
+
+void xdp_return_buff(struct xdp_buff *xdp)
+{
+	xdp_return(xdp->data, &xdp->rxq->mem);
+}
+EXPORT_SYMBOL_GPL(xdp_return_buff);
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index b13133e9c501..c7378a11721f 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -39,6 +39,24 @@ struct xdp_umem {
 	struct work_struct work;
 };
 
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
+{
+	u64 pg, off;
+	char *data;
+
+	pg = idx >> umem->nfpplog2;
+	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
+
+	data = page_address(umem->pgs[pg]);
+	return data + off;
+}
+
+static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
+						    u32 idx)
+{
+	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+}
+
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index bf2c97b87992..4e1e6c581e1d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,6 +41,74 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	u32 *id, len = xdp->data_end - xdp->data;
+	void *buffer;
+	int err = 0;
+
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	id = xskq_peek_id(xs->umem->fq);
+	if (!id)
+		return -ENOSPC;
+
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, *id, len,
+				      xs->umem->frame_headroom);
+	if (!err)
+		xskq_discard_id(xs->umem->fq);
+
+	return err;
+}
+
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (likely(!err))
+		xdp_return_buff(xdp);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+void xsk_flush(struct xdp_sock *xs)
+{
+	xskq_produce_flush_desc(xs->rx);
+	xs->sk.sk_data_ready(&xs->sk);
+}
+
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (!err)
+		xsk_flush(xs);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+static unsigned int xsk_poll(struct file *file, struct socket *sock,
+			     struct poll_table_struct *wait)
+{
+	unsigned int mask = datagram_poll(file, sock, wait);
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (xs->rx && !xskq_empty_desc(xs->rx))
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
 static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 			  bool umem_queue)
 {
@@ -179,6 +247,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
 		err = -EINVAL;
 		goto out_unlock;
+	} else {
+		/* This xsk has its own umem. */
+		xskq_set_umem(xs->umem->fq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -330,7 +401,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	sock_no_getname,
-	.poll =		sock_no_poll,
+	.poll =		xsk_poll,
 	.ioctl =	sock_no_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 9ddd2ee07a84..0a9b92b4f93a 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -20,6 +20,8 @@
 
 #include "xdp_umem_props.h"
 
+#define RX_BATCH_SIZE 16
+
 struct xsk_queue {
 	struct xdp_umem_props umem_props;
 	u32 ring_mask;
@@ -32,8 +34,118 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+/* Common functions operating for both RXTX and umem queues */
+
+static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
+{
+	u32 entries = q->prod_tail - q->cons_tail;
+
+	if (entries == 0) {
+		/* Refresh the local pointer */
+		q->prod_tail = READ_ONCE(q->ring->producer);
+		entries = q->prod_tail - q->cons_tail;
+	}
+
+	return (entries > dcnt) ? dcnt : entries;
+}
+
+static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
+{
+	u32 free_entries = q->nentries - (producer - q->cons_tail);
+
+	if (free_entries >= dcnt)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cons_tail = READ_ONCE(q->ring->consumer);
+	return q->nentries - (producer - q->cons_tail);
+}
+
+/* UMEM queue */
+
+static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx)
+{
+	if (unlikely(idx >= q->umem_props.nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+	return true;
+}
+
+static inline u32 *xskq_validate_id(struct xsk_queue *q)
+{
+	while (q->cons_tail != q->cons_head) {
+		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+		unsigned int idx = q->cons_tail & q->ring_mask;
+
+		if (xskq_is_valid_id(q, ring->desc[idx]))
+			return &ring->desc[idx];
+
+		q->cons_tail++;
+	}
+
+	return NULL;
+}
+
+static inline u32 *xskq_peek_id(struct xsk_queue *q)
+{
+	struct xdp_umem_ring *ring;
+
+	if (q->cons_tail == q->cons_head) {
+		WRITE_ONCE(q->ring->consumer, q->cons_tail);
+		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+		/* Order consumer and data */
+		smp_rmb();
+
+		return xskq_validate_id(q);
+	}
+
+	ring = (struct xdp_umem_ring *)q->ring;
+	return &ring->desc[q->cons_tail & q->ring_mask];
+}
+
+static inline void xskq_discard_id(struct xsk_queue *q)
+{
+	q->cons_tail++;
+	(void)xskq_validate_id(q);
+}
+
+/* Rx queue */
+
+static inline int xskq_produce_batch_desc(struct xsk_queue *q,
+					  u32 id, u32 len, u16 offset)
+{
+	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+	unsigned int idx;
+
+	if (xskq_nb_free(q, q->prod_head, 1) == 0)
+		return -ENOSPC;
+
+	idx = (q->prod_head++) & q->ring_mask;
+	ring->desc[idx].idx = id;
+	ring->desc[idx].len = len;
+	ring->desc[idx].offset = offset;
+
+	return 0;
+}
+
+static inline void xskq_produce_flush_desc(struct xsk_queue *q)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail = q->prod_head;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
+static inline bool xskq_empty_desc(struct xsk_queue *q)
+{
+	return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
+}
+
 void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
-void xskq_destroy(struct xsk_queue *q);
+void xskq_destroy(struct xsk_queue *q_ops);
 
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (5 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 06/15] xsk: add Rx receive functions and poll support Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-10-08 15:31   ` Eric Dumazet
  2018-05-02 11:01 ` [PATCH bpf-next v3 08/15] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to a certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than the one
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.
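
As a sketch, an XDP program that steers each frame to the socket
registered at the frame's Rx queue index could look like this (map
and section names are illustrative; bpf_helpers.h is the samples/bpf
helper header):

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsks_map = {
	.type		= BPF_MAP_TYPE_XSKMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= 4,
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	int index = ctx->rx_queue_index;

	/* If no socket is stored at this slot, the redirect fails
	 * and the frame is dropped.
	 */
	return bpf_redirect_map(&xsks_map, index, 0);
}

char _license[] SEC("license") = "GPL";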

v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/bpf.h       |  25 +++++
 include/linux/bpf_types.h |   3 +
 include/net/xdp_sock.h    |   7 ++
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/Makefile       |   3 +
 kernel/bpf/verifier.c     |   8 +-
 kernel/bpf/xskmap.c       | 239 ++++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.c             |   5 +
 8 files changed, 289 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/xskmap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c553f6f9c6b0..68ecdb4eea09 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -676,6 +676,31 @@ static inline int sock_map_prog(struct bpf_map *map,
 }
 #endif
 
+#if defined(CONFIG_XDP_SOCKETS)
+struct xdp_sock;
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key);
+int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
+		       struct xdp_sock *xs);
+void __xsk_map_flush(struct bpf_map *map);
+#else
+struct xdp_sock;
+static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
+						     u32 key)
+{
+	return NULL;
+}
+
+static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
+				     struct xdp_sock *xs)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void __xsk_map_flush(struct bpf_map *map)
+{
+}
+#endif
+
 /* verifier prototypes for helper functions called from eBPF programs */
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf6f6ae..d7df1b323082 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -49,4 +49,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
+#if defined(CONFIG_XDP_SOCKETS)
+BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
+#endif
 #endif
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index a0342dff6a4d..ce3a2ab16b8f 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -28,6 +28,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	struct list_head flush_node;
 	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
@@ -39,6 +40,7 @@ struct xdp_buff;
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -53,6 +55,11 @@ static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 static inline void xsk_flush(struct xdp_sock *xs)
 {
 }
+
+static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
+{
+	return false;
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8daef7326bb7..a3a495052511 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_XSKMAP,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 35c485fa9ea3..f27f5496d6fe 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,6 +8,9 @@ obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
+ifeq ($(CONFIG_XDP_SOCKETS),y)
+obj-$(CONFIG_BPF_SYSCALL) += xskmap.o
+endif
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 ifeq ($(CONFIG_STREAM_PARSER),y)
 ifeq ($(CONFIG_INET),y)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 712d8655e916..0d91f18b2eb5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2070,8 +2070,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
-	/* Restrict bpf side of cpumap, open when use-cases appear */
+	/* Restrict bpf side of cpumap and xskmap, open when use-cases
+	 * appear.
+	 */
 	case BPF_MAP_TYPE_CPUMAP:
+	case BPF_MAP_TYPE_XSKMAP:
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
@@ -2118,7 +2121,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_redirect_map:
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
-		    map->map_type != BPF_MAP_TYPE_CPUMAP)
+		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
+		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
 	case BPF_FUNC_sk_redirect_map:
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
new file mode 100644
index 000000000000..869dbb11b612
--- /dev/null
+++ b/kernel/bpf/xskmap.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XSKMAP used for AF_XDP sockets
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/capability.h>
+#include <net/xdp_sock.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+struct xsk_map {
+	struct bpf_map map;
+	struct xdp_sock **xsk_map;
+	struct list_head __percpu *flush_list;
+};
+
+static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
+{
+	int cpu, err = -EINVAL;
+	struct xsk_map *m;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size != 4 ||
+	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
+		return ERR_PTR(-EINVAL);
+
+	m = kzalloc(sizeof(*m), GFP_USER);
+	if (!m)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&m->map, attr);
+
+	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
+	cost += sizeof(struct list_head) * num_possible_cpus();
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_m;
+
+	m->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* Note: returns -EPERM if map size is larger than memlock limit */
+	err = bpf_map_precharge_memlock(m->map.pages);
+	if (err)
+		goto free_m;
+
+	m->flush_list = alloc_percpu(struct list_head);
+	if (!m->flush_list)
+		goto free_m;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(m->flush_list, cpu));
+
+	m->xsk_map = bpf_map_area_alloc(m->map.max_entries *
+					sizeof(struct xdp_sock *),
+					m->map.numa_node);
+	if (!m->xsk_map)
+		goto free_percpu;
+	return &m->map;
+
+free_percpu:
+	free_percpu(m->flush_list);
+free_m:
+	kfree(m);
+	return ERR_PTR(err);
+}
+
+static void xsk_map_free(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	int i;
+
+	synchronize_net();
+
+	for (i = 0; i < map->max_entries; i++) {
+		struct xdp_sock *xs;
+
+		xs = m->xsk_map[i];
+		if (!xs)
+			continue;
+
+		sock_put((struct sock *)xs);
+	}
+
+	free_percpu(m->flush_list);
+	bpf_map_area_free(m->xsk_map);
+	kfree(m);
+}
+
+static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = next_key;
+
+	if (index >= m->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == m->map.max_entries - 1)
+		return -ENOENT;
+	*next = index + 1;
+	return 0;
+}
+
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *xs;
+
+	if (key >= map->max_entries)
+		return NULL;
+
+	xs = READ_ONCE(m->xsk_map[key]);
+	return xs;
+}
+
+int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
+		       struct xdp_sock *xs)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct list_head *flush_list = this_cpu_ptr(m->flush_list);
+	int err;
+
+	err = xsk_rcv(xs, xdp);
+	if (err)
+		return err;
+
+	if (!xs->flush_node.prev)
+		list_add(&xs->flush_node, flush_list);
+
+	return 0;
+}
+
+void __xsk_map_flush(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct list_head *flush_list = this_cpu_ptr(m->flush_list);
+	struct xdp_sock *xs, *tmp;
+
+	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
+		xsk_flush(xs);
+		__list_del(xs->flush_node.prev, xs->flush_node.next);
+		xs->flush_node.prev = NULL;
+	}
+}
+
+static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
+static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
+			       u64 map_flags)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	u32 i = *(u32 *)key, fd = *(u32 *)value;
+	struct xdp_sock *xs, *old_xs;
+	struct socket *sock;
+	int err;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+	if (unlikely(i >= m->map.max_entries))
+		return -E2BIG;
+	if (unlikely(map_flags == BPF_NOEXIST))
+		return -EEXIST;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return err;
+
+	if (sock->sk->sk_family != PF_XDP) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	xs = (struct xdp_sock *)sock->sk;
+
+	if (!xsk_is_setup_for_bpf_map(xs)) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	sock_hold(sock->sk);
+
+	old_xs = xchg(&m->xsk_map[i], xs);
+	if (old_xs) {
+		/* Make sure we've flushed everything. */
+		synchronize_net();
+		sock_put((struct sock *)old_xs);
+	}
+
+	sockfd_put(sock);
+	return 0;
+}
+
+static int xsk_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *old_xs;
+	int k = *(u32 *)key;
+
+	if (k >= map->max_entries)
+		return -EINVAL;
+
+	old_xs = xchg(&m->xsk_map[k], NULL);
+	if (old_xs) {
+		/* Make sure we've flushed everything. */
+		synchronize_net();
+		sock_put((struct sock *)old_xs);
+	}
+
+	return 0;
+}
+
+const struct bpf_map_ops xsk_map_ops = {
+	.map_alloc = xsk_map_alloc,
+	.map_free = xsk_map_free,
+	.map_get_next_key = xsk_map_get_next_key,
+	.map_lookup_elem = xsk_map_lookup_elem,
+	.map_update_elem = xsk_map_update_elem,
+	.map_delete_elem = xsk_map_delete_elem,
+};
+
+
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e1e6c581e1d..b931a0db5588 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,6 +41,11 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
+{
+	return !!xs->rx;
+}
+
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	u32 *id, len = xdp->data_end - xdp->data;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 08/15] xsk: wire up XDP_DRV side of AF_XDP
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (6 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 09/15] xsk: wire up XDP_SKB " Björn Töpel
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to the XDP_DRV layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/core/filter.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d3781daa26ab..40d4bbb4508d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2801,7 +2801,8 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 {
 	int err;
 
-	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP: {
 		struct net_device *dev = fwd;
 		struct xdp_frame *xdpf;
 
@@ -2819,14 +2820,25 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);
-
-	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP) {
+		break;
+	}
+	case BPF_MAP_TYPE_CPUMAP: {
 		struct bpf_cpu_map_entry *rcpu = fwd;
 
 		err = cpu_map_enqueue(rcpu, xdp, dev_rx);
 		if (err)
 			return err;
 		__cpu_map_insert_ctx(map, index);
+		break;
+	}
+	case BPF_MAP_TYPE_XSKMAP: {
+		struct xdp_sock *xs = fwd;
+
+		err = __xsk_map_redirect(map, xdp, xs);
+		return err;
+	}
+	default:
+		break;
 	}
 	return 0;
 }
@@ -2845,6 +2857,9 @@ void xdp_do_flush_map(void)
 		case BPF_MAP_TYPE_CPUMAP:
 			__cpu_map_flush(map);
 			break;
+		case BPF_MAP_TYPE_XSKMAP:
+			__xsk_map_flush(map);
+			break;
 		default:
 			break;
 		}
@@ -2859,6 +2874,8 @@ static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 		return __dev_map_lookup_elem(map, index);
 	case BPF_MAP_TYPE_CPUMAP:
 		return __cpu_map_lookup_elem(map, index);
+	case BPF_MAP_TYPE_XSKMAP:
+		return __xsk_map_lookup_elem(map, index);
 	default:
 		return NULL;
 	}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 09/15] xsk: wire up XDP_SKB side of AF_XDP
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (7 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 08/15] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 10/15] xsk: add umem completion queue support and mmap Björn Töpel
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to the XDP_SKB layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h |  2 +-
 net/core/dev.c         | 35 +++++++++++++++++++----------------
 net/core/filter.c      | 17 ++++++++++++++---
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 64899c04c1a6..b7f81e3a70cb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -760,7 +760,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
  * This does not appear to be a real limitation for existing software.
  */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *prog);
+			    struct xdp_buff *xdp, struct bpf_prog *prog);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
diff --git a/net/core/dev.c b/net/core/dev.c
index 8f8931b93140..aea36b5a2fed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3994,12 +3994,12 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 }
 
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
 				     struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
 	u32 metalen, act = XDP_DROP;
-	struct xdp_buff xdp;
 	int hlen, off;
 	u32 mac_len;
 
@@ -4034,19 +4034,19 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
 	hlen = skb_headlen(skb) + mac_len;
-	xdp.data = skb->data - mac_len;
-	xdp.data_meta = xdp.data;
-	xdp.data_end = xdp.data + hlen;
-	xdp.data_hard_start = skb->data - skb_headroom(skb);
-	orig_data_end = xdp.data_end;
-	orig_data = xdp.data;
+	xdp->data = skb->data - mac_len;
+	xdp->data_meta = xdp->data;
+	xdp->data_end = xdp->data + hlen;
+	xdp->data_hard_start = skb->data - skb_headroom(skb);
+	orig_data_end = xdp->data_end;
+	orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
-	xdp.rxq = &rxqueue->xdp_rxq;
+	xdp->rxq = &rxqueue->xdp_rxq;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	off = xdp.data - orig_data;
+	off = xdp->data - orig_data;
 	if (off > 0)
 		__skb_pull(skb, off);
 	else if (off < 0)
@@ -4056,10 +4056,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* check if bpf_xdp_adjust_tail was used. it can only "shrink"
 	 * pckt.
 	 */
-	off = orig_data_end - xdp.data_end;
+	off = orig_data_end - xdp->data_end;
 	if (off != 0) {
-		skb_set_tail_pointer(skb, xdp.data_end - xdp.data);
+		skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
 		skb->len -= off;
+
 	}
 
 	switch (act) {
@@ -4068,7 +4069,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, mac_len);
 		break;
 	case XDP_PASS:
-		metalen = xdp.data - xdp.data_meta;
+		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);
 		break;
@@ -4118,17 +4119,19 @@ static struct static_key generic_xdp_needed __read_mostly;
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 {
 	if (xdp_prog) {
-		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		struct xdp_buff xdp;
+		u32 act;
 		int err;
 
+		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
 				err = xdp_do_generic_redirect(skb->dev, skb,
-							      xdp_prog);
+							      &xdp, xdp_prog);
 				if (err)
 					goto out_redir;
-			/* fallthru to submit skb */
+				break;
 			case XDP_TX:
 				generic_xdp_tx(skb, xdp_prog);
 				break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 40d4bbb4508d..120bc8a202d9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -59,6 +59,7 @@
 #include <net/tcp.h>
 #include <net/xfrm.h>
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2973,13 +2974,14 @@ static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
+				       struct xdp_buff *xdp,
 				       struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	unsigned long map_owner = ri->map_owner;
 	struct bpf_map *map = ri->map;
-	struct net_device *fwd = NULL;
 	u32 index = ri->ifindex;
+	void *fwd = NULL;
 	int err = 0;
 
 	ri->ifindex = 0;
@@ -3001,6 +3003,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 		if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
 			goto err;
 		skb->dev = fwd;
+		generic_xdp_tx(skb, xdp_prog);
+	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
+		struct xdp_sock *xs = fwd;
+
+		err = xsk_generic_rcv(xs, xdp);
+		if (err)
+			goto err;
+		consume_skb(skb);
 	} else {
 		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
 		err = -EBADRQC;
@@ -3015,7 +3025,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 }
 
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *xdp_prog)
+			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	u32 index = ri->ifindex;
@@ -3023,7 +3033,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	int err = 0;
 
 	if (ri->map)
-		return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
+		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog);
 
 	ri->ifindex = 0;
 	fwd = dev_get_by_index_rcu(dev_net(dev), index);
@@ -3037,6 +3047,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 	_trace_xdp_redirect(dev, xdp_prog, index);
+	generic_xdp_tx(skb, xdp_prog);
 	return 0;
 err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 10/15] xsk: add umem completion queue support and mmap
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (8 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 09/15] xsk: wire up XDP_SKB " Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 11/15] xsk: add Tx queue setup and mmap support Björn Töpel
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem),
called XDP_UMEM_COMPLETION_RING. Using this socket option, the
process can ask the kernel to allocate a queue (ring buffer) and also
mmap it into the process address space, using the
XDP_UMEM_PGOFF_COMPLETION_RING offset.

The queue is used to explicitly pass ownership of umem frames from
the kernel to the user process. This will be used by the TX path to
tell user space that a certain frame has been transmitted, so that
user space can reuse it for something else, if it wishes.
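
A user-space sketch of creating and mapping the completion ring
(descriptor count and helper name are illustrative; the ring layout
comes from the uapi structs added earlier in this series):

#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#define CQ_NUM_DESCS 1024	/* must be a power of two */

static void *setup_completion_ring(int xsk_fd)
{
	int descs = CQ_NUM_DESCS;

	if (setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
		       &descs, sizeof(descs)))
		return MAP_FAILED;

	/* Producer/consumer header followed by the frame id array. */
	return mmap(NULL, sizeof(struct xdp_umem_ring) +
		    CQ_NUM_DESCS * sizeof(__u32),
		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
		    xsk_fd, XDP_UMEM_PGOFF_COMPLETION_RING);
}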

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xdp_umem.c          | 7 ++++++-
 net/xdp/xdp_umem.h          | 1 +
 net/xdp/xsk.c               | 7 ++++++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e5091881f776..71581a139f26 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -36,6 +36,7 @@ struct sockaddr_xdp {
 #define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
+#define XDP_UMEM_COMPLETION_RING	5
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -47,6 +48,7 @@ struct xdp_umem_reg {
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
+#define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000
 
 struct xdp_desc {
 	__u32 idx;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9bac1ad570fa..881dfdefe235 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -70,6 +70,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->fq = NULL;
 	}
 
+	if (umem->cq) {
+		xskq_destroy(umem->cq);
+		umem->cq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
@@ -251,5 +256,5 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem)
 {
-	return umem->fq;
+	return (umem->fq && umem->cq);
 }
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index c7378a11721f..7e0b2fab8522 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -24,6 +24,7 @@
 
 struct xdp_umem {
 	struct xsk_queue *fq;
+	struct xsk_queue *cq;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b931a0db5588..f4a2c5bc6da9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -255,6 +255,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else {
 		/* This xsk has its own umem. */
 		xskq_set_umem(xs->umem->fq, &xs->umem->props);
+		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -334,6 +335,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return 0;
 	}
 	case XDP_UMEM_FILL_RING:
+	case XDP_UMEM_COMPLETION_RING:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -345,7 +347,8 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->umem->fq;
+		q = (optname == XDP_UMEM_FILL_RING) ? &xs->umem->fq :
+			&xs->umem->cq;
 		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -375,6 +378,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 		if (offset == XDP_UMEM_PGOFF_FILL_RING)
 			q = xs->umem->fq;
+		else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING)
+			q = xs->umem->cq;
 		else
 			return -EINVAL;
 	}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 11/15] xsk: add Tx queue setup and mmap support
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (9 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 10/15] xsk: add umem completion queue support and mmap Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 12/15] dev: packet: make packet_direct_xmit a common function Björn Töpel
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Another setsockopt (XDP_TX_RING) is added to let the process allocate
a queue through which the user process can pass frames to be
transmitted by the kernel.

The mmapping of the queue is done using the XDP_PGOFF_TX_RING offset.
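
Analogous to the fill and completion rings, a user-space sketch
(descriptor count and helper name are illustrative):

#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#define TX_NUM_DESCS 1024	/* must be a power of two */

static void *setup_tx_ring(int xsk_fd)
{
	int descs = TX_NUM_DESCS;

	if (setsockopt(xsk_fd, SOL_XDP, XDP_TX_RING,
		       &descs, sizeof(descs)))
		return MAP_FAILED;

	/* Producer/consumer header followed by the Tx descriptors. */
	return mmap(NULL, sizeof(struct xdp_rxtx_ring) +
		    TX_NUM_DESCS * sizeof(struct xdp_desc),
		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
		    xsk_fd, XDP_PGOFF_TX_RING);
}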

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h      | 1 +
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xsk.c               | 8 ++++++--
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ce3a2ab16b8f..185f4928fbda 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -30,6 +30,7 @@ struct xdp_sock {
 	struct xdp_umem *umem;
 	struct list_head flush_node;
 	u16 queue_id;
+	struct xsk_queue *tx ____cacheline_aligned_in_smp;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	u64 rx_dropped;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 71581a139f26..e2ea878d025c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -34,6 +34,7 @@ struct sockaddr_xdp {
 
 /* XDP socket options */
 #define XDP_RX_RING			1
+#define XDP_TX_RING			2
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
@@ -47,6 +48,7 @@ struct xdp_umem_reg {
 
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
+#define XDP_PGOFF_TX_RING		 0x80000000
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
 #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f4a2c5bc6da9..2d7b0c90d996 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -206,7 +206,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_release;
 	}
 
-	if (!xs->rx) {
+	if (!xs->rx && !xs->tx) {
 		err = -EINVAL;
 		goto out_unlock;
 	}
@@ -291,6 +291,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case XDP_RX_RING:
+	case XDP_TX_RING:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -301,7 +302,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->rx;
+		q = (optname == XDP_TX_RING) ? &xs->tx : &xs->rx;
 		err = xsk_init_queue(entries, q, false);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -372,6 +373,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 	if (offset == XDP_PGOFF_RX_RING) {
 		q = xs->rx;
+	} else if (offset == XDP_PGOFF_TX_RING) {
+		q = xs->tx;
 	} else {
 		if (!xs->umem)
 			return -EINVAL;
@@ -431,6 +434,7 @@ static void xsk_destruct(struct sock *sk)
 		return;
 
 	xskq_destroy(xs->rx);
+	xskq_destroy(xs->tx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 12/15] dev: packet: make packet_direct_xmit a common function
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (10 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 11/15] xsk: add Tx queue setup and mmap support Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 13/15] xsk: support for Tx Björn Töpel
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

The new dev_direct_xmit will be used by AF_XDP in later commits.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 38 ++++++++++++++++++++++++++++++++++++++
 net/packet/af_packet.c    | 42 +++++-------------------------------------
 3 files changed, 44 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 366c32891158..a30435118530 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2486,6 +2486,7 @@ void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
 void unregister_netdevice_many(struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index aea36b5a2fed..d3fdc86516e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3625,6 +3625,44 @@ int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 
+int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
+{
+	struct net_device *dev = skb->dev;
+	struct sk_buff *orig_skb = skb;
+	struct netdev_queue *txq;
+	int ret = NETDEV_TX_BUSY;
+	bool again = false;
+
+	if (unlikely(!netif_running(dev) ||
+		     !netif_carrier_ok(dev)))
+		goto drop;
+
+	skb = validate_xmit_skb_list(skb, dev, &again);
+	if (skb != orig_skb)
+		goto drop;
+
+	skb_set_queue_mapping(skb, queue_id);
+	txq = skb_get_tx_queue(dev, skb);
+
+	local_bh_disable();
+
+	HARD_TX_LOCK(dev, txq, smp_processor_id());
+	if (!netif_xmit_frozen_or_drv_stopped(txq))
+		ret = netdev_start_xmit(skb, dev, txq, false);
+	HARD_TX_UNLOCK(dev, txq);
+
+	local_bh_enable();
+
+	if (!dev_xmit_complete(ret))
+		kfree_skb(skb);
+
+	return ret;
+drop:
+	atomic_long_inc(&dev->tx_dropped);
+	kfree_skb_list(skb);
+	return NET_XMIT_DROP;
+}
+EXPORT_SYMBOL(dev_direct_xmit);
 
 /*************************************************************************
  *			Receiver routines
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 01f3515cada0..611a26d5235c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -209,7 +209,7 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
 static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
 		struct tpacket3_hdr *);
 static void packet_flush_mclist(struct sock *sk);
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb);
+static u16 packet_pick_tx_queue(struct sk_buff *skb);
 
 struct packet_skb_cb {
 	union {
@@ -243,40 +243,7 @@ static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
 static int packet_direct_xmit(struct sk_buff *skb)
 {
-	struct net_device *dev = skb->dev;
-	struct sk_buff *orig_skb = skb;
-	struct netdev_queue *txq;
-	int ret = NETDEV_TX_BUSY;
-	bool again = false;
-
-	if (unlikely(!netif_running(dev) ||
-		     !netif_carrier_ok(dev)))
-		goto drop;
-
-	skb = validate_xmit_skb_list(skb, dev, &again);
-	if (skb != orig_skb)
-		goto drop;
-
-	packet_pick_tx_queue(dev, skb);
-	txq = skb_get_tx_queue(dev, skb);
-
-	local_bh_disable();
-
-	HARD_TX_LOCK(dev, txq, smp_processor_id());
-	if (!netif_xmit_frozen_or_drv_stopped(txq))
-		ret = netdev_start_xmit(skb, dev, txq, false);
-	HARD_TX_UNLOCK(dev, txq);
-
-	local_bh_enable();
-
-	if (!dev_xmit_complete(ret))
-		kfree_skb(skb);
-
-	return ret;
-drop:
-	atomic_long_inc(&dev->tx_dropped);
-	kfree_skb_list(skb);
-	return NET_XMIT_DROP;
+	return dev_direct_xmit(skb, packet_pick_tx_queue(skb));
 }
 
 static struct net_device *packet_cached_dev_get(struct packet_sock *po)
@@ -313,8 +280,9 @@ static u16 __packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 	return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
 }
 
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
+static u16 packet_pick_tx_queue(struct sk_buff *skb)
 {
+	struct net_device *dev = skb->dev;
 	const struct net_device_ops *ops = dev->netdev_ops;
 	u16 queue_index;
 
@@ -326,7 +294,7 @@ static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 		queue_index = __packet_pick_tx_queue(dev, skb);
 	}
 
-	skb_set_queue_mapping(skb, queue_index);
+	return queue_index;
 }
 
 /* __register_prot_hook must be invoked through register_prot_hook
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 13/15] xsk: support for Tx
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (11 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 12/15] dev: packet: make packet_direct_xmit a common function Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 14/15] xsk: statistics support Björn Töpel
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, Tx support is added. The user fills the Tx queue with frames to
be sent by the kernel, and lets the kernel know using the sendmsg
syscall.
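
From user space, kicking the Tx path then amounts to an empty
sendmsg (a sketch; only non-blocking mode is supported by this patch,
so MSG_DONTWAIT is required):

#include <sys/socket.h>

/* Tell the kernel that there are descriptors on the Tx ring. */
static int kick_tx(int xsk_fd)
{
	struct msghdr msg = {};

	return sendmsg(xsk_fd, &msg, MSG_DONTWAIT);
}

Completed frame ids then show up on the umem completion ring for
reuse.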

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk.c       | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 net/xdp/xsk_queue.h |  93 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 200 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2d7b0c90d996..b33c535c7996 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -36,6 +36,8 @@
 #include "xsk_queue.h"
 #include "xdp_umem.h"
 
+#define TX_BATCH_SIZE 16
+
 static struct xdp_sock *xdp_sk(struct sock *sk)
 {
 	return (struct xdp_sock *)sk;
@@ -101,6 +103,108 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+static void xsk_destruct_skb(struct sk_buff *skb)
+{
+	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	WARN_ON_ONCE(xskq_produce_id(xs->umem->cq, id));
+
+	sock_wfree(skb);
+}
+
+static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
+			    size_t total_len)
+{
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
+	u32 max_batch = TX_BATCH_SIZE;
+	struct xdp_sock *xs = xdp_sk(sk);
+	bool sent_frame = false;
+	struct xdp_desc desc;
+	struct sk_buff *skb;
+	int err = 0;
+
+	if (unlikely(!xs->tx))
+		return -ENOBUFS;
+	if (need_wait)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xs->mutex);
+
+	while (xskq_peek_desc(xs->tx, &desc)) {
+		char *buffer;
+		u32 id, len;
+
+		if (max_batch-- == 0) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		if (xskq_reserve_id(xs->umem->cq)) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		len = desc.len;
+		if (unlikely(len > xs->dev->mtu)) {
+			err = -EMSGSIZE;
+			goto out;
+		}
+
+		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		if (unlikely(!skb)) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		skb_put(skb, len);
+		id = desc.idx;
+		buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err)) {
+			kfree_skb(skb);
+			goto out;
+		}
+
+		skb->dev = xs->dev;
+		skb->priority = sk->sk_priority;
+		skb->mark = sk->sk_mark;
+		skb_shinfo(skb)->destructor_arg = (void *)(long)id;
+		skb->destructor = xsk_destruct_skb;
+
+		err = dev_direct_xmit(skb, xs->queue_id);
+		/* Ignore NET_XMIT_CN as packet might have been sent */
+		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
+			err = -EAGAIN;
+			/* SKB consumed by dev_direct_xmit() */
+			goto out;
+		}
+
+		sent_frame = true;
+		xskq_discard_desc(xs->tx);
+	}
+
+out:
+	if (sent_frame)
+		sk->sk_write_space(sk);
+
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
+static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (unlikely(!xs->dev))
+		return -ENXIO;
+	if (unlikely(!(xs->dev->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	return xsk_generic_xmit(sk, m, total_len);
+}
+
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
 {
@@ -110,6 +214,8 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
 
 	if (xs->rx && !xskq_empty_desc(xs->rx))
 		mask |= POLLIN | POLLRDNORM;
+	if (xs->tx && !xskq_full_desc(xs->tx))
+		mask |= POLLOUT | POLLWRNORM;
 
 	return mask;
 }
@@ -270,6 +376,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->queue_id = sxdp->sxdp_queue_id;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
+	xskq_set_umem(xs->tx, &xs->umem->props);
 
 out_unlock:
 	if (err)
@@ -383,8 +490,6 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 			q = xs->umem->fq;
 		else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING)
 			q = xs->umem->cq;
-		else
-			return -EINVAL;
 	}
 
 	if (!q)
@@ -420,7 +525,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
-	.sendmsg =	sock_no_sendmsg,
+	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 0a9b92b4f93a..3497e8808608 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -111,7 +111,93 @@ static inline void xskq_discard_id(struct xsk_queue *q)
 	(void)xskq_validate_id(q);
 }
 
-/* Rx queue */
+static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	ring->desc[q->prod_tail++ & q->ring_mask] = id;
+
+	/* Order producer and data */
+	smp_wmb();
+
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+	return 0;
+}
+
+static inline int xskq_reserve_id(struct xsk_queue *q)
+{
+	if (xskq_nb_free(q, q->prod_head, 1) == 0)
+		return -ENOSPC;
+
+	q->prod_head++;
+	return 0;
+}
+
+/* Rx/Tx queue */
+
+static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
+{
+	u32 buff_len;
+
+	if (unlikely(d->idx >= q->umem_props.nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	buff_len = q->umem_props.frame_size;
+	if (unlikely(d->len > buff_len || d->len == 0 ||
+		     d->offset > buff_len || d->offset + d->len > buff_len)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	return true;
+}
+
+static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
+						  struct xdp_desc *desc)
+{
+	while (q->cons_tail != q->cons_head) {
+		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+		unsigned int idx = q->cons_tail & q->ring_mask;
+
+		if (xskq_is_valid_desc(q, &ring->desc[idx])) {
+			if (desc)
+				*desc = ring->desc[idx];
+			return desc;
+		}
+
+		q->cons_tail++;
+	}
+
+	return NULL;
+}
+
+static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
+					      struct xdp_desc *desc)
+{
+	struct xdp_rxtx_ring *ring;
+
+	if (q->cons_tail == q->cons_head) {
+		WRITE_ONCE(q->ring->consumer, q->cons_tail);
+		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+		/* Order consumer and data */
+		smp_rmb();
+
+		return xskq_validate_desc(q, desc);
+	}
+
+	ring = (struct xdp_rxtx_ring *)q->ring;
+	*desc = ring->desc[q->cons_tail & q->ring_mask];
+	return desc;
+}
+
+static inline void xskq_discard_desc(struct xsk_queue *q)
+{
+	q->cons_tail++;
+	(void)xskq_validate_desc(q, NULL);
+}
 
 static inline int xskq_produce_batch_desc(struct xsk_queue *q,
 					  u32 id, u32 len, u16 offset)
@@ -139,6 +225,11 @@ static inline void xskq_produce_flush_desc(struct xsk_queue *q)
 	WRITE_ONCE(q->ring->producer, q->prod_tail);
 }
 
+static inline bool xskq_full_desc(struct xsk_queue *q)
+{
+	return (xskq_nb_avail(q, q->nentries) == q->nentries);
+}
+
 static inline bool xskq_empty_desc(struct xsk_queue *q)
 {
 	return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 14/15] xsk: statistics support
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (12 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 13/15] xsk: support for Tx Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 11:01 ` [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets Björn Töpel
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit, a new getsockopt is added: XDP_STATISTICS. This is
used to obtain stats from the sockets.
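
A sketch of how an application could read the counters (hypothetical
user-space snippet; fd is assumed to be an already created AF_XDP
socket, and error handling is omitted):

	struct xdp_statistics stats;
	socklen_t optlen = sizeof(stats);

	if (getsockopt(fd, SOL_XDP, XDP_STATISTICS, &stats, &optlen) == 0)
		printf("rx_dropped: %llu rx_invalid: %llu tx_invalid: %llu\n",
		       (unsigned long long)stats.rx_dropped,
		       (unsigned long long)stats.rx_invalid_descs,
		       (unsigned long long)stats.tx_invalid_descs);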

v2: getsockopt now returns size of stats structure.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  7 +++++++
 net/xdp/xsk.c               | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h         |  5 +++++
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e2ea878d025c..77b88c4efe98 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -38,6 +38,7 @@ struct sockaddr_xdp {
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
+#define XDP_STATISTICS			6
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -46,6 +47,12 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+struct xdp_statistics {
+	__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+};
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_PGOFF_TX_RING		 0x80000000
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b33c535c7996..009c5af5bba5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -468,6 +468,49 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int len;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case XDP_STATISTICS:
+	{
+		struct xdp_statistics stats;
+
+		if (len < sizeof(stats))
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		stats.rx_dropped = xs->rx_dropped;
+		stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
+		stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
+		mutex_unlock(&xs->mutex);
+
+		if (copy_to_user(optval, &stats, sizeof(stats)))
+			return -EFAULT;
+		if (put_user(sizeof(stats), optlen))
+			return -EFAULT;
+
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int xsk_mmap(struct file *file, struct socket *sock,
 		    struct vm_area_struct *vma)
 {
@@ -524,7 +567,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.getsockopt =	xsk_getsockopt,
 	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3497e8808608..7aa9a535db0e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -36,6 +36,11 @@ struct xsk_queue {
 
 /* Common functions operating for both RXTX and umem queues */
 
+static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
+{
+	return q ? q->invalid_descs : 0;
+}
+
 static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 {
 	u32 entries = q->prod_tail - q->cons_tail;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (13 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 14/15] xsk: statistics support Björn Töpel
@ 2018-05-02 11:01 ` Björn Töpel
  2018-05-02 20:59   ` Jesper Dangaard Brouer
  2018-05-03 13:55 ` [PATCH bpf-next v3 00/15] Introducing AF_XDP support Willem de Bruijn
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 47+ messages in thread
From: Björn Töpel @ 2018-05-02 11:01 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	Björn Töpel

From: Magnus Karlsson <magnus.karlsson@intel.com>

This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.

To showcase a simple round-robin load-balancing between a set of
sockets in an xskmap, set the RR_LB compile time define option to 1 in
"xdpsock.h".

v2: The entries variable was calculated twice in {umem,xq}_nb_avail.

Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 Documentation/networking/af_xdp.rst | 297 +++++++++++
 Documentation/networking/index.rst  |   1 +
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
 6 files changed, 1317 insertions(+)
 create mode 100644 Documentation/networking/af_xdp.rst
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
new file mode 100644
index 000000000000..91928d9ee4bf
--- /dev/null
+++ b/Documentation/networking/af_xdp.rst
@@ -0,0 +1,297 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+AF_XDP
+======
+
+Overview
+========
+
+AF_XDP is an address family that is optimized for high performance
+packet processing.
+
+This document assumes that the reader is familiar with BPF and XDP. If
+not, the Cilium project has an excellent reference guide at
+http://cilium.readthedocs.io/en/doc-1.0/bpf/.
+
+With the XDP_REDIRECT action, an XDP program can redirect ingress
+frames to other XDP-enabled netdevs using the bpf_redirect_map()
+function. AF_XDP sockets make it possible for XDP programs to redirect
+frames to a memory buffer in a user-space application.
+
+An AF_XDP socket (XSK) is created with the normal socket()
+syscall. Associated with each XSK are two rings: the RX ring and the
+TX ring. A socket can receive packets on the RX ring and it can send
+packets on the TX ring. These rings are registered and sized with the
+setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
+to have at least one of these rings for each socket. An RX or TX
+descriptor ring points to a data buffer in a memory area called a
+UMEM. RX and TX can share the same UMEM so that a packet does not have
+to be copied between RX and TX. Moreover, if a packet needs to be kept
+for a while due to a possible retransmit, the descriptor that points
+to that packet can be changed to point to another and reused right
+away. This again avoids copying data.
+
+The UMEM consists of a number of equally sized frames and each frame
+has a unique frame id. A descriptor in one of the rings references a
+frame by referencing its frame id. The user space allocates memory for
+this UMEM using whatever means it feels is most appropriate (malloc,
+mmap, huge pages, etc). This memory area is then registered with the
+kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
+rings: the FILL ring and the COMPLETION ring. The fill ring is used by
+the application to send down frame ids for the kernel to fill in with
+RX packet data. References to these frames will then appear in the RX
+ring once each packet has been received. The completion ring, on the
+other hand, contains frame ids that the kernel has transmitted
+completely and can now be used again by user space, for either TX or
+RX. Thus, the frame ids appearing in the completion ring are ids that
+were previously transmitted using the TX ring. In summary, the RX and
+FILL rings are used for the RX path and the TX and COMPLETION rings
+are used for the TX path.
+
+The socket is then finally bound with a bind() call to a device and a
+specific queue id on that device, and it is not until bind is
+completed that traffic starts to flow.
+
+The UMEM can be shared between processes, if desired. If a process
+wants to do this, it simply skips the registration of the UMEM and its
+corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
+call and submits the XSK of the process it would like to share UMEM
+with as well as its own newly created XSK socket. The new process will
+then receive frame id references in its own RX ring that point to this
+shared UMEM. Note that since the ring structures are single-consumer /
+single-producer (for performance reasons), the new process has to
+create its own socket with associated RX and TX rings, since it cannot
+share this with the other process. This is also the reason that there
+is only one set of FILL and COMPLETION rings per UMEM. It is the
+responsibility of a single process to handle the UMEM.
+
+How are packets then distributed from an XDP program to the XSKs? There
+is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
+user-space application can place an XSK at an arbitrary place in this
+map. The XDP program can then redirect a packet to a specific index in
+this map and at this point XDP validates that the XSK in that map was
+indeed bound to that device and queue id. If not, the packet is
+dropped. If the map is empty at that index, the packet is also
+dropped. This also means that it is currently mandatory to have an XDP
+program loaded (and one XSK in the XSKMAP) to be able to get any
+traffic to user space through the XSK.
+
+AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
+driver does not have support for XDP, or XDP_SKB is explicitly chosen
+when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
+together with the generic XDP support and copies out the data to user
+space; this is a fallback mode that works for any network device. On
+the other hand, if the driver has support for XDP, it will be used by
+the AF_XDP code to provide better performance, but there is still a
+copy of the data into user space.
+
+Concepts
+========
+
+In order to use an AF_XDP socket, a number of associated objects need
+to be set up.
+
+Jonathan Corbet has also written an excellent article on LWN,
+"Accelerating networking with AF_XDP". It can be found at
+https://lwn.net/Articles/750845/.
+
+UMEM
+----
+
+UMEM is a region of virtually contiguous memory, divided into
+equal-sized frames. A UMEM is associated with a netdev and a specific
+queue id of that netdev. It is created and configured (frame size,
+frame headroom, start address and size) using the XDP_UMEM_REG
+setsockopt system call, and bound to a netdev and queue id via the
+bind() system call.
+
+An AF_XDP socket is linked to a single UMEM, but one UMEM can have
+multiple AF_XDP sockets. To share a UMEM created via one socket A,
+the next socket B can do this by setting the XDP_SHARED_UMEM flag in
+struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
+of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
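+
+A minimal sketch of the bind call for the sharing socket B (where
+fd_a, fd_b, ifindex and queue_id are hypothetical names; ifindex and
+queue_id must match what socket A is bound to) could look like::
+
+    struct sockaddr_xdp sxdp = {};
+
+    sxdp.sxdp_family = AF_XDP;
+    sxdp.sxdp_ifindex = ifindex;
+    sxdp.sxdp_queue_id = queue_id;
+    sxdp.sxdp_flags = XDP_SHARED_UMEM;
+    sxdp.sxdp_shared_umem_fd = fd_a;
+
+    bind(fd_b, (struct sockaddr *)&sxdp, sizeof(sxdp));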
+
+The UMEM has two single-producer/single-consumer rings that are used
+to transfer ownership of UMEM frames between the kernel and the
+user-space application.
+
+Rings
+-----
+
+There are four different kinds of rings: Fill, Completion, RX and
+TX. All rings are single-producer/single-consumer, so the user-space
+application needs explicit synchronization if multiple
+processes/threads are reading from or writing to them.
+
+The UMEM uses two rings: Fill and Completion. Each socket associated
+with the UMEM must have an RX ring, a TX ring or both. Say that there
+is a setup with four sockets (all doing TX and RX). Then there will be
+one Fill ring, one Completion ring, four TX rings and four RX rings.
+
+The rings are head(producer)/tail(consumer) based rings. A producer
+writes the data ring at the index pointed out by the struct xdp_ring
+producer member, and then increases the producer index. A consumer
+reads the data ring at the index pointed out by the struct xdp_ring
+consumer member, and then increases the consumer index.
+
+The rings are configured and created via the _RING setsockopt system
+calls and mmapped to user-space using the appropriate offset to mmap()
+(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
+XDP_UMEM_PGOFF_COMPLETION_RING).
+
+The size of each ring needs to be a power of two.
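+
+As a sketch, mapping the RX ring of an already configured socket fd
+(fd and NUM_DESCS are assumed names here, mirroring what the sample
+application does) could look like::
+
+    struct xdp_rxtx_ring *rx_ring;
+
+    rx_ring = mmap(NULL, sizeof(struct xdp_ring) +
+                   NUM_DESCS * sizeof(struct xdp_desc),
+                   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
+                   fd, XDP_PGOFF_RX_RING);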
+
+UMEM Fill Ring
+~~~~~~~~~~~~~~
+
+The Fill ring is used to transfer ownership of UMEM frames from
+user-space to kernel-space. The UMEM indices are passed in the
+ring. As an example, if the UMEM is 64k and each frame is 4k, then the
+UMEM has 16 frames and can pass indices between 0 and 15.
+
+Frames passed to the kernel are used for the ingress path (RX rings).
+
+The user application produces UMEM indices to this ring.
+
+UMEM Completion Ring
+~~~~~~~~~~~~~~~~~~~~
+
+The Completion Ring is used to transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the Fill ring, UMEM indices are
+used.
+
+Frames passed from the kernel to user-space are frames that have been
+sent (TX ring) and can be used by user-space again.
+
+The user application consumes UMEM indices from this ring.
+
+
+RX Ring
+~~~~~~~
+
+The RX ring is the receiving side of a socket. Each entry in the ring
+is a struct xdp_desc descriptor. The descriptor contains the UMEM
+index (idx), the length of the data (len) and the offset into the
+frame (offset).
+
+If no frames have been passed to the kernel via the Fill ring, no
+descriptors will (or can) appear on the RX ring.
+
+The user application consumes struct xdp_desc descriptors from this
+ring.
+
+TX Ring
+~~~~~~~
+
+The TX ring is used to send frames. The struct xdp_desc descriptor is
+filled (index, length and offset) and passed into the ring.
+
+To start the transfer, a sendmsg() system call is required. This might
+be relaxed in the future.
+
+The user application produces struct xdp_desc descriptors to this
+ring.
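+
+A sketch of such a kick, assuming fd is the socket file descriptor: a
+zero-length send is enough, since the descriptors already sit in the
+TX ring (this is what the sample application does)::
+
+    sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);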
+
+XSKMAP / BPF_MAP_TYPE_XSKMAP
+----------------------------
+
+On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
+is used in conjunction with bpf_redirect_map() to pass the ingress
+frame to a socket.
+
+The user application inserts the socket into the map, via the bpf()
+system call.
+
+Note that if an XDP program tries to redirect to a socket that does
+not match the queue configuration and netdev, the frame will be
+dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
+queue 17. Only the XDP program executing for eth0 and queue 17 will
+successfully pass data to the socket. Please refer to the sample
+application (samples/bpf/) for an example.
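+
+A sketch of the insertion, using the bpf_map_update_elem() helper from
+tools/lib/bpf (xsks_map_fd and xsk_fd are assumed names for the XSKMAP
+file descriptor and the AF_XDP socket file descriptor)::
+
+    int key = 0;
+
+    bpf_map_update_elem(xsks_map_fd, &key, &xsk_fd, 0);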
+
+Usage
+=====
+
+In order to use AF_XDP sockets, two parts are needed: the
+user-space application and the XDP program. For a complete setup and
+usage example, please refer to the sample application. The user-space
+side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+
+Naive ring dequeue and enqueue could look like this::
+
+    // typedef struct xdp_rxtx_ring RING;
+    // typedef struct xdp_umem_ring RING;
+
+    // typedef struct xdp_desc RING_TYPE;
+    // typedef __u32 RING_TYPE;
+
+    int dequeue_one(RING *ring, RING_TYPE *item)
+    {
+        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+
+        if (entries == 0)
+            return -1;
+
+        // read-barrier!
+
+        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
+        ring->ptrs.consumer++;
+        return 0;
+    }
+
+    int enqueue_one(RING *ring, const RING_TYPE *item)
+    {
+        __u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+
+        if (free_entries == 0)
+            return -1;
+
+        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+
+        // write-barrier!
+
+        ring->ptrs.producer++;
+        return 0;
+    }
+
+
+For a more optimized version, please refer to the sample application.
+
+Sample application
+==================
+
+There is an xdpsock benchmarking/test application included that
+demonstrates how to use AF_XDP sockets with both private and shared
+UMEMs. Say that you would like your UDP traffic from port 4242 to end
+up in queue 16, which we will enable AF_XDP on. Here, we use ethtool
+for this::
+
+      ethtool -N p3p2 rx-flow-hash udp4 fn
+      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
+          action 16
+
+Running the rxdrop benchmark in XDP_DRV mode can then be done
+using::
+
+      samples/bpf/xdpsock -i p3p2 -q 16 -r -N
+
+For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
+can be displayed with "-h", as usual.
+
+Credits
+=======
+
+- Björn Töpel (AF_XDP core)
+- Magnus Karlsson (AF_XDP core)
+- Alexander Duyck
+- Alexei Starovoitov
+- Daniel Borkmann
+- Jesper Dangaard Brouer
+- John Fastabend
+- Jonathan Corbet (LWN coverage)
+- Michael S. Tsirkin
+- Qi Z Zhang
+- Willem de Bruijn
+
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index f204eaff657d..cbd9bdd4a79e 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,6 +6,7 @@ Contents:
 .. toctree::
    :maxdepth: 2
 
+   af_xdp
    batman-adv
    can
    dpaa2/index
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 5e31770ac087..8e0c7fb6d7cc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -98,6 +99,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
new file mode 100644
index 000000000000..533ab81adfa1
--- /dev/null
+++ b/samples/bpf/xdpsock.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef XDPSOCK_H_
+#define XDPSOCK_H_
+
+/* Power-of-2 number of sockets */
+#define MAX_SOCKS 4
+
+/* Round-robin receive */
+#define RR_LB 0
+
+#endif /* XDPSOCK_H_ */
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..d8806c41362e
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#include "xdpsock.h"
+
+struct bpf_map_def SEC("maps") qidconf_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(int),
+	.max_entries	= 1,
+};
+
+struct bpf_map_def SEC("maps") xsks_map = {
+	.type = BPF_MAP_TYPE_XSKMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") rr_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(unsigned int),
+	.max_entries = 1,
+};
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	int *qidconf, key = 0, idx;
+	unsigned int *rr;
+
+	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
+	if (!qidconf)
+		return XDP_ABORTED;
+
+	if (*qidconf != ctx->rx_queue_index)
+		return XDP_PASS;
+
+#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
+	rr = bpf_map_lookup_elem(&rr_map, &key);
+	if (!rr)
+		return XDP_ABORTED;
+
+	*rr = (*rr + 1) & (MAX_SOCKS - 1);
+	idx = *rr;
+#else
+	idx = 0;
+#endif
+
+	return bpf_redirect_map(&xsks_map, idx, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
new file mode 100644
index 000000000000..4b8a7cf3e63b
--- /dev/null
+++ b/samples/bpf/xdpsock_user.c
@@ -0,0 +1,948 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2017 - 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/ethernet.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+#include <sys/types.h>
+#include <poll.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define NUM_FRAMES 131072
+#define FRAME_HEADROOM 0
+#define FRAME_SIZE 2048
+#define NUM_DESCS 1024
+#define BATCH_SIZE 16
+
+#define FQ_NUM_DESCS 1024
+#define CQ_NUM_DESCS 1024
+
+#define DEBUG_HEXDUMP 0
+
+typedef __u32 u32;
+
+static unsigned long prev_time;
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static u32 opt_xdp_flags;
+static const char *opt_if = "";
+static int opt_ifindex;
+static int opt_queue;
+static int opt_poll;
+static int opt_shared_packet_buffer;
+static int opt_interval = 1;
+
+struct xdp_umem_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_umem_ring *ring;
+};
+
+struct xdp_umem {
+	char (*frames)[FRAME_SIZE];
+	struct xdp_umem_uqueue fq;
+	struct xdp_umem_uqueue cq;
+	int fd;
+};
+
+struct xdp_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_rxtx_ring *ring;
+};
+
+struct xdpsock {
+	struct xdp_uqueue rx;
+	struct xdp_uqueue tx;
+	int sfd;
+	struct xdp_umem *umem;
+	u32 outstanding_tx;
+	unsigned long rx_npkts;
+	unsigned long tx_npkts;
+	unsigned long prev_rx_npkts;
+	unsigned long prev_tx_npkts;
+};
+
+#define MAX_SOCKS 4
+static int num_socks;
+struct xdpsock *xsks[MAX_SOCKS];
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
+static void dump_stats(void);
+
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
+				#expr ": errno: %d/\"%s\"\n",		\
+				__FILE__, __func__, __LINE__,		\
+				errno, strerror(errno));		\
+			dump_stats();					\
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+#define u_smp_rmb() barrier()
+#define u_smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+static const char pkt_data[] =
+	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+
+	if (free_entries >= nb)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer;
+
+	return q->size - (q->cached_prod - q->cached_cons);
+}
+
+static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 free_entries = q->cached_cons - q->cached_prod;
+
+	if (free_entries >= ndescs)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
+	return q->cached_cons - q->cached_prod;
+}
+
+static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0) {
+		q->cached_prod = q->ring->ptrs.producer;
+		entries = q->cached_prod - q->cached_cons;
+	}
+
+	return (entries > nb) ? nb : entries;
+}
+
+static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0) {
+		q->cached_prod = q->ring->ptrs.producer;
+		entries = q->cached_prod - q->cached_cons;
+	}
+
+	return (entries > ndescs) ? ndescs : entries;
+}
+
+static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
+					 struct xdp_desc *d,
+					 size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i].idx;
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
+				      size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i];
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
+					       u32 *d, size_t nb)
+{
+	u32 idx, i, entries = umem_nb_avail(cq, nb);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = cq->cached_cons++ & cq->mask;
+		d[i] = cq->ring->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		cq->ring->ptrs.consumer = cq->cached_cons;
+	}
+
+	return entries;
+}
+
+static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
+{
+	lassert(idx < NUM_FRAMES);
+	return &xsk->umem->frames[idx][off];
+}
+
+static inline int xq_enq(struct xdp_uqueue *uq,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 idx = uq->cached_prod++ & uq->mask;
+
+		r->desc[idx].idx = descs[i].idx;
+		r->desc[idx].len = descs[i].len;
+		r->desc[idx].offset = descs[i].offset;
+	}
+
+	u_smp_wmb();
+
+	r->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
+				 __u32 idx, unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *q = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		/* Do not shadow the frame index passed in by the caller */
+		u32 slot = uq->cached_prod++ & uq->mask;
+
+		q->desc[slot].idx	= idx + i;
+		q->desc[slot].len	= sizeof(pkt_data) - 1;
+		q->desc[slot].offset	= 0;
+	}
+
+	u_smp_wmb();
+
+	q->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_uqueue *uq,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int idx;
+	int i, entries;
+
+	entries = xq_nb_avail(uq, ndescs);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = uq->cached_cons++ & uq->mask;
+		descs[i] = r->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		r->ptrs.consumer = uq->cached_cons;
+	}
+
+	return entries;
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+#if DEBUG_HEXDUMP
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("length = %zu\n", length);
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame)
+{
+	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
+	return sizeof(pkt_data) - 1;
+}
+
+static struct xdp_umem *xdp_umem_configure(int sfd)
+{
+	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
+	struct xdp_umem_reg mr;
+	struct xdp_umem *umem;
+	void *bufs;
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+
+	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
+			       NUM_FRAMES * FRAME_SIZE) == 0);
+
+	mr.addr = (__u64)bufs;
+	mr.len = NUM_FRAMES * FRAME_SIZE;
+	mr.frame_size = FRAME_SIZE;
+	mr.frame_headroom = FRAME_HEADROOM;
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
+			   sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
+			   sizeof(int)) == 0);
+
+	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     FQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_FILL_RING);
+	lassert(umem->fq.ring != MAP_FAILED);
+
+	umem->fq.mask = FQ_NUM_DESCS - 1;
+	umem->fq.size = FQ_NUM_DESCS;
+
+	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     CQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_COMPLETION_RING);
+	lassert(umem->cq.ring != MAP_FAILED);
+
+	umem->cq.mask = CQ_NUM_DESCS - 1;
+	umem->cq.size = CQ_NUM_DESCS;
+
+	umem->frames = (char (*)[FRAME_SIZE])bufs;
+	umem->fd = sfd;
+
+	if (opt_bench == BENCH_TXONLY) {
+		int i;
+
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(&umem->frames[i][0]);
+	}
+
+	return umem;
+}
+
+static struct xdpsock *xsk_configure(struct xdp_umem *umem)
+{
+	struct sockaddr_xdp sxdp = {};
+	int sfd, ndescs = NUM_DESCS;
+	struct xdpsock *xsk;
+	bool shared = true;
+	u32 i;
+
+	sfd = socket(PF_XDP, SOCK_RAW, 0);
+	lassert(sfd >= 0);
+
+	xsk = calloc(1, sizeof(*xsk));
+	lassert(xsk);
+
+	xsk->sfd = sfd;
+	xsk->outstanding_tx = 0;
+
+	if (!umem) {
+		shared = false;
+		xsk->umem = xdp_umem_configure(sfd);
+	} else {
+		xsk->umem = umem;
+	}
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
+			   &ndescs, sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
+			   &ndescs, sizeof(int)) == 0);
+
+	/* Rx */
+	xsk->rx.ring = mmap(NULL,
+			    sizeof(struct xdp_ring) +
+			    NUM_DESCS * sizeof(struct xdp_desc),
+			    PROT_READ | PROT_WRITE,
+			    MAP_SHARED | MAP_POPULATE, sfd,
+			    XDP_PGOFF_RX_RING);
+	lassert(xsk->rx.ring != MAP_FAILED);
+
+	if (!shared) {
+		for (i = 0; i < NUM_DESCS / 2; i++)
+			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
+				== 0);
+	}
+
+	/* Tx */
+	xsk->tx.ring = mmap(NULL,
+			 sizeof(struct xdp_ring) +
+			 NUM_DESCS * sizeof(struct xdp_desc),
+			 PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_POPULATE, sfd,
+			 XDP_PGOFF_TX_RING);
+	lassert(xsk->tx.ring != MAP_FAILED);
+
+	xsk->rx.mask = NUM_DESCS - 1;
+	xsk->rx.size = NUM_DESCS;
+
+	xsk->tx.mask = NUM_DESCS - 1;
+	xsk->tx.size = NUM_DESCS;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = opt_ifindex;
+	sxdp.sxdp_queue_id = opt_queue;
+	if (shared) {
+		sxdp.sxdp_flags = XDP_SHARED_UMEM;
+		sxdp.sxdp_shared_umem_fd = umem->fd;
+	}
+
+	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
+
+	return xsk;
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
+	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
+		printf("xdp-skb ");
+	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
+		printf("xdp-drv ");
+	else
+		printf("	");
+
+	if (opt_poll)
+		printf("poll() ");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void dump_stats(void)
+{
+	unsigned long now = get_nsecs();
+	long dt = now - prev_time;
+	int i;
+
+	prev_time = now;
+
+	for (i = 0; i < num_socks; i++) {
+		char *fmt = "%-15s %'-11.0f %'-11lu\n";
+		double rx_pps, tx_pps;
+
+		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
+			 1000000000. / dt;
+		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
+			 1000000000. / dt;
+
+		printf("\n sock%d@", i);
+		print_benchmark(false);
+		printf("\n");
+
+		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
+		       dt / 1000000000.);
+		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
+		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
+
+		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
+		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
+	}
+}
+
+static void *poller(void *arg)
+{
+	(void)arg;
+	for (;;) {
+		sleep(opt_interval);
+		dump_stats();
+	}
+
+	return NULL;
+}
+
+static void int_exit(int sig)
+{
+	(void)sig;
+	dump_stats();
+	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
+	exit(EXIT_SUCCESS);
+}
+
+static struct option long_options[] = {
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"interface", required_argument, 0, 'i'},
+	{"queue", required_argument, 0, 'q'},
+	{"poll", no_argument, 0, 'p'},
+	{"shared-buffer", no_argument, 0, 's'},
+	{"xdp-skb", no_argument, 0, 'S'},
+	{"xdp-native", no_argument, 0, 'N'},
+	{"interval", required_argument, 0, 'n'},
+	{0, 0, 0, 0}
+};
+
+static void usage(const char *prog)
+{
+	const char *str =
+		"  Usage: %s [OPTIONS]\n"
+		"  Options:\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"  -q, --queue=n	Use queue n (default 0)\n"
+		"  -p, --poll		Use poll syscall\n"
+		"  -s, --shared-buffer	Use shared packet buffer\n"
+		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
+		"  -N, --xdp-native=n	Enfore XDP native mode\n"
+		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
+		"\n";
+	fprintf(stderr, str, prog);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		case 'q':
+			opt_queue = atoi(optarg);
+			break;
+		case 's':
+			opt_shared_packet_buffer = 1;
+			break;
+		case 'p':
+			opt_poll = 1;
+			break;
+		case 'S':
+			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
+			break;
+		case 'n':
+			opt_interval = atoi(optarg);
+			break;
+		default:
+			usage(basename(argv[0]));
+		}
+	}
+
+	opt_ifindex = if_nametoindex(opt_if);
+	if (!opt_ifindex) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
+			opt_if);
+		usage(basename(argv[0]));
+	}
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
+		return;
+	lassert(0);
+}
+
+static inline void complete_tx_l2fwd(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+	size_t ndescs;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		 xsk->outstanding_tx;
+
+	/* re-add completed Tx buffers */
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
+	if (rcvd > 0) {
+		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static inline void complete_tx_only(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
+	if (rcvd > 0) {
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static void rx_drop(struct xdpsock *xsk)
+{
+	struct xdp_desc descs[BATCH_SIZE];
+	unsigned int rcvd, i;
+
+	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+	if (!rcvd)
+		return;
+
+	for (i = 0; i < rcvd; i++) {
+		u32 idx = descs[i].idx;
+
+		lassert(idx < NUM_FRAMES);
+#if DEBUG_HEXDUMP
+		char *pkt;
+		char buf[32];
+
+		pkt = xq_get_data(xsk, idx, descs[i].offset);
+		sprintf(buf, "idx=%d", idx);
+		hex_dump(pkt, descs[i].len, buf);
+#endif
+	}
+
+	xsk->rx_npkts += rcvd;
+
+	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
+}
+
+static void rx_drop_all(void)
+{
+	struct pollfd fds[MAX_SOCKS + 1];
+	int i, ret, timeout, nfds = num_socks;
+
+	memset(fds, 0, sizeof(fds));
+
+	for (i = 0; i < num_socks; i++) {
+		fds[i].fd = xsks[i]->sfd;
+		fds[i].events = POLLIN;
+		timeout = 1000; /* 1 second */
+	}
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+		}
+
+		for (i = 0; i < num_socks; i++)
+			rx_drop(xsks[i]);
+	}
+}
+
+static void tx_only(struct xdpsock *xsk)
+{
+	int timeout, ret, nfds = 1;
+	struct pollfd fds[nfds + 1];
+	unsigned int idx = 0;
+
+	memset(fds, 0, sizeof(fds));
+	fds[0].fd = xsk->sfd;
+	fds[0].events = POLLOUT;
+	timeout = 1000; /* 1 second */
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+
+			if (fds[0].fd != xsk->sfd ||
+			    !(fds[0].revents & POLLOUT))
+				continue;
+		}
+
+		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
+			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
+
+			xsk->outstanding_tx += BATCH_SIZE;
+			idx += BATCH_SIZE;
+			idx %= NUM_FRAMES;
+		}
+
+		complete_tx_only(xsk);
+	}
+}
+
+static void l2fwd(struct xdpsock *xsk)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			complete_tx_l2fwd(xsk);
+
+			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			char *pkt = xq_get_data(xsk, descs[i].idx,
+						descs[i].offset);
+
+			swap_mac_addresses(pkt);
+#if DEBUG_HEXDUMP
+			char buf[32];
+			u32 idx = descs[i].idx;
+
+			sprintf(buf, "idx=%d", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		xsk->rx_npkts += rcvd;
+
+		ret = xq_enq(&xsk->tx, descs, rcvd);
+		lassert(ret == 0);
+		xsk->outstanding_tx += rcvd;
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char xdp_filename[256];
+	int i, ret, key = 0;
+	pthread_t pt;
+
+	parse_command_line(argc, argv);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(xdp_filename)) {
+		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
+		fprintf(stderr, "ERROR: link set xdp fd failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
+	if (ret) {
+		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Create sockets... */
+	xsks[num_socks++] = xsk_configure(NULL);
+
+#if RR_LB
+	for (i = 0; i < MAX_SOCKS - 1; i++)
+		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
+#endif
+
+	/* ...and insert them into the map. */
+	for (i = 0; i < num_socks; i++) {
+		key = i;
+		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
+		if (ret) {
+			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+	signal(SIGABRT, int_exit);
+
+	setlocale(LC_ALL, "");
+
+	ret = pthread_create(&pt, NULL, poller, NULL);
+	lassert(ret == 0);
+
+	prev_time = get_nsecs();
+
+	if (opt_bench == BENCH_RXDROP)
+		rx_drop_all();
+	else if (opt_bench == BENCH_TXONLY)
+		tx_only(xsks[0]);
+	else
+		l2fwd(xsks[0]);
+
+	return 0;
+}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets
  2018-05-02 11:01 ` [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets Björn Töpel
@ 2018-05-02 20:59   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-02 20:59 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, willemdebruijn.kernel, daniel, mst, netdev,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	Björn Töpel, brouer


On Wed,  2 May 2018 13:01:36 +0200 Björn Töpel <bjorn.topel@gmail.com> wrote:

> +static void rx_drop(struct xdpsock *xsk)
> +{
> +	struct xdp_desc descs[BATCH_SIZE];
> +	unsigned int rcvd, i;
> +
> +	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +	if (!rcvd)
> +		return;
> +
> +	for (i = 0; i < rcvd; i++) {
> +		u32 idx = descs[i].idx;
> +
> +		lassert(idx < NUM_FRAMES);
> +#if DEBUG_HEXDUMP
> +		char *pkt;
> +		char buf[32];
> +
> +		pkt = xq_get_data(xsk, idx, descs[i].offset);
> +		sprintf(buf, "idx=%d", idx);
> +		hex_dump(pkt, descs[i].len, buf);
> +#endif
> +	}
> +
> +	xsk->rx_npkts += rcvd;
> +
> +	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
> +}

I would really like to see an option that can enable reading the
data/memory in the packet.  Else the test is rather fake...

I hacked it myself manually to read first u32.
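
Roughly like this, inside the rx_drop() loop (quick local hack with a
throwaway 'sum' accumulator, not meant for inclusion):

	char *pkt = xq_get_data(xsk, idx, descs[i].offset);

	sum += *(volatile u32 *)pkt; /* read the first u32 of the payload */
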
 - Before: 10,771,083 pps
 - After:   9,430,741 pps

The slowdown is not as big as I expected, which is good :-)

With perf stat I can see more LLC-load's, but not misses.  It is not
getting registered as a cache-miss that I read data on the remote CPU.

p.s. these tests are with mlx5 (which only has XDP_REDIRECT on the RX-side).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Before:

sudo ~/perf stat -C3 -e L1-icache-load-misses -e cycles -e  instructions -e cache-misses -e   cache-references  -e LLC-store-misses -e LLC-store -e LLC-load-misses -e  LLC-load -r 3 sleep 1

 Performance counter stats for 'CPU(s) 3' (3 runs):

           200,020      L1-icache-load-misses                                         ( +-  0.76% )  (33.31%)
     3,920,754,587      cycles                                                        ( +-  0.14% )  (44.50%)
     3,062,308,209      instructions              #    0.78  insn per cycle           ( +-  0.28% )  (55.65%)
               823      cache-misses              #    0.011 % of all cache refs      ( +- 70.81% )  (66.74%)
         7,587,132      cache-references                                              ( +-  0.48% )  (77.83%)
                 0      LLC-store-misses                                              (77.83%)
           384,401      LLC-store                                                     ( +-  2.97% )  (77.83%)
                15      LLC-load-misses           #    0.00% of all LL-cache hits     ( +-100.00% )  (22.17%)
         3,192,312      LLC-load                                                      ( +-  0.35% )  (22.17%)

       1.001199221 seconds time elapsed                                          ( +-  0.00% )


After:

$ sudo ~/perf stat -C3 -e L1-icache-load-misses -e cycles -e  instructions -e cache-misses -e   cache-references  -e LLC-store-misses -e LLC-store -e LLC-load-misses -e  LLC-load -r 3 sleep 1

 Performance counter stats for 'CPU(s) 3' (3 runs):

           154,921      L1-icache-load-misses                                         ( +-  3.88% )  (33.31%)
     3,924,791,213      cycles                                                        ( +-  0.10% )  (44.50%)
     2,930,116,185      instructions              #    0.75  insn per cycle           ( +-  0.33% )  (55.65%)
               342      cache-misses              #    0.002 % of all cache refs      ( +- 65.52% )  (66.74%)
        15,810,892      cache-references                                              ( +-  0.13% )  (77.83%)
                 0      LLC-store-misses                                              (77.83%)
           925,544      LLC-store                                                     ( +-  2.33% )  (77.83%)
               155      LLC-load-misses           #    0.00% of all LL-cache hits     ( +- 67.22% )  (22.17%)
        12,791,264      LLC-load                                                      ( +-  0.04% )  (22.17%)

       1.001206058 seconds time elapsed                                          ( +-  0.00% )

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (14 preceding siblings ...)
  2018-05-02 11:01 ` [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets Björn Töpel
@ 2018-05-03 13:55 ` Willem de Bruijn
  2018-05-03 15:07 ` David Miller
  2018-05-03 22:49 ` Daniel Borkmann
  17 siblings, 0 replies; 47+ messages in thread
From: Willem de Bruijn @ 2018-05-03 13:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Wed, May 2, 2018 at 1:01 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing

Great patchset, thanks.

> and, in upcoming
> patch sets, zero-copy semantics.

And looking forward to this!

> Thanks: Björn and Magnus
>
> Björn Töpel (7):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (8):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   dev: packet: make packet_direct_xmit a common function
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application and documentation for AF_XDP sockets

For the series

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (15 preceding siblings ...)
  2018-05-03 13:55 ` [PATCH bpf-next v3 00/15] Introducing AF_XDP support Willem de Bruijn
@ 2018-05-03 15:07 ` David Miller
  2018-05-03 22:49 ` Daniel Borkmann
  17 siblings, 0 replies; 47+ messages in thread
From: David Miller @ 2018-05-03 15:07 UTC (permalink / raw)
  To: bjorn.topel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev, bjorn.topel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@gmail.com>
Date: Wed,  2 May 2018 13:01:21 +0200

> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this patch set, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.
 ...

Looks great.

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
                   ` (16 preceding siblings ...)
  2018-05-03 15:07 ` David Miller
@ 2018-05-03 22:49 ` Daniel Borkmann
  2018-05-03 23:38   ` Alexei Starovoitov
  17 siblings, 1 reply; 47+ messages in thread
From: Daniel Borkmann @ 2018-05-03 22:49 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

On 05/02/2018 01:01 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this patch set, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.

+1, would be great to see it land this cycle. Saw few minor nits here
and there but nothing to hold it up, for the series:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks everyone!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-03 22:49 ` Daniel Borkmann
@ 2018-05-03 23:38   ` Alexei Starovoitov
  2018-05-04 11:22     ` Magnus Karlsson
  2018-05-17  6:46     ` Björn Töpel
  0 siblings, 2 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2018-05-03 23:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, mst, netdev, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
> On 05/02/2018 01:01 PM, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> > 
> > This patch set introduces a new address family called AF_XDP that is
> > optimized for high performance packet processing and, in upcoming
> > patch sets, zero-copy semantics. In this patch set, we have removed
> > all zero-copy related code in order to make it smaller, simpler and
> > hopefully more review friendly. This patch set only supports copy-mode
> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> > work has already been accepted. We will publish our zero-copy support
> > for RX and TX on top of his patch sets at a later point in time.
> 
> +1, would be great to see it land this cycle. Saw few minor nits here
> and there but nothing to hold it up, for the series:
> 
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> 
> Thanks everyone!

Great stuff!

Applied to bpf-next, with one condition.
Upcoming zero-copy patches for both RX and TX need to be posted
and reviewed within this release window.
If netdev community as a whole won't be able to agree on the zero-copy
bits we'd need to revert this feature before the next merge window.

Few other minor nits:
patch 3:
+struct xdp_ring {
+       __u32 producer __attribute__((aligned(64)));
+       __u32 consumer __attribute__((aligned(64)));
+};
It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.

patch 5:
+struct sockaddr_xdp {
+       __u16 sxdp_family;
+       __u32 sxdp_ifindex;
Not great to have a hole in uapi struct. Please fix it in the follow up.

patch 7:
Has a lot of synchronize_net(). I think update/delete side
can be improved to avoid them. Otherwise users may unknowingly DoS.

As the next steps I suggest to prioritize the highest to ship
zero-copy rx/tx patches and to add selftests.

Thanks!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-03 23:38   ` Alexei Starovoitov
@ 2018-05-04 11:22     ` Magnus Karlsson
  2018-05-05  0:34       ` Alexei Starovoitov
  2018-05-17  6:46     ` Björn Töpel
  1 sibling, 1 reply; 47+ messages in thread
From: Magnus Karlsson @ 2018-05-04 11:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Willem de Bruijn,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> > From: Björn Töpel <bjorn.topel@intel.com>
>> >
>> > This patch set introduces a new address family called AF_XDP that is
>> > optimized for high performance packet processing and, in upcoming
>> > patch sets, zero-copy semantics. In this patch set, we have removed
>> > all zero-copy related code in order to make it smaller, simpler and
>> > hopefully more review friendly. This patch set only supports copy-mode
>> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> > work has already been accepted. We will publish our zero-copy support
>> > for RX and TX on top of his patch sets at a later point in time.
>>
>> +1, would be great to see it land this cycle. Saw few minor nits here
>> and there but nothing to hold it up, for the series:
>>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>> Thanks everyone!
>
> Great stuff!
>
> Applied to bpf-next, with one condition.
> Upcoming zero-copy patches for both RX and TX need to be posted
> and reviewed within this release window.
> If netdev community as a whole won't be able to agree on the zero-copy
> bits we'd need to revert this feature before the next merge window.

Thanks everyone for reviewing this. Highly appreciated.

Just so we understand the purpose correctly:

1: Do you want to see the ZC patches in order to verify that the user
space API holds? If so, we can produce an additional RFC patch set
using a big chunk of the code that we had in RFC V1. We are not proud of
this code since it is clunky, but it hopefully proves the point that
the uapi stays the same.

2: And/or are you worried about us all (the netdev community) not
agreeing on a way to implement ZC internally in the drivers and the
XDP infrastructure? This is not going to be possible to finish during
this cycle, since we do not like the implementation we had in RFC V1:
it was too intrusive, and we now also have nicer abstractions from
Jesper that we can use and extend to provide a (hopefully) much cleaner
and less intrusive solution.

Just so that we focus on the right proof points.

> Few other minor nits:
> patch 3:
> +struct xdp_ring {
> +       __u32 producer __attribute__((aligned(64)));
> +       __u32 consumer __attribute__((aligned(64)));
> +};
> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.

Agreed.

> patch 5:
> +struct sockaddr_xdp {
> +       __u16 sxdp_family;
> +       __u32 sxdp_ifindex;
> Not great to have a hole in uapi struct. Please fix it in the follow up.

You are correct. Will fix.

> patch 7:
> Has a lot of synchronize_net(). I think the update/delete side
> can be improved to avoid them. Otherwise users may unknowingly DoS.

OK. Could you please elaborate on what kind of DoS attacks can be
performed with this, so we can come up with the right solution here?

Thanks: Magnus

> As the next steps I suggest to prioritize the highest to ship
> zero-copy rx/tx patches and to add selftests.
>
> Thanks!
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 02/15] xsk: add user memory registration support sockopt
  2018-05-02 11:01 ` [PATCH bpf-next v3 02/15] xsk: add user memory registration support sockopt Björn Töpel
@ 2018-05-04 12:34   ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-05-04 12:34 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

On 05/02/2018 01:01 PM, Björn Töpel wrote:
[...]

(A few nits for follow-ups that I didn't get around to sending out last
 night; they are on top of Alexei's remarks.)

> ---
>  include/net/xdp_sock.h      |  31 ++++++
>  include/uapi/linux/if_xdp.h |  34 ++++++
>  net/Makefile                |   1 +
>  net/xdp/Makefile            |   2 +
>  net/xdp/xdp_umem.c          | 245 ++++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xdp_umem.h          |  45 ++++++++
>  net/xdp/xdp_umem_props.h    |  23 +++++
>  net/xdp/xsk.c               | 215 ++++++++++++++++++++++++++++++++++++++
>  8 files changed, 596 insertions(+)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> new file mode 100644
> index 000000000000..94785f5db13e
> --- /dev/null
> +++ b/include/net/xdp_sock.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0

I think this should just be a single-line comment; at least that's the case
for the majority of files in the tree, so probably good to stick to that as well ...

$ git grep -n "SPDX-License-Identifier" | grep "\/\*" | grep -v "\*\/" | wc -l
20
$ git grep -n "SPDX-License-Identifier" | grep "\/\*" | grep "\*\/" | wc -l
7742

... and it would probably also make sense to get rid of the boilerplate
text below. (Applies to the other added files in this series as well; only
mentioning it here once.)
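
That is, e.g. for this header simply:

/* SPDX-License-Identifier: GPL-2.0 */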

> + * AF_XDP internal functions
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef _LINUX_XDP_SOCK_H
> +#define _LINUX_XDP_SOCK_H
> +
> +#include <linux/mutex.h>
> +#include <net/sock.h>
> +
> +struct xdp_umem;
> +
> +struct xdp_sock {
> +	/* struct sock must be the first member of struct xdp_sock */
> +	struct sock sk;
> +	struct xdp_umem *umem;
> +	/* Protects multiple processes in the control path */
> +	struct mutex mutex;
> +};
> +
> +#endif /* _LINUX_XDP_SOCK_H */
[...]
> diff --git a/net/Makefile b/net/Makefile
> index a6147c61b174..77aaddedbd29 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -85,3 +85,4 @@ obj-y				+= l3mdev/
>  endif
>  obj-$(CONFIG_QRTR)		+= qrtr/
>  obj-$(CONFIG_NET_NCSI)		+= ncsi/
> +obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> new file mode 100644
> index 000000000000..a5d736640a0f
> --- /dev/null
> +++ b/net/xdp/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +

Nit: newline at end of file.

> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> new file mode 100644
> index 000000000000..ec8b3552be44
> --- /dev/null
> +++ b/net/xdp/xdp_umem.c
[...]
> +
> +#include "xdp_umem.h"
> +
> +#define XDP_UMEM_MIN_FRAME_SIZE 2048
> +
> +int xdp_umem_create(struct xdp_umem **umem)
> +{
> +	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
> +
> +	if (!(*umem))

Nit: the extra () are not needed. You also have the extra brackets in a
couple of other places/conditionals.

> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> +{
> +	unsigned int i;
> +
> +	if (umem->pgs) {

All call-sites of this already ensure that umem->pgs is non-NULL, so the
check here is redundant.

> +		for (i = 0; i < umem->npgs; i++) {
> +			struct page *page = umem->pgs[i];
> +
> +			set_page_dirty_lock(page);
> +			put_page(page);
> +		}
> +
> +		kfree(umem->pgs);
> +		umem->pgs = NULL;
> +	}
> +}
> +
> +static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
> +{
> +	if (umem->user) {
> +		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
> +		free_uid(umem->user);
> +	}
> +}
> +
> +static void xdp_umem_release(struct xdp_umem *umem)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +
> +	if (umem->pgs) {
> +		xdp_umem_unpin_pages(umem);
> +
> +		task = get_pid_task(umem->pid, PIDTYPE_PID);
> +		put_pid(umem->pid);
> +		if (!task)
> +			goto out;
> +		mm = get_task_mm(task);
> +		put_task_struct(task);
> +		if (!mm)
> +			goto out;
> +
> +		mmput(mm);
> +		umem->pgs = NULL;
> +	}
> +
> +	xdp_umem_unaccount_pages(umem);
> +out:
> +	kfree(umem);
> +}
> +
> +static void xdp_umem_release_deferred(struct work_struct *work)
> +{
> +	struct xdp_umem *umem = container_of(work, struct xdp_umem, work);
> +
> +	xdp_umem_release(umem);
> +}
> +
> +void xdp_get_umem(struct xdp_umem *umem)
> +{
> +	atomic_inc(&umem->users);
> +}
> +
> +void xdp_put_umem(struct xdp_umem *umem)
> +{
> +	if (!umem)
> +		return;
> +
> +	if (atomic_dec_and_test(&umem->users)) {
> +		INIT_WORK(&umem->work, xdp_umem_release_deferred);
> +		schedule_work(&umem->work);
> +	}
> +}
> +
> +static int xdp_umem_pin_pages(struct xdp_umem *umem)
> +{
> +	unsigned int gup_flags = FOLL_WRITE;
> +	long npgs;
> +	int err;
> +
> +	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
> +	if (!umem->pgs)
> +		return -ENOMEM;
> +
> +	down_write(&current->mm->mmap_sem);
> +	npgs = get_user_pages(umem->address, umem->npgs,
> +			      gup_flags, &umem->pgs[0], NULL);
> +	up_write(&current->mm->mmap_sem);
> +
> +	if (npgs != umem->npgs) {
> +		if (npgs >= 0) {
> +			umem->npgs = npgs;
> +			err = -ENOMEM;
> +			goto out_pin;
> +		}
> +		err = npgs;
> +		goto out_pgs;
> +	}
> +	return 0;
> +
> +out_pin:
> +	xdp_umem_unpin_pages(umem);
> +out_pgs:
> +	kfree(umem->pgs);
> +	umem->pgs = NULL;
> +	return err;
> +}
> +
> +static int xdp_umem_account_pages(struct xdp_umem *umem)
> +{
> +	unsigned long lock_limit, new_npgs, old_npgs;
> +
> +	if (capable(CAP_IPC_LOCK))
> +		return 0;
> +
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	umem->user = get_uid(current_user());
> +
> +	do {
> +		old_npgs = atomic_long_read(&umem->user->locked_vm);
> +		new_npgs = old_npgs + umem->npgs;
> +		if (new_npgs > lock_limit) {
> +			free_uid(umem->user);
> +			umem->user = NULL;
> +			return -ENOBUFS;
> +		}
> +	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
> +				     new_npgs) != old_npgs);
> +	return 0;
> +}
> +
> +int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> +	u64 addr = mr->addr, size = mr->len;
> +	unsigned int nframes, nfpp;
> +	int size_chk, err;
> +
> +	if (!umem)
> +		return -EINVAL;

Wouldn't it be better to remove this sort of defensive check (here and in
other places)? Eventually they might only end up hiding potential bugs rather
than getting them noticed. The only call-site is in xsk_setsockopt(), where you do:

[...]
                mutex_lock(&xs->mutex);
                err = xdp_umem_create(&umem);

                err = xdp_umem_reg(umem, &mr);
                if (err) {
                        kfree(umem);
                        mutex_unlock(&xs->mutex);
                        return err;
                }
[...]

It seems more intuitive and easier to audit to bail out right where
xdp_umem_create() fails rather than relying on the check inside
xdp_umem_reg(); the test for !umem can then be removed.
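
E.g. something like this instead (just a sketch):

                mutex_lock(&xs->mutex);
                err = xdp_umem_create(&umem);
                if (err) {
                        mutex_unlock(&xs->mutex);
                        return err;
                }

                err = xdp_umem_reg(umem, &mr);
                [...]

This also stops ignoring the error value from xdp_umem_create().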

> +	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +		/* Strictly speaking we could support this, if:
> +		 * - huge pages, or*
> +		 * - using an IOMMU, or
> +		 * - making sure the memory area is consecutive
> +		 * but for now, we simply say "computer says no".
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(frame_size))
> +		return -EINVAL;
> +
> +	if (!PAGE_ALIGNED(addr)) {
> +		/* Memory area has to be page size aligned. For
> +		 * simplicity, this might change.
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if ((addr + size) < addr)
> +		return -EINVAL;
> +
> +	nframes = size / frame_size;
> +	if (nframes == 0 || nframes > UINT_MAX)
> +		return -EINVAL;
> +
> +	nfpp = PAGE_SIZE / frame_size;
> +	if (nframes < nfpp || nframes % nfpp)
> +		return -EINVAL;
> +
> +	frame_headroom = ALIGN(frame_headroom, 64);
> +
> +	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
> +	if (size_chk < 0)
> +		return -EINVAL;
> +
> +	umem->pid = get_task_pid(current, PIDTYPE_PID);
> +	umem->size = (size_t)size;
> +	umem->address = (unsigned long)addr;
> +	umem->props.frame_size = frame_size;
> +	umem->props.nframes = nframes;
> +	umem->frame_headroom = frame_headroom;
> +	umem->npgs = size / PAGE_SIZE;
> +	umem->pgs = NULL;
> +	umem->user = NULL;
> +
> +	umem->frame_size_log2 = ilog2(frame_size);
> +	umem->nfpp_mask = nfpp - 1;
> +	umem->nfpplog2 = ilog2(nfpp);
> +	atomic_set(&umem->users, 1);
> +
> +	err = xdp_umem_account_pages(umem);
> +	if (err)
> +		goto out;
> +
> +	err = xdp_umem_pin_pages(umem);
> +	if (err)
> +		goto out_account;
> +	return 0;
> +
> +out_account:
> +	xdp_umem_unaccount_pages(umem);
> +out:
> +	put_pid(umem->pid);
> +	return err;
> +}
[...]
> +#ifndef XDP_UMEM_H_
> +#define XDP_UMEM_H_
> +
> +#include <linux/mm.h>
> +#include <linux/if_xdp.h>
> +#include <linux/workqueue.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xdp_umem {
> +	struct page **pgs;
> +	struct xdp_umem_props props;
> +	u32 npgs;
> +	u32 frame_headroom;
> +	u32 nfpp_mask;
> +	u32 nfpplog2;
> +	u32 frame_size_log2;
> +	struct user_struct *user;
> +	struct pid *pid;
> +	unsigned long address;
> +	size_t size;
> +	atomic_t users;

Convert to refcount_t?
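
I.e. something like (sketch):

	#include <linux/refcount.h>

	refcount_t users;

with refcount_set(&umem->users, 1) in xdp_umem_reg(), refcount_inc() in
xdp_get_umem(), and refcount_dec_and_test() in xdp_put_umem(); that gives
the saturation/underflow checks for free.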

> +	struct work_struct work;
> +};
> +
> +int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
> +void xdp_get_umem(struct xdp_umem *umem);
> +void xdp_put_umem(struct xdp_umem *umem);
> +int xdp_umem_create(struct xdp_umem **umem);
> +
> +#endif /* XDP_UMEM_H_ */
[...]
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> new file mode 100644
> index 000000000000..84e0e867febb
[...]
> +static struct xdp_sock *xdp_sk(struct sock *sk)
> +{
> +	return (struct xdp_sock *)sk;
> +}
> +
> +static int xsk_release(struct socket *sock)
> +{
> +	struct sock *sk = sock->sk;
> +	struct net *net;
> +
> +	if (!sk)
> +		return 0;
> +
> +	net = sock_net(sk);
> +
> +	local_bh_disable();
> +	sock_prot_inuse_add(net, sk->sk_prot, -1);
> +	local_bh_enable();
> +
> +	sock_orphan(sk);
> +	sock->sk = NULL;
> +
> +	sk_refcnt_debug_release(sk);
> +	sock_put(sk);
> +
> +	return 0;
> +}
> +
> +static int xsk_setsockopt(struct socket *sock, int level, int optname,
> +			  char __user *optval, unsigned int optlen)
> +{
> +	struct sock *sk = sock->sk;
> +	struct xdp_sock *xs = xdp_sk(sk);
> +	int err;
> +
> +	if (level != SOL_XDP)
> +		return -ENOPROTOOPT;
> +
> +	switch (optname) {
> +	case XDP_UMEM_REG:
> +	{
> +		struct xdp_umem_reg mr;
> +		struct xdp_umem *umem;
> +
> +		if (xs->umem)

Does this need READ_ONCE(), or does it need to go under the lock?
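
I.e. (sketch):

		if (READ_ONCE(xs->umem))
			return -EBUSY;

or alternatively move the check under xs->mutex so it cannot race with
a concurrent XDP_UMEM_REG.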

> +			return -EBUSY;
> +
> +		if (copy_from_user(&mr, optval, sizeof(mr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		err = xdp_umem_create(&umem);
> +
> +		err = xdp_umem_reg(umem, &mr);
> +		if (err) {
> +			kfree(umem);

The kfree() here begs for a proper destructor, such that once you extend
xdp_umem_create() these spots are not missed when further cleanup has to
be performed on dismantle.
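
E.g. a trivial one for now (sketch; the name is just a suggestion):

static void xdp_umem_destroy(struct xdp_umem *umem)
{
	kfree(umem);
}

Then the error path calls xdp_umem_destroy(umem), and future cleanup
only needs to be added in one place.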

> +			mutex_unlock(&xs->mutex);
> +			return err;
> +		}
> +
> +		/* Make sure umem is ready before it can be seen by others */
> +		smp_wmb();
> +
> +		xs->umem = umem;
> +		mutex_unlock(&xs->mutex);
> +		return 0;
> +	}
> +	default:
> +		break;
> +	}
> +
> +	return -ENOPROTOOPT;
> +}
> +
> +static struct proto xsk_proto = {
> +	.name =		"XDP",
> +	.owner =	THIS_MODULE,
> +	.obj_size =	sizeof(struct xdp_sock),
> +};
> +
> +static const struct proto_ops xsk_proto_ops = {
> +	.family =	PF_XDP,
> +	.owner =	THIS_MODULE,
> +	.release =	xsk_release,
> +	.bind =		sock_no_bind,
> +	.connect =	sock_no_connect,
> +	.socketpair =	sock_no_socketpair,
> +	.accept =	sock_no_accept,
> +	.getname =	sock_no_getname,
> +	.poll =		sock_no_poll,
> +	.ioctl =	sock_no_ioctl,
> +	.listen =	sock_no_listen,
> +	.shutdown =	sock_no_shutdown,
> +	.setsockopt =	xsk_setsockopt,
> +	.getsockopt =	sock_no_getsockopt,
> +	.sendmsg =	sock_no_sendmsg,
> +	.recvmsg =	sock_no_recvmsg,
> +	.mmap =		sock_no_mmap,
> +	.sendpage =	sock_no_sendpage,

Nit: would have been nice to properly align the '='

> +};
> +
> +static void xsk_destruct(struct sock *sk)
> +{
> +	struct xdp_sock *xs = xdp_sk(sk);
> +
> +	if (!sock_flag(sk, SOCK_DEAD))
> +		return;
> +
> +	xdp_put_umem(xs->umem);
> +
> +	sk_refcnt_debug_dec(sk);
> +}
> +
> +static int xsk_create(struct net *net, struct socket *sock, int protocol,
> +		      int kern)
> +{
> +	struct sock *sk;
> +	struct xdp_sock *xs;
> +
> +	if (!ns_capable(net->user_ns, CAP_NET_RAW))
> +		return -EPERM;
> +	if (sock->type != SOCK_RAW)
> +		return -ESOCKTNOSUPPORT;
> +
> +	if (protocol)
> +		return -EPROTONOSUPPORT;
> +
> +	sock->state = SS_UNCONNECTED;
[...]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 03/15] xsk: add umem fill queue support and mmap
  2018-05-02 11:01 ` [PATCH bpf-next v3 03/15] xsk: add umem fill queue support and mmap Björn Töpel
@ 2018-05-04 12:49   ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-05-04 12:49 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang

On 05/02/2018 01:01 PM, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_RING. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_RING) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> v2: Fixed potential crash in xsk_mmap.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>  net/xdp/Makefile            |  2 +-
>  net/xdp/xdp_umem.c          |  5 ++++
>  net/xdp/xdp_umem.h          |  2 ++
>  net/xdp/xsk.c               | 65 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk_queue.h         | 38 ++++++++++++++++++++++++++
>  7 files changed, 183 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG			3
> +#define XDP_UMEM_FILL_RING		4
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>  	__u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
> +
> +struct xdp_ring {
> +	__u32 producer __attribute__((aligned(64)));
> +	__u32 consumer __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> +	struct xdp_ring ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index ec8b3552be44..e1f627d0cc1c 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -65,6 +65,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>  	struct task_struct *task;
>  	struct mm_struct *mm;
>  
> +	if (umem->fq) {
> +		xskq_destroy(umem->fq);
> +		umem->fq = NULL;
> +	}
> +
>  	if (umem->pgs) {
>  		xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 4597ae81a221..25634b8a5c6f 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -19,9 +19,11 @@
>  #include <linux/if_xdp.h>
>  #include <linux/workqueue.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> +	struct xsk_queue *fq;
>  	struct page **pgs;
>  	struct xdp_umem_props props;
>  	u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 84e0e867febb..da67a3c5c1c9 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include <linux/netdevice.h>
>  #include <net/xdp_sock.h>
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  static struct xdp_sock *xdp_sk(struct sock *sk)
> @@ -39,6 +40,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>  	return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> +	struct xsk_queue *q;
> +
> +	if (entries == 0 || *queue || !is_power_of_2(entries))
> +		return -EINVAL;
> +
> +	q = xskq_create(entries);
> +	if (!q)
> +		return -ENOMEM;
> +
> +	*queue = q;
> +	return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>  	struct sock *sk = sock->sk;
> @@ -101,6 +117,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return 0;
>  	}
> +	case XDP_UMEM_FILL_RING:
> +	{
> +		struct xsk_queue **q;
> +		int entries;
> +
> +		if (!xs->umem)
> +			return -EINVAL;

(Same here as previously mentioned.)

> +		if (copy_from_user(&entries, optval, sizeof(entries)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		q = &xs->umem->fq;
> +		err = xsk_init_queue(entries, q);
> +		mutex_unlock(&xs->mutex);
> +		return err;
> +	}
>  	default:
>  		break;
>  	}
> @@ -108,6 +141,36 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  	return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, struct socket *sock,
> +		    struct vm_area_struct *vma)
> +{
> +	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
> +	unsigned long size = vma->vm_end - vma->vm_start;
> +	struct xdp_sock *xs = xdp_sk(sock->sk);
> +	struct xsk_queue *q = NULL;
> +	unsigned long pfn;
> +	struct page *qpg;
> +
> +	if (!xs->umem)
> +		return -EINVAL;
> +
> +	if (offset == XDP_UMEM_PGOFF_FILL_RING)
> +		q = xs->umem->fq;
> +	else
> +		return -EINVAL;
> +
> +	if (!q)
> +		return -EINVAL;

Nit: since q is initialized to NULL above, this could be simplified to:

	if (offset == XDP_UMEM_PGOFF_FILL_RING)
		q = xs->umem->fq;
	if (!q)
		return -EINVAL;

> +
> +	qpg = virt_to_head_page(q->ring);
> +	if (size > (PAGE_SIZE << compound_order(qpg)))
> +		return -EINVAL;
> +
> +	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
> +	return remap_pfn_range(vma, vma->vm_start, pfn,
> +			       size, vma->vm_page_prot);
> +}
> +
>  static struct proto xsk_proto = {
>  	.name =		"XDP",
>  	.owner =	THIS_MODULE,
> @@ -131,7 +194,7 @@ static const struct proto_ops xsk_proto_ops = {
>  	.getsockopt =	sock_no_getsockopt,
>  	.sendmsg =	sock_no_sendmsg,
>  	.recvmsg =	sock_no_recvmsg,
> -	.mmap =		sock_no_mmap,
> +	.mmap =		xsk_mmap,
>  	.sendpage =	sock_no_sendpage,
>  };
>  

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 06/15] xsk: add Rx receive functions and poll support
  2018-05-02 11:01 ` [PATCH bpf-next v3 06/15] xsk: add Rx receive functions and poll support Björn Töpel
@ 2018-05-04 12:59   ` Daniel Borkmann
  2018-05-22  7:42     ` Björn Töpel
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel Borkmann @ 2018-05-04 12:59 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

On 05/02/2018 01:01 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> Here the actual receive functions of AF_XDP are implemented, that in a
> later commit, will be called from the XDP layers.
> 
> There's one set of functions for the XDP_DRV side and another for
> XDP_SKB (generic).
> 
> A new XDP API, xdp_return_buff, is also introduced.
> 
> Adding xdp_return_buff, which is analogous to xdp_return_frame, but
> acts upon a struct xdp_buff. The API will be used by AF_XDP in future
> commits.
> 
> Support for the poll syscall is also implemented.
> 
> v2: xskq_validate_id did not update cons_tail.
>     The entries variable was calculated twice in xskq_nb_avail.
>     Squashed xdp_return_buff commit.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/net/xdp.h      |   1 +
>  include/net/xdp_sock.h |  22 ++++++++++
>  net/core/xdp.c         |  15 +++++--
>  net/xdp/xdp_umem.h     |  18 ++++++++
>  net/xdp/xsk.c          |  73 ++++++++++++++++++++++++++++++-
>  net/xdp/xsk_queue.h    | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 238 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 137ad5f9f40f..0b689cf561c7 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -104,6 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>  }
>  
>  void xdp_return_frame(struct xdp_frame *xdpf);
> +void xdp_return_buff(struct xdp_buff *xdp);
>  
>  int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
>  		     struct net_device *dev, u32 queue_index);
[...]
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index bf2c97b87992..4e1e6c581e1d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -41,6 +41,74 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>  	return (struct xdp_sock *)sk;
>  }
>  
> +static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> +{
> +	u32 *id, len = xdp->data_end - xdp->data;
> +	void *buffer;
> +	int err = 0;
> +
> +	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
> +		return -EINVAL;
> +
> +	id = xskq_peek_id(xs->umem->fq);
> +	if (!id)
> +		return -ENOSPC;
> +
> +	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
> +	memcpy(buffer, xdp->data, len);
> +	err = xskq_produce_batch_desc(xs->rx, *id, len,
> +				      xs->umem->frame_headroom);
> +	if (!err)
> +		xskq_discard_id(xs->umem->fq);
> +
> +	return err;
> +}
> +
> +int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> +{
> +	int err;
> +
> +	err = __xsk_rcv(xs, xdp);
> +	if (likely(!err))
> +		xdp_return_buff(xdp);
> +	else
> +		xs->rx_dropped++;

This is triggered from __bpf_tx_xdp_map() -> __xsk_map_redirect().
Should this be a percpu counter instead?
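
E.g. (sketch, assuming a hypothetical u64 __percpu *rx_dropped member):

	xs->rx_dropped = alloc_percpu(u64);	/* at socket creation */
	[...]
	this_cpu_inc(*xs->rx_dropped);		/* in the drop paths */

plus a free_percpu() on release, with the per-cpu values summed when the
statistic is read out.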

> +	return err;
> +}
> +
> +void xsk_flush(struct xdp_sock *xs)
> +{
> +	xskq_produce_flush_desc(xs->rx);
> +	xs->sk.sk_data_ready(&xs->sk);
> +}
> +
> +int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> +{
> +	int err;
> +
> +	err = __xsk_rcv(xs, xdp);
> +	if (!err)
> +		xsk_flush(xs);
> +	else
> +		xs->rx_dropped++;
> +
> +	return err;
> +}
> +

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-04 11:22     ` Magnus Karlsson
@ 2018-05-05  0:34       ` Alexei Starovoitov
  2018-05-07  9:13         ` Magnus Karlsson
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2018-05-05  0:34 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Daniel Borkmann, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Willem de Bruijn,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
> >> > From: Björn Töpel <bjorn.topel@intel.com>
> >> >
> >> > This patch set introduces a new address family called AF_XDP that is
> >> > optimized for high performance packet processing and, in upcoming
> >> > patch sets, zero-copy semantics. In this patch set, we have removed
> >> > all zero-copy related code in order to make it smaller, simpler and
> >> > hopefully more review friendly. This patch set only supports copy-mode
> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> >> > work has already been accepted. We will publish our zero-copy support
> >> > for RX and TX on top of his patch sets at a later point in time.
> >>
> >> +1, would be great to see it land this cycle. Saw few minor nits here
> >> and there but nothing to hold it up, for the series:
> >>
> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> >>
> >> Thanks everyone!
> >
> > Great stuff!
> >
> > Applied to bpf-next, with one condition.
> > Upcoming zero-copy patches for both RX and TX need to be posted
> > and reviewed within this release window.
> > If netdev community as a whole won't be able to agree on the zero-copy
> > bits we'd need to revert this feature before the next merge window.
> 
> Thanks everyone for reviewing this. Highly appreciated.
> 
> Just so we understand the purpose correctly:
> 
> 1: Do you want to see the ZC patches in order to verify that the user
> space API holds? If so, we can produce an additional RFC  patch set
> using a big chunk of code that we had in RFC V1. We are not proud of
> this code since it is clunky, but it hopefully proves the point with
> the uapi being the same.
> 
> 2: And/Or are you worried about us all (the netdev community) not
> agreeing on a way to implement ZC internally in the drivers and the
> XDP infrastructure? This is not going to be possible to finish during
> this cycle since we do not like the implementation we had in RFC V1.
> Too intrusive and now we also have nicer abstractions from Jesper that
> we can use and extend to provide a (hopefully) much cleaner and less
> intrusive solution.

short answer: both.

Cleanliness and performance of the ZC code are not as important as
getting the API right. The main concern is that during the ZC review
process we will find out that the existing API has issues, so we have
to do this exercise before the merge window.
And an RFC won't fly. Send the patches for real. They have to go
through proper code review. The hackers of the netdev community
can accept a partial, or a bit unclean, or slightly inefficient
implementation, since it can be and will be improved later, but the
API cannot be changed once it goes into an official release.

Here is an example of an API concern:
this patch set added the shared umem concept. It sounds good in theory,
but will it perform well with ZC? Earlier RFCs didn't have that
feature. If it won't perform well, then it shouldn't be in the tree.
The key reason to let AF_XDP into the tree is its performance promise.
If it doesn't perform, we should rip it out and redesign.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-05  0:34       ` Alexei Starovoitov
@ 2018-05-07  9:13         ` Magnus Karlsson
  2018-05-07 13:09           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 47+ messages in thread
From: Magnus Karlsson @ 2018-05-07  9:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Willem de Bruijn,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
>> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> >> > From: Björn Töpel <bjorn.topel@intel.com>
>> >> >
>> >> > This patch set introduces a new address family called AF_XDP that is
>> >> > optimized for high performance packet processing and, in upcoming
>> >> > patch sets, zero-copy semantics. In this patch set, we have removed
>> >> > all zero-copy related code in order to make it smaller, simpler and
>> >> > hopefully more review friendly. This patch set only supports copy-mode
>> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> >> > work has already been accepted. We will publish our zero-copy support
>> >> > for RX and TX on top of his patch sets at a later point in time.
>> >>
>> >> +1, would be great to see it land this cycle. Saw few minor nits here
>> >> and there but nothing to hold it up, for the series:
>> >>
>> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> >>
>> >> Thanks everyone!
>> >
>> > Great stuff!
>> >
>> > Applied to bpf-next, with one condition.
>> > Upcoming zero-copy patches for both RX and TX need to be posted
>> > and reviewed within this release window.
>> > If netdev community as a whole won't be able to agree on the zero-copy
>> > bits we'd need to revert this feature before the next merge window.
>>
>> Thanks everyone for reviewing this. Highly appreciated.
>>
>> Just so we understand the purpose correctly:
>>
>> 1: Do you want to see the ZC patches in order to verify that the user
>> space API holds? If so, we can produce an additional RFC  patch set
>> using a big chunk of code that we had in RFC V1. We are not proud of
>> this code since it is clunky, but it hopefully proves the point with
>> the uapi being the same.
>>
>> 2: And/Or are you worried about us all (the netdev community) not
>> agreeing on a way to implement ZC internally in the drivers and the
>> XDP infrastructure? This is not going to be possible to finish during
>> this cycle since we do not like the implementation we had in RFC V1.
>> Too intrusive and now we also have nicer abstractions from Jesper that
>> we can use and extend to provide a (hopefully) much cleaner and less
>> intrusive solution.
>
> short answer: both.
>
> Cleanliness and performance of the ZC code is not as important as
> getting API right. The main concern that during ZC review process
> we will find out that existing API has issues, so we have to
> do this exercise before the merge window.
> And RFC won't fly. Send the patches for real. They have to go
> through the proper code review. The hackers of netdev community
> can accept a partial, or a bit unclean, or slightly inefficient
> implementation, since it can be and will be improved later,
> but API we cannot change once it goes into official release.
>
> Here is the example of API concern:
> this patch set added shared umem concept. It sounds good in theory,
> but will it perform well with ZC ? Earlier RFCs didn't have that
> feature. If it won't perform well, then it shouldn't be in the tree.
> The key reason to let AF_XDP into the tree is its performance promise.
> If it doesn't perform we should rip it out and redesign.

That is a fair point. We will try to produce patch sets for zero-copy
RX and TX using the latest interfaces within this merge window. Just
note that we will focus on this for the next week(s) instead of the
review items that you and Daniel Borkmann submitted. If we get those
patch sets out in time and we agree that they are a possible way
forward, then we will produce patches with your fixes. They were mainly
small items, so it should be quick.

/Magnus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-07  9:13         ` Magnus Karlsson
@ 2018-05-07 13:09           ` Jesper Dangaard Brouer
  2018-05-07 19:47             ` Björn Töpel
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-07 13:09 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Willem de Bruijn,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z, brouer

On Mon, 7 May 2018 11:13:58 +0200
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:  
> >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:  
> >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:  
> >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:  
> >> >> > From: Björn Töpel <bjorn.topel@intel.com>
> >> >> >
> >> >> > This patch set introduces a new address family called AF_XDP that is
> >> >> > optimized for high performance packet processing and, in upcoming
> >> >> > patch sets, zero-copy semantics. In this patch set, we have removed
> >> >> > all zero-copy related code in order to make it smaller, simpler and
> >> >> > hopefully more review friendly. This patch set only supports copy-mode
> >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> >> >> > work has already been accepted. We will publish our zero-copy support
> >> >> > for RX and TX on top of his patch sets at a later point in time.  
> >> >>
> >> >> +1, would be great to see it land this cycle. Saw few minor nits here
> >> >> and there but nothing to hold it up, for the series:
> >> >>
> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> >> >>
> >> >> Thanks everyone!  
> >> >
> >> > Great stuff!
> >> >
> >> > Applied to bpf-next, with one condition.
> >> > Upcoming zero-copy patches for both RX and TX need to be posted
> >> > and reviewed within this release window.
> >> > If netdev community as a whole won't be able to agree on the zero-copy
> >> > bits we'd need to revert this feature before the next merge window.  
> >>
> >> Thanks everyone for reviewing this. Highly appreciated.
> >>
> >> Just so we understand the purpose correctly:
> >>
> >> 1: Do you want to see the ZC patches in order to verify that the user
> >> space API holds? If so, we can produce an additional RFC  patch set
> >> using a big chunk of code that we had in RFC V1. We are not proud of
> >> this code since it is clunky, but it hopefully proves the point with
> >> the uapi being the same.
> >>
> >> 2: And/Or are you worried about us all (the netdev community) not
> >> agreeing on a way to implement ZC internally in the drivers and the
> >> XDP infrastructure? This is not going to be possible to finish during
> >> this cycle since we do not like the implementation we had in RFC V1.
> >> Too intrusive and now we also have nicer abstractions from Jesper that
> >> we can use and extend to provide a (hopefully) much cleaner and less
> >> intrusive solution.  
> >
> > short answer: both.
> >
> > Cleanliness and performance of the ZC code is not as important as
> > getting API right. The main concern that during ZC review process
> > we will find out that existing API has issues, so we have to
> > do this exercise before the merge window.
> > And RFC won't fly. Send the patches for real. They have to go
> > through the proper code review. The hackers of netdev community
> > can accept a partial, or a bit unclean, or slightly inefficient
> > implementation, since it can be and will be improved later,
> > but API we cannot change once it goes into official release.
> >
> > Here is the example of API concern:
> > this patch set added shared umem concept. It sounds good in theory,
> > but will it perform well with ZC ? Earlier RFCs didn't have that
> > feature. If it won't perform well, then it shouldn't be in the tree.
> > The key reason to let AF_XDP into the tree is its performance promise.
> > If it doesn't perform we should rip it out and redesign.  
> 
> That is a fair point. We will try to produce patch sets for zero-copy
> RX and TX using the latest interfaces within this merge window. Just
> note that we will focus on this for the next week(s) instead of the
> review items that you and Daniel Borkmann submitted. If we get those
> patch sets out in time and we agree that they are a possible way
> forward, then we produce patches with your fixes. It was mainly small
> items, so should be quick.

I would like to see you create a new xdp_mem_type for this new
zero-copy memory. This will allow other XDP redirect methods/types (e.g.
devmap and cpumap) to react appropriately when receiving a zero-copy
frame.
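
E.g. (sketch; the name is just a suggestion):

enum xdp_mem_type {
	MEM_TYPE_PAGE_SHARED,
	MEM_TYPE_PAGE_ORDER0,
	MEM_TYPE_PAGE_POOL,
	MEM_TYPE_ZERO_COPY,	/* new */
	MEM_TYPE_MAX,
};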

For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call
without (first) copying (into a newly allocated page), by arguing that
if an xsk-userspace app modifies a frame it's not allowed to, then it is
simply a bug in the program. (Note, this would also allow using the
ndo_xdp_xmit call for TX from xsk-userspace.)

For cpumap, it is hard to avoid a copy, but I'm hoping we could delay
the copy (and the alloc of the mem dest area) until we are on the remote
CPU. This is already the principle of cpumap: moving the allocation of
the SKB to the remote CPU.

For ZC to interact with the XDP redirect-core and return API, the
zero-copy memory type/allocator needs to provide an area for the
xdp_frame data to be stored in (as we cannot allow using the top of the
frame like the non-zero-copy variants do), and extend xdp_frame with a
ZC umem-id. I imagine we can avoid any dynamic allocations, as we know
the number of frames upfront (at bind and XDP_UMEM_REG time); e.g.
pre-alloc in the xdp_umem_reg() call, and have an
xdp_umem_get_xdp_frame() lookup func.
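
E.g. (sketch; assumes a hypothetical xdp_frames array, pre-allocated
with one entry per umem frame at XDP_UMEM_REG time):

static inline struct xdp_frame *
xdp_umem_get_xdp_frame(struct xdp_umem *umem, u32 frame_id)
{
	return &umem->xdp_frames[frame_id];
}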

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-07 13:09           ` Jesper Dangaard Brouer
@ 2018-05-07 19:47             ` Björn Töpel
  0 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-07 19:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Magnus Karlsson, Alexei Starovoitov, Daniel Borkmann, Karlsson,
	Magnus, Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Willem de Bruijn, Michael S. Tsirkin,
	Network Development, Björn Töpel, michael.lundkvist,
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z

2018-05-07 15:09 GMT+02:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Mon, 7 May 2018 11:13:58 +0200
> Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
>> On Sat, May 5, 2018 at 2:34 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
>> >> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
>> >> <alexei.starovoitov@gmail.com> wrote:
>> >> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> >> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> >> >> > From: Björn Töpel <bjorn.topel@intel.com>
>> >> >> >
>> >> >> > This patch set introduces a new address family called AF_XDP that is
>> >> >> > optimized for high performance packet processing and, in upcoming
>> >> >> > patch sets, zero-copy semantics. In this patch set, we have removed
>> >> >> > all zero-copy related code in order to make it smaller, simpler and
>> >> >> > hopefully more review friendly. This patch set only supports copy-mode
>> >> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> >> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> >> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> >> >> > work has already been accepted. We will publish our zero-copy support
>> >> >> > for RX and TX on top of his patch sets at a later point in time.
>> >> >>
>> >> >> +1, would be great to see it land this cycle. Saw few minor nits here
>> >> >> and there but nothing to hold it up, for the series:
>> >> >>
>> >> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> >> >>
>> >> >> Thanks everyone!
>> >> >
>> >> > Great stuff!
>> >> >
>> >> > Applied to bpf-next, with one condition.
>> >> > Upcoming zero-copy patches for both RX and TX need to be posted
>> >> > and reviewed within this release window.
>> >> > If netdev community as a whole won't be able to agree on the zero-copy
>> >> > bits we'd need to revert this feature before the next merge window.
>> >>
>> >> Thanks everyone for reviewing this. Highly appreciated.
>> >>
>> >> Just so we understand the purpose correctly:
>> >>
>> >> 1: Do you want to see the ZC patches in order to verify that the user
>> >> space API holds? If so, we can produce an additional RFC  patch set
>> >> using a big chunk of code that we had in RFC V1. We are not proud of
>> >> this code since it is clunky, but it hopefully proves the point with
>> >> the uapi being the same.
>> >>
>> >> 2: And/Or are you worried about us all (the netdev community) not
>> >> agreeing on a way to implement ZC internally in the drivers and the
>> >> XDP infrastructure? This is not going to be possible to finish during
>> >> this cycle since we do not like the implementation we had in RFC V1.
>> >> Too intrusive and now we also have nicer abstractions from Jesper that
>> >> we can use and extend to provide a (hopefully) much cleaner and less
>> >> intrusive solution.
>> >
>> > short answer: both.
>> >
>> > Cleanliness and performance of the ZC code is not as important as
>> > getting API right. The main concern that during ZC review process
>> > we will find out that existing API has issues, so we have to
>> > do this exercise before the merge window.
>> > And RFC won't fly. Send the patches for real. They have to go
>> > through the proper code review. The hackers of netdev community
>> > can accept a partial, or a bit unclean, or slightly inefficient
>> > implementation, since it can be and will be improved later,
>> > but API we cannot change once it goes into official release.
>> >
>> > Here is the example of API concern:
>> > this patch set added shared umem concept. It sounds good in theory,
>> > but will it perform well with ZC ? Earlier RFCs didn't have that
>> > feature. If it won't perform well, then it shouldn't be in the tree.
>> > The key reason to let AF_XDP into the tree is its performance promise.
>> > If it doesn't perform we should rip it out and redesign.
>>
>> That is a fair point. We will try to produce patch sets for zero-copy
>> RX and TX using the latest interfaces within this merge window. Just
>> note that we will focus on this for the next week(s) instead of the
>> review items that you and Daniel Borkmann submitted. If we get those
>> patch sets out in time and we agree that they are a possible way
>> forward, then we produce patches with your fixes. It was mainly small
>> items, so should be quick.
>
> I would like to see that you create a new xdp_mem_type for this new
> zero-copy type. This will allow other XDP redirect methods/types (e.g.
> devmap and cpumap) to react appropriately when receiving a zero-copy
> frame.
>

Yes, that's the plan!

> For devmap, I'm hoping we can allow/support using the ndo_xdp_xmit call
> without (first) copying (into a newly allocated page).  By arguing that
> if an xsk-userspace app modify a frame it's not allowed to, then it is
> simply a bug in the program. (Note, this would also allow using
> ndo_xdp_xmit call for TX from xsk-userspace).
>

Makes sense. I think the ZC rationale for Rx can indeed be extended to
devmap redirects -- i.e. no frame cloning is required.

> For cpumap, it is hard to avoid a copy, but I'm hoping we could delay
> the copy (and alloc of mem dest area) until on the remote CPU.  This is
> already the principle of cpumap; of moving the allocation of the SKB to
> the remote CPU.
>

I think that for most AF_XDP applications that would like to pass frames
to the kernel, the cpumap would be preferred over XDP_PASS (moving the
stack execution to another, off-AF_XDP thread).

> For ZC to interact with XDP redirect-core and return API, the zero-copy
> memory type/allocator, need to provide an area for the xdp_frame data
> to be stored in (as we cannot allow using top-of-frame like
> non-zero-copy variants), and extend xdp_frame with an ZC umem-id.
> I imagine we can avoid any dynamic allocations, as we upfront (at bind
> and XDP_UMEM_REG time) know the number of frames.  (e.g. pre-alloc in
> xdp_umem_reg() call, and have xdp_umem_get_xdp_frame lookup func).
>

Yeah, we can allocate a kernel-side-only xdp_frame for each umem frame.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-03 23:38   ` Alexei Starovoitov
  2018-05-04 11:22     ` Magnus Karlsson
@ 2018-05-17  6:46     ` Björn Töpel
  2018-05-18  3:38       ` Alexei Starovoitov
  1 sibling, 1 reply; 47+ messages in thread
From: Björn Töpel @ 2018-05-17  6:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Michael S. Tsirkin,
	Netdev, Björn Töpel, michael.lundkvist, Brandeburg,
	Jesse, Singhai, Anjali, Zhang, Qi Z

2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>> > From: Björn Töpel <bjorn.topel@intel.com>
>> >
>> > This patch set introduces a new address family called AF_XDP that is
>> > optimized for high performance packet processing and, in upcoming
>> > patch sets, zero-copy semantics. In this patch set, we have removed
>> > all zero-copy related code in order to make it smaller, simpler and
>> > hopefully more review friendly. This patch set only supports copy-mode
>> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> > driver changes that Jesper Dangaard Brouer is working on. Some of his
>> > work has already been accepted. We will publish our zero-copy support
>> > for RX and TX on top of his patch sets at a later point in time.
>>
>> +1, would be great to see it land this cycle. Saw few minor nits here
>> and there but nothing to hold it up, for the series:
>>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>> Thanks everyone!
>
> Great stuff!
>
> Applied to bpf-next, with one condition.
> Upcoming zero-copy patches for both RX and TX need to be posted
> and reviewed within this release window.
> If netdev community as a whole won't be able to agree on the zero-copy
> bits we'd need to revert this feature before the next merge window.
>
> Few other minor nits:
> patch 3:
> +struct xdp_ring {
> +       __u32 producer __attribute__((aligned(64)));
> +       __u32 consumer __attribute__((aligned(64)));
> +};
> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>

Hmm, I need some guidance on what a sane uapi variant would be. We
can't have the uapi depend on the kernel build. ARM64, e.g., can have
both 64B and 128B cache lines according to the specs. Contemporary IA
processors have 64B.

The simplest, and maybe most future-proof, would be 128B aligned for
all. Another is having 128B for ARM and 64B for all IA. A third option
is having a hand-shaking API (I think virtio has that) for determining
the cache line size, but I'd rather not go down that route.

Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
would look like?

> patch 5:
> +struct sockaddr_xdp {
> +       __u16 sxdp_family;
> +       __u32 sxdp_ifindex;
> Not great to have a hole in uapi struct. Please fix it in the follow up.
>
> patch 7:
> Has a lot of synchronize_net(). I think the update/delete side
> can be improved to avoid them. Otherwise users may unknowingly DoS.
>
> As the next steps I suggest to prioritize the highest to ship
> zero-copy rx/tx patches and to add selftests.
>
> Thanks!
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-17  6:46     ` Björn Töpel
@ 2018-05-18  3:38       ` Alexei Starovoitov
  2018-05-18 13:43         ` Daniel Borkmann
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2018-05-18  3:38 UTC (permalink / raw)
  To: Björn Töpel, Alexei Starovoitov
  Cc: Daniel Borkmann, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Jesper Dangaard Brouer,
	Willem de Bruijn, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On 5/16/18 11:46 PM, Björn Töpel wrote:
> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> This patch set introduces a new address family called AF_XDP that is
>>>> optimized for high performance packet processing and, in upcoming
>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>> all zero-copy related code in order to make it smaller, simpler and
>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>> work has already been accepted. We will publish our zero-copy support
>>>> for RX and TX on top of his patch sets at a later point in time.
>>>
>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>> and there but nothing to hold it up, for the series:
>>>
>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>
>>> Thanks everyone!
>>
>> Great stuff!
>>
>> Applied to bpf-next, with one condition.
>> Upcoming zero-copy patches for both RX and TX need to be posted
>> and reviewed within this release window.
>> If netdev community as a whole won't be able to agree on the zero-copy
>> bits we'd need to revert this feature before the next merge window.
>>
>> Few other minor nits:
>> patch 3:
>> +struct xdp_ring {
>> +       __u32 producer __attribute__((aligned(64)));
>> +       __u32 consumer __attribute__((aligned(64)));
>> +};
>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>
>
> Hmm, I need some guidance on what a sane uapi variant would be. We
> can't have the uapi depend on the kernel build. ARM64, e.g., can have
> both 64B and 128B according to the specs. Contemporary IA processors
> have 64B.
>
> The simplest, and maybe most future-proof, would be 128B aligned for
> all. Another is having 128B for ARM and 64B for all IA. A third option
> is having a hand-shaking API (I think virtio has that) for determining
> the cache line size, but I'd rather not go down that route.
>
> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
> would look like?

I suspect the i40e+arm combination wasn't tested anyway.
The api may also have endianness issues on something like sparc.
I think the way to be backwards compatible in this area
is to make the api usable on x86 only, by adding this
to include/uapi/linux/if_xdp.h:
#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#else
#error "AF_XDP support is not yet available for this architecture"
#endif
and doing:
     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));

And progressively add to this for arm64 and a few other archs.
Eventually removing #error and adding some generic define
that's good enough for the long tail of architectures that
we really cannot test.
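
To make that progression concrete, here is a sketch of what the header
could grow into once another arch has been validated; the arm64 value
and the generic fallback below are assumptions for illustration, not
settled numbers:

/* include/uapi/linux/if_xdp.h, hypothetical later revision */
#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#elif defined(__aarch64__)
/* assumed: 128 covers both the 64B and 128B implementations */
#define AF_XDP_CACHE_BYTES 128
#else
/* assumed generic fallback, once the #error can be dropped */
#define AF_XDP_CACHE_BYTES 64
#endif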


* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-18  3:38       ` Alexei Starovoitov
@ 2018-05-18 13:43         ` Daniel Borkmann
  2018-05-18 15:18           ` Björn Töpel
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel Borkmann @ 2018-05-18 13:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Björn Töpel, Alexei Starovoitov
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Jesper Dangaard Brouer, Willem de Bruijn,
	Michael S. Tsirkin, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali, Zhang,
	Qi Z

On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
> On 5/16/18 11:46 PM, Björn Töpel wrote:
>> [...]
> 
> I suspect i40e+arm combination wasn't tested anyway.
> The api may have endianness issues too on something like sparc.
> I think the way to be backwards compatible in this area
> is to make the api usable on x86 only by adding
> to include/uapi/linux/if_xdp.h
> #if defined(__x86_64__)
> #define AF_XDP_CACHE_BYTES 64
> #else
> #error "AF_XDP support is not yet available for this architecture"
> #endif
> and doing:
>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
> 
> And progressively add to this for arm64 and a few other archs.
> Eventually removing #error and adding some generic define
> that's good enough for the long tail of architectures that
> we really cannot test.

Been looking into this yesterday as well a bit, and it's a bit of a mess what
uapi headers do in this regard (though there are just a handful of such headers).
Some of the kernel uapi headers generally hard-code 64 bytes regardless of the
underlying arch. In general, the kernel does expose it to user space via sysfs
(coherency_line_size). Here's what perf does to retrieve it:

#ifdef _SC_LEVEL1_DCACHE_LINESIZE
#define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
#else
static void cache_line_size(int *cacheline_sizep)
{
        if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
                pr_debug("cannot determine cache line size");
}
#endif

The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE also seems to be
available only for x86, arm64, s390 and ppc on a cursory glance at the glibc code.
In the x86 case it retrieves the info from the cpuid insn. In order to generically
use it in combination with the header, you'd have some probe which would then
set this as a define before including the header.
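
As a sketch, such a probe could be a tiny program run at build time,
with its output fed back in as a define; the fallback value and the
-D mechanism are assumptions here:

/* probe.c: print the L1 dcache line size, to be passed back
 * to the compiler as e.g. -DXDP_CACHE_BYTES=<output> */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

	/* assume 64 if the system cannot report it */
	printf("%ld\n", sz > 0 ? sz : 64);
	return 0;
}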

Then there are projects like urcu, which do ...

#define ____cacheline_internodealigned_in_smp \
        __attribute__((__aligned__(CAA_CACHE_LINE_SIZE)))

... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128),
ppc (== 256) and sparc64 (== 256) with a generic fallback to 64.

Hmm, perhaps a combination of the two would make sense, where a known
cacheline size can still be used and we only have the fallback otherwise.
Like:

#ifndef XDP_CACHE_BYTES
# if defined(__x86_64__)
#  define XDP_CACHE_BYTES	64
# else
#  error "Please define XDP_CACHE_BYTES for this architecture!"
# endif
#endif
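
With the #ifndef guard, a build that knows its target could also just
override the value before pulling in the header; illustrative only:

/* application code, built for a part assumed to have 128B lines */
#define XDP_CACHE_BYTES 128
#include <linux/if_xdp.h>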

Too bad there's no asm uapi header, at least for the archs where it's fixed
anyway, such that not every project out there has to redefine all of this from
scratch and we could just include it (and the generic-asm one would throw
a compile error if it's not externally defined, or some such).

Cheers,
Daniel


* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-18 13:43         ` Daniel Borkmann
@ 2018-05-18 15:18           ` Björn Töpel
  2018-05-18 16:17             ` Daniel Borkmann
  0 siblings, 1 reply; 47+ messages in thread
From: Björn Töpel @ 2018-05-18 15:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Alexei Starovoitov, Karlsson, Magnus, Duyck,
	Alexander H, Alexander Duyck, John Fastabend,
	Jesper Dangaard Brouer, Willem de Bruijn, Michael S. Tsirkin,
	Netdev, Björn Töpel, michael.lundkvist, Brandeburg,
	Jesse, Singhai, Anjali, Zhang, Qi Z

2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>> [...]
>
> Been looking into this yesterday as well a bit, and it's a bit of a mess what
> uapi headers do in this regard (though there are just a handful of such headers).
> Some of the kernel uapi headers generally hard-code 64 bytes regardless of the
> underlying arch. In general, the kernel does expose it to user space via sysfs
> (coherency_line_size). Here's what perf does to retrieve it:
>
> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
> #else
> static void cache_line_size(int *cacheline_sizep)
> {
>         if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
>                 pr_debug("cannot determine cache line size");
> }
> #endif
>
> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE also seems to be
> available only for x86, arm64, s390 and ppc on a cursory glance at the glibc code.
> In the x86 case it retrieves the info from the cpuid insn. In order to generically
> use it in combination with the header, you'd have some probe which would then
> set this as a define before including the header.
>

But as a uapi we cannot depend on the L1 cache line size of whatever
is currently running on the system, right? So one option is a "one
cache line size for all flavors of an arch", e.g. for ARMv8 that
would be 128B, even though there are 64B flavors out there.

Another way would be to remove the ring structure completely, and
leave that to the user-space application to figure out. So there would
be a runtime interface (getsockopt) to probe the offsets of the head
and tail pointers after/before the mmap call, and only the descriptor
format would be exposed explicitly in if_xdp.h. Don't know if that is
too unorthodox or not...
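
As a sketch, such a probe could look roughly like this from user
space; the option name and struct layout here are made up for
illustration, not a proposed uapi:

/* hypothetical shape of the probed layout */
struct xdp_ring_offsets {
	__u64 producer;		/* offset of producer index in the mmap'ed area */
	__u64 consumer;		/* offset of consumer index in the mmap'ed area */
	__u64 desc;		/* offset of the descriptor array */
};

/* user space, before touching the rings */
struct xdp_ring_offsets off;
socklen_t optlen = sizeof(off);

getsockopt(fd, SOL_XDP, XDP_RING_OFFSETS, &off, &optlen);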

> [...]
> Too bad there's no asm uapi header at least for the archs where it's fixed
> anyway such that not every project out there has to redefine all of it from
> scratch and we could just include it (and the generic-asm one would throw
> a compile error if it's not externally defined or such).
>

I started out with adding a cache.h to the arch/XXX/include/uapi, but
then realized that this didn't really work. :-)

> Cheers,
> Daniel


* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-18 15:18           ` Björn Töpel
@ 2018-05-18 16:17             ` Daniel Borkmann
  2018-05-18 16:32               ` Björn Töpel
  0 siblings, 1 reply; 47+ messages in thread
From: Daniel Borkmann @ 2018-05-18 16:17 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexei Starovoitov, Alexei Starovoitov, Karlsson, Magnus, Duyck,
	Alexander H, Alexander Duyck, John Fastabend,
	Jesper Dangaard Brouer, Willem de Bruijn, Michael S. Tsirkin,
	Netdev, Björn Töpel, michael.lundkvist, Brandeburg,
	Jesse, Singhai, Anjali, Zhang, Qi Z

On 05/18/2018 05:18 PM, Björn Töpel wrote:
> 2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
>> [...]
> 
> But as a uapi we cannot depend on the L1 cache line size of whatever
> is currently running on the system, right? So one option is a "one
> cache line size for all flavors of an arch", e.g. for ARMv8 that
> would be 128B, even though there are 64B flavors out there.
> 
> Another way would be to remove the ring structure completely, and
> leave that to the user-space application to figure out. So there would
> be a runtime interface (getsockopt) to probe the offsets of the head
> and tail pointers after/before the mmap call, and only the descriptor
> format would be exposed explicitly in if_xdp.h. Don't know if that is
> too unorthodox or not...

Good point, I think that would not be too unreasonable a thing to do, imho;
at least it doesn't get us into the issue discussed here in the first
place, and it would work for other archs more seamlessly than an ugly
build error or a single per-arch 'catch-all' define.

Thanks,
Daniel


* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
  2018-05-18 16:17             ` Daniel Borkmann
@ 2018-05-18 16:32               ` Björn Töpel
  0 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-18 16:32 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Alexei Starovoitov, Karlsson, Magnus, Duyck,
	Alexander H, Alexander Duyck, John Fastabend,
	Jesper Dangaard Brouer, Willem de Bruijn, Michael S. Tsirkin,
	Netdev, Björn Töpel, michael.lundkvist, Brandeburg,
	Jesse, Singhai, Anjali, Zhang, Qi Z

2018-05-18 18:17 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
> On 05/18/2018 05:18 PM, Björn Töpel wrote:
>> [...]
>
> Good point, I think that would not be too unreasonable a thing to do, imho;
> at least it doesn't get us into the issue discussed here in the first
> place, and it would work for other archs more seamlessly than an ugly
> build error or a single per-arch 'catch-all' define.
>

I'll try that route (runtime-based offset to prod/cons), and see where
it ends up!


Enjoy the weekend,
Björn

> Thanks,
> Daniel


* Re: [PATCH bpf-next v3 06/15] xsk: add Rx receive functions and poll support
  2018-05-04 12:59   ` Daniel Borkmann
@ 2018-05-22  7:42     ` Björn Töpel
  0 siblings, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-22  7:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

2018-05-04 14:59 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
[...]
>> +
>> +int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>> +{
>> +     int err;
>> +
>> +     err = __xsk_rcv(xs, xdp);
>> +     if (likely(!err))
>> +             xdp_return_buff(xdp);
>> +     else
>> +             xs->rx_dropped++;
>
> This is triggered from __bpf_tx_xdp_map() -> __xsk_map_redirect().
> Should this be percpu counter instead?
>

No, it shouldn't be percpu, but the drop count shouldn't be increased
here. It should be increased as a result of a full-queue event. Thanks
for pointing this out, I'll fix it.
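
A minimal sketch of that direction; moving the accounting into
__xsk_rcv() at the point the enqueue fails is my reading of the fix,
not the final patch:

int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
{
	int err;

	/* __xsk_rcv() bumps xs->rx_dropped itself, only when the
	 * enqueue actually fails (RX queue full), instead of on
	 * every error path here */
	err = __xsk_rcv(xs, xdp);
	if (likely(!err))
		xdp_return_buff(xdp);

	return err;
}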


Björn

[...]


* Re: [PATCH bpf-next v3 01/15] net: initial AF_XDP skeleton
  2018-05-02 11:01 ` [PATCH bpf-next v3 01/15] net: initial AF_XDP skeleton Björn Töpel
@ 2018-05-23 22:50   ` Stephen Hemminger
  2018-05-24  6:38     ` Björn Töpel
  2018-05-24 17:57     ` Alexei Starovoitov
  0 siblings, 2 replies; 47+ messages in thread
From: Stephen Hemminger @ 2018-05-23 22:50 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang

On Wed,  2 May 2018 13:01:22 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
> new file mode 100644
> index 000000000000..90e4a7152854
> --- /dev/null
> +++ b/net/xdp/Kconfig
> @@ -0,0 +1,7 @@
> +config XDP_SOCKETS
> +	bool "XDP sockets"
> +	depends on BPF_SYSCALL
> +	default n
> +	help
> +	  XDP sockets allows a channel between XDP programs and
> +	  userspace applications.

Why is XDP not supported as a module?
Most distributions will want it to be a module so that it is not loaded
unless used, and AF_XDP could also be disabled by blacklisting the module.


* Re: [PATCH bpf-next v3 01/15] net: initial AF_XDP skeleton
  2018-05-23 22:50   ` Stephen Hemminger
@ 2018-05-24  6:38     ` Björn Töpel
  2018-05-24 17:57     ` Alexei Starovoitov
  1 sibling, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-05-24  6:38 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

2018-05-24 0:50 GMT+02:00 Stephen Hemminger <stephen@networkplumber.org>:
> On Wed,  2 May 2018 13:01:22 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
>> new file mode 100644
>> index 000000000000..90e4a7152854
>> --- /dev/null
>> +++ b/net/xdp/Kconfig
>> @@ -0,0 +1,7 @@
>> +config XDP_SOCKETS
>> +     bool "XDP sockets"
>> +     depends on BPF_SYSCALL
>> +     default n
>> +     help
>> +       XDP sockets allows a channel between XDP programs and
>> +       userspace applications.
>
> Why is XDP not supported as a module?
> Most distributions will want it to be a module so that it is not loaded
> unless used, and AF_XDP could also be disabled by blacklisting the module.

Yes, all good points, and The Grand Plan is adding module support.
Unfortunately, it's not there yet.


* Re: [PATCH bpf-next v3 01/15] net: initial AF_XDP skeleton
  2018-05-23 22:50   ` Stephen Hemminger
  2018-05-24  6:38     ` Björn Töpel
@ 2018-05-24 17:57     ` Alexei Starovoitov
  1 sibling, 0 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2018-05-24 17:57 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev,
	Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

On Wed, May 23, 2018 at 03:50:47PM -0700, Stephen Hemminger wrote:
> Most distributions will want it to be a module so that it is not loaded
> unless used, and AF_XDP could also be disabled by blacklisting the module.

I think the opposite will be the case. Anyone who cares about performance
would want the AF_XDP code to be builtin, since builtin gives additional
performance over a module. All our NIC drivers are builtin, since we see
noticeable perf gains on production workloads.
Hence I'd rather see us spending time on improving AF_XDP instead
of making it a module and forever struggling with maintaining it as one.

More so I think it's time to get rid of IPV6=m for good. The kernel
is full of ugly hacks and performance degradation due to indirect calls
just because IPV6=m is still supported.
Folks that care about vmlinux size should be using kconfig to compile it out.


* Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-05-02 11:01 ` [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
@ 2018-10-08 15:31   ` Eric Dumazet
  2018-10-08 16:05     ` Björn Töpel
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2018-10-08 15:31 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang



On 05/02/2018 04:01 AM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The xskmap is yet another BPF map, very much inspired by
> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
> 
> Note that a socket that is bound to certain ifindex/queue index will
> *only* accept XDP frames from that netdev/queue index. If an XDP
> program tries to redirect from a netdev/queue index other than what
> the socket is bound to, the frame will not be received on the socket.
> 
> A socket can reside in multiple maps.
> 
> v3: Fixed race and simplified code.
> v2: Removed one indirection in map lookup.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/linux/bpf.h       |  25 +++++
>  include/linux/bpf_types.h |   3 +
>  include/net/xdp_sock.h    |   7 ++
>  include/uapi/linux/bpf.h  |   1 +
>  kernel/bpf/Makefile       |   3 +
>  kernel/bpf/verifier.c     |   8 +-
>  kernel/bpf/xskmap.c       | 239 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk.c             |   5 +
>  8 files changed, 289 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/xskmap.c
> 

This function is called under rcu_read_lock(), from map_update_elem()

> +
> +static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> +			       u64 map_flags)
> +{
> +	struct xsk_map *m = container_of(map, struct xsk_map, map);
> +	u32 i = *(u32 *)key, fd = *(u32 *)value;
> +	struct xdp_sock *xs, *old_xs;
> +	struct socket *sock;
> +	int err;
> +
> +	if (unlikely(map_flags > BPF_EXIST))
> +		return -EINVAL;
> +	if (unlikely(i >= m->map.max_entries))
> +		return -E2BIG;
> +	if (unlikely(map_flags == BPF_NOEXIST))
> +		return -EEXIST;
> +
> +	sock = sockfd_lookup(fd, &err);
> +	if (!sock)
> +		return err;
> +
> +	if (sock->sk->sk_family != PF_XDP) {
> +		sockfd_put(sock);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	xs = (struct xdp_sock *)sock->sk;
> +
> +	if (!xsk_is_setup_for_bpf_map(xs)) {
> +		sockfd_put(sock);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	sock_hold(sock->sk);
> +
> +	old_xs = xchg(&m->xsk_map[i], xs);
> +	if (old_xs) {
> +		/* Make sure we've flushed everything. */

So it is illegal to call synchronize_net(), since it is a reschedule point.

> +		synchronize_net();
> +		sock_put((struct sock *)old_xs);
> +	}
> +
> +	sockfd_put(sock);
> +	return 0;
> +}
> 
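
For context: synchronize_net() blocks until an RCU grace period has
elapsed, so calling it from a read-side critical section both sleeps
where sleeping is forbidden and waits on readers that include the
caller itself. Schematically (a sketch, not code from the patch):

rcu_read_lock();
/* ... */
synchronize_net();	/* waits for all RCU readers -- including us */
/* ... */
rcu_read_unlock();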


* Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-10-08 15:31   ` Eric Dumazet
@ 2018-10-08 16:05     ` Björn Töpel
  2018-10-08 16:52       ` Björn Töpel
  2018-10-08 16:55       ` Eric Dumazet
  0 siblings, 2 replies; 47+ messages in thread
From: Björn Töpel @ 2018-10-08 16:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, 8 Oct 2018 at 17:31, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 05/02/2018 04:01 AM, Björn Töpel wrote:
> > [...]
>
> This function is called under rcu_read_lock(), from map_update_elem()
>
> > [...]
> > +     sock_hold(sock->sk);
> > +
> > +     old_xs = xchg(&m->xsk_map[i], xs);
> > +     if (old_xs) {
> > +             /* Make sure we've flushed everything. */
>
> So it is illegal to call synchronize_net(), since it is a reschedule point.
>

Thanks for finding and pointing this out, Eric!

I'll have look and get back with a patch.


Björn


> > +             synchronize_net();
> > +             sock_put((struct sock *)old_xs);
> > +     }
> > +
> > +     sockfd_put(sock);
> > +     return 0;
> > +}
> >


* Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-10-08 16:05     ` Björn Töpel
@ 2018-10-08 16:52       ` Björn Töpel
  2018-10-08 16:55       ` Eric Dumazet
  1 sibling, 0 replies; 47+ messages in thread
From: Björn Töpel @ 2018-10-08 16:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, 8 Oct 2018 at 18:05, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> On Mon, 8 Oct 2018 at 17:31, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
[...]
> > So it is illegal to call synchronize_net(), since it is a reschedule point.
> >
>
> Thanks for finding and pointing this out, Eric!
>
> I'll have look and get back with a patch.
>

Eric, something along the lines of the patch below? Or is it considered
bad practice to use call_rcu() in this context (prone to DoSing the
kernel)?

Thanks for spending time on the xskmap code. Very much appreciated!

From 491f7bd87705f72c45e59242fc6c3b1db9d3b56d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= <bjorn.topel@intel.com>
Date: Mon, 8 Oct 2018 18:34:11 +0200
Subject: [PATCH] xsk: do not call synchronize_net() under RCU read lock
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

XSKMAP update and delete functions called synchronize_net(), which can
sleep. It is not allowed to sleep during an RCU read section.

Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  1 +
 kernel/bpf/xskmap.c    | 21 +++++++++++----------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 13acb9803a6d..5b430141a3f6 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -68,6 +68,7 @@ struct xdp_sock {
      */
     spinlock_t tx_completion_lock;
     u64 rx_dropped;
+    struct rcu_head rcu;
 };

 struct xdp_buff;
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..51e8e2785612 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -157,6 +157,13 @@ static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
     return NULL;
 }

+static void __xsk_map_remove_async(struct rcu_head *rcu)
+{
+    struct xdp_sock *xs = container_of(rcu, struct xdp_sock, rcu);
+
+    sock_put((struct sock *)xs);
+}
+
 static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
                    u64 map_flags)
 {
@@ -192,11 +199,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
     sock_hold(sock->sk);

     old_xs = xchg(&m->xsk_map[i], xs);
-    if (old_xs) {
-        /* Make sure we've flushed everything. */
-        synchronize_net();
-        sock_put((struct sock *)old_xs);
-    }
+    if (old_xs)
+        call_rcu(&old_xs->rcu, __xsk_map_remove_async);

     sockfd_put(sock);
     return 0;
@@ -212,11 +216,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key)
         return -EINVAL;

     old_xs = xchg(&m->xsk_map[k], NULL);
-    if (old_xs) {
-        /* Make sure we've flushed everything. */
-        synchronize_net();
-        sock_put((struct sock *)old_xs);
-    }
+    if (old_xs)
+        call_rcu(&old_xs->rcu, __xsk_map_remove_async);

     return 0;
 }
-- 
2.17.1


* Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-10-08 16:05     ` Björn Töpel
  2018-10-08 16:52       ` Björn Töpel
@ 2018-10-08 16:55       ` Eric Dumazet
  2018-10-08 17:04         ` Björn Töpel
  1 sibling, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2018-10-08 16:55 UTC (permalink / raw)
  To: Björn Töpel, Eric Dumazet
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z



On 10/08/2018 09:05 AM, Björn Töpel wrote:

> 
> Thanks for finding and pointing this out, Eric!
> 
> I'll have look and get back with a patch.
> 
>

You might take a look at the SOCK_RCU_FREE flag for sockets.
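
Roughly, what the flag buys (paraphrasing the sk_destruct() logic in
net/core/sock.c):

/* with SOCK_RCU_FREE set, the final free of the sock is deferred
 * past an RCU grace period, so rcu_read_lock() readers can never
 * see a freed socket */
if (sock_flag(sk, SOCK_RCU_FREE))
	call_rcu(&sk->sk_rcu, __sk_destruct);
else
	__sk_destruct(&sk->sk_rcu);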


* Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  2018-10-08 16:55       ` Eric Dumazet
@ 2018-10-08 17:04         ` Björn Töpel
  2018-10-08 17:40           ` [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock Björn Töpel
  0 siblings, 1 reply; 47+ messages in thread
From: Björn Töpel @ 2018-10-08 17:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Michael S. Tsirkin, Netdev,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z

On Mon, 8 Oct 2018 at 18:55, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
[...]
>
> You might take a look at SOCK_RCU_FREE flag for sockets.
>

Ah, thanks! I'll use this instead.


* [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock
  2018-10-08 17:04         ` Björn Töpel
@ 2018-10-08 17:40           ` Björn Töpel
  2018-10-09  0:30             ` Song Liu
  2018-10-11  8:22             ` Daniel Borkmann
  0 siblings, 2 replies; 47+ messages in thread
From: Björn Töpel @ 2018-10-08 17:40 UTC (permalink / raw)
  To: ast, daniel, netdev, eric.dumazet
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson

From: Björn Töpel <bjorn.topel@intel.com>

The XSKMAP update and delete functions called synchronize_net(), which
can sleep. It is not allowed to sleep during an RCU read section.

Instead we need to make sure that the sock sk_destruct (xsk_destruct)
function is asynchronously called after an RCU grace period. Setting
the SOCK_RCU_FREE flag for XDP sockets takes care of this.

Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 kernel/bpf/xskmap.c | 10 ++--------
 net/xdp/xsk.c       |  2 ++
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..47147c9e184d 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -192,11 +192,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 	sock_hold(sock->sk);
 
 	old_xs = xchg(&m->xsk_map[i], xs);
-	if (old_xs) {
-		/* Make sure we've flushed everything. */
-		synchronize_net();
+	if (old_xs)
 		sock_put((struct sock *)old_xs);
-	}
 
 	sockfd_put(sock);
 	return 0;
@@ -212,11 +209,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 		return -EINVAL;
 
 	old_xs = xchg(&m->xsk_map[k], NULL);
-	if (old_xs) {
-		/* Make sure we've flushed everything. */
-		synchronize_net();
+	if (old_xs)
 		sock_put((struct sock *)old_xs);
-	}
 
 	return 0;
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0577cd49aa72..07156f43d295 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -754,6 +754,8 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 	sk->sk_destruct = xsk_destruct;
 	sk_refcnt_debug_inc(sk);
 
+	sock_set_flag(sk, SOCK_RCU_FREE);
+
 	xs = xdp_sk(sk);
 	mutex_init(&xs->mutex);
 	spin_lock_init(&xs->tx_completion_lock);
-- 
2.17.1


* Re: [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock
  2018-10-08 17:40           ` [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock Björn Töpel
@ 2018-10-09  0:30             ` Song Liu
  2018-10-11  8:22             ` Daniel Borkmann
  1 sibling, 0 replies; 47+ messages in thread
From: Song Liu @ 2018-10-09  0:30 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, eric.dumazet,
	Björn Töpel, Magnus Karlsson, Magnus Karlsson

On Mon, Oct 8, 2018 at 10:41 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The XSKMAP update and delete functions called synchronize_net(), which
> can sleep. It is not allowed to sleep during an RCU read section.
>
> Instead we need to make sure that the sock sk_destruct (xsk_destruct)
> function is asynchronously called after an RCU grace period. Setting
> the SOCK_RCU_FREE flag for XDP sockets takes care of this.
>
> Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>

> [...]


* Re: [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock
  2018-10-08 17:40           ` [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock Björn Töpel
  2018-10-09  0:30             ` Song Liu
@ 2018-10-11  8:22             ` Daniel Borkmann
  1 sibling, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-10-11  8:22 UTC (permalink / raw)
  To: Björn Töpel, ast, netdev, eric.dumazet
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson

On 10/08/2018 07:40 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The XSKMAP update and delete functions called synchronize_net(), which
> can sleep. It is not allowed to sleep during an RCU read section.
> 
> Instead we need to make sure that the sock sk_destruct (xsk_destruct)
> function is asynchronously called after an RCU grace period. Setting
> the SOCK_RCU_FREE flag for XDP sockets takes care of this.
> 
> Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Applied to bpf, thanks everyone!


end of thread

Thread overview: 47+ messages
2018-05-02 11:01 [PATCH bpf-next v3 00/15] Introducing AF_XDP support Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 01/15] net: initial AF_XDP skeleton Björn Töpel
2018-05-23 22:50   ` Stephen Hemminger
2018-05-24  6:38     ` Björn Töpel
2018-05-24 17:57     ` Alexei Starovoitov
2018-05-02 11:01 ` [PATCH bpf-next v3 02/15] xsk: add user memory registration support sockopt Björn Töpel
2018-05-04 12:34   ` Daniel Borkmann
2018-05-02 11:01 ` [PATCH bpf-next v3 03/15] xsk: add umem fill queue support and mmap Björn Töpel
2018-05-04 12:49   ` Daniel Borkmann
2018-05-02 11:01 ` [PATCH bpf-next v3 04/15] xsk: add Rx queue setup and mmap support Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 05/15] xsk: add support for bind for Rx Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 06/15] xsk: add Rx receive functions and poll support Björn Töpel
2018-05-04 12:59   ` Daniel Borkmann
2018-05-22  7:42     ` Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP Björn Töpel
2018-10-08 15:31   ` Eric Dumazet
2018-10-08 16:05     ` Björn Töpel
2018-10-08 16:52       ` Björn Töpel
2018-10-08 16:55       ` Eric Dumazet
2018-10-08 17:04         ` Björn Töpel
2018-10-08 17:40           ` [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock Björn Töpel
2018-10-09  0:30             ` Song Liu
2018-10-11  8:22             ` Daniel Borkmann
2018-05-02 11:01 ` [PATCH bpf-next v3 08/15] xsk: wire up XDP_DRV side of AF_XDP Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 09/15] xsk: wire up XDP_SKB " Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 10/15] xsk: add umem completion queue support and mmap Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 11/15] xsk: add Tx queue setup and mmap support Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 12/15] dev: packet: make packet_direct_xmit a common function Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 13/15] xsk: support for Tx Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 14/15] xsk: statistics support Björn Töpel
2018-05-02 11:01 ` [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets Björn Töpel
2018-05-02 20:59   ` Jesper Dangaard Brouer
2018-05-03 13:55 ` [PATCH bpf-next v3 00/15] Introducing AF_XDP support Willem de Bruijn
2018-05-03 15:07 ` David Miller
2018-05-03 22:49 ` Daniel Borkmann
2018-05-03 23:38   ` Alexei Starovoitov
2018-05-04 11:22     ` Magnus Karlsson
2018-05-05  0:34       ` Alexei Starovoitov
2018-05-07  9:13         ` Magnus Karlsson
2018-05-07 13:09           ` Jesper Dangaard Brouer
2018-05-07 19:47             ` Björn Töpel
2018-05-17  6:46     ` Björn Töpel
2018-05-18  3:38       ` Alexei Starovoitov
2018-05-18 13:43         ` Daniel Borkmann
2018-05-18 15:18           ` Björn Töpel
2018-05-18 16:17             ` Daniel Borkmann
2018-05-18 16:32               ` Björn Töpel
