From: Jason Wang
Subject: Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
Date: Tue, 24 Apr 2018 10:29:00 +0800
Message-ID: <3165e013-fab9-a0a2-2048-6d7aac0bd85e@redhat.com>
In-Reply-To: <20180423135619.7179-1-bjorn.topel@gmail.com>
References: <20180423135619.7179-1-bjorn.topel@gmail.com>
To: Björn Töpel, magnus.karlsson@intel.com, alexander.h.duyck@intel.com,
    alexander.duyck@gmail.com, john.fastabend@gmail.com, ast@fb.com,
    brouer@redhat.com, willemdebruijn.kernel@gmail.com, daniel@iogearbox.net,
    mst@redhat.com, netdev@vger.kernel.org
Cc: Björn Töpel, michael.lundkvist@ericsson.com, jesse.brandeburg@intel.com,
    anjali.singhai@intel.com, qi.z.zhang@intel.com

On 2018-04-23 21:56, Björn Töpel wrote:
> From: Björn Töpel
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
> RX using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and
> the TX queue. A socket can receive packets on the RX queue and it can
> send packets on the TX queue. These queues are registered and sized
> with the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3, these descriptor queues are separated
> from the packet buffers. An RX or TX descriptor points to a data
> buffer in a memory area called a UMEM. RX and TX can share the same
> UMEM so that a packet does not have to be copied between RX and TX.
> Moreover, if a packet needs to be kept for a while due to a possible
> retransmit, the descriptor that points to that packet can be changed
> to point to another one and reused right away. This again avoids
> copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists
> of a number of equally sized frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id. User space allocates memory for this UMEM
> using whatever means it finds most appropriate (malloc, mmap, huge
> pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The FILL queue is used by
> the application to send down frame ids for the kernel to fill in with
> RX packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The COMPLETION queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and that can now be used again by user space, for either
> TX or RX. Thus, the frame ids appearing in the COMPLETION queue are
> ids that were previously transmitted using the TX queue. In summary,
> the RX and FILL queues are used for the RX path and the TX and
> COMPLETION queues are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user space.
>
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would
> like to share the UMEM with as well as its own newly created XSK
> socket. The new process will then receive frame id references in its
> own RX queue that point to this shared UMEM. Note that since the
> queue structures are single-consumer / single-producer (for
> performance reasons), the new process has to create its own socket
> with associated RX and TX queues, since it cannot share these with
> the other process. This is also the reason that there is only one set
> of FILL and COMPLETION queues per UMEM. It is the responsibility of a
> single process to handle the UMEM. If multiple-producer /
> multiple-consumer queues are implemented in the future, this
> requirement could be relaxed.
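
To make the setup flow concrete for anyone reading along without the
patches applied, user-space initialization would look roughly like the
sketch below. This is only my reading of the new uapi in
include/uapi/linux/if_xdp.h from this series: the AF_XDP/SOL_XDP
fallback values, the xdp_umem_reg and sockaddr_xdp field names and the
fill/completion setsockopt names are assumptions on my side and may
differ in this revision.

    #include <stdint.h>
    #include <net/if.h>
    #include <sys/socket.h>
    #include <linux/if_xdp.h>   /* new uapi header added by this series */

    #ifndef AF_XDP
    #define AF_XDP  44          /* assumed value, see include/linux/socket.h */
    #endif
    #ifndef SOL_XDP
    #define SOL_XDP 283         /* assumed value */
    #endif

    #define NUM_FRAMES 1024
    #define FRAME_SIZE 2048
    #define NUM_DESCS   512

    /* umem_area: NUM_FRAMES * FRAME_SIZE bytes allocated by the caller
     * (malloc, mmap, huge pages, ...), as described in the cover letter. */
    static int xsk_create(void *umem_area, const char *ifname, int queue_id)
    {
            int fd = socket(AF_XDP, SOCK_RAW, 0);

            /* Register the UMEM: equally sized frames addressed by frame id. */
            struct xdp_umem_reg ureg = {
                    .addr = (uint64_t)(uintptr_t)umem_area,
                    .len  = (uint64_t)NUM_FRAMES * FRAME_SIZE,
                    .frame_size = FRAME_SIZE,      /* field names may differ */
                    .frame_headroom = 0,
            };
            setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &ureg, sizeof(ureg));

            /* Size the four rings: FILL/COMPLETION belong to the UMEM,
             * RX/TX to the socket. Each ring is afterwards mmap()ed at its
             * page offset to get at the descriptors. */
            int descs = NUM_DESCS;
            setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &descs, sizeof(descs));
            setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &descs, sizeof(descs));
            setsockopt(fd, SOL_XDP, XDP_RX_RING, &descs, sizeof(descs));
            setsockopt(fd, SOL_XDP, XDP_TX_RING, &descs, sizeof(descs));

            /* Bind to a device/queue pair; traffic only flows after this.
             * A process sharing an existing UMEM would skip the UMEM and
             * fill/completion setup above, set a shared-umem flag in
             * sxdp_flags and pass the first socket's fd in
             * sxdp_shared_umem_fd instead. */
            struct sockaddr_xdp sxdp = {
                    .sxdp_family = AF_XDP,
                    .sxdp_ifindex = if_nametoindex(ifname),
                    .sxdp_queue_id = queue_id,
            };
            bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

            return fd;          /* error handling omitted for brevity */
    }
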
> How are packets then distributed between these two XSKs? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map, and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently
> mandatory to have an XDP program loaded (and one XSK in the XSKMAP)
> to be able to get any traffic to user space through the XSK.
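
To make the XSKMAP part concrete: the XDP program that steers packets
into the socket can be as small as the sketch below, which redirects
every packet on a given receive queue to whatever XSK user space has
placed at that queue's index. This is my own sketch along the lines of
what samples/bpf/xdpsock_kern.c presumably does, not a copy of it; the
map-definition and SEC() conventions are the ones used by the existing
samples/bpf programs.

    #include <linux/bpf.h>
    #include "bpf_helpers.h"  /* samples/bpf helper header: SEC(), bpf_map_def */

    /* One slot per hardware queue; user space stores its XSK fd at the
     * index matching the queue it bound to. */
    struct bpf_map_def SEC("maps") xsks_map = {
            .type        = BPF_MAP_TYPE_XSKMAP,
            .key_size    = sizeof(int),
            .value_size  = sizeof(int),
            .max_entries = 64,
    };

    SEC("xdp_sock")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
            /* Redirect to the XSK bound to this rx queue, if any. If the
             * slot is empty or the socket is bound elsewhere, the packet
             * is dropped, as described in the cover letter. */
            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
    }

    char _license[] SEC("license") = "GPL";

User space would then attach this program to the device and store the
XSK file descriptor at the right index (16 in the ethtool example
below) with bpf_map_update_elem().
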
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If
> the driver does not have support for XDP, or if XDP_SKB is explicitly
> chosen when loading the XDP program, XDP_SKB mode is employed. It
> uses SKBs together with the generic XDP support and copies the data
> out to user space; a fallback mode that works for any network device.
> On the other hand, if the driver has support for XDP, it will be used
> by the AF_XDP code to provide better performance, but there is still
> a copy of the data into user space.
>
> There is an xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, the queue we will enable AF_XDP on. Here, we use
> ethtool for this:
>
>    ethtool -N p3p2 rx-flow-hash udp4 fn
>    ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>        action 16
>
> Running the rxdrop benchmark in XDP_DRV mode can then be done using:
>
>    samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two
> Broadwell E5 2660 @ 2.0 GHz with hyperthreading turned off. Each
> socket has 14 cores which gives a total of 28, but only two cores are
> used in these experiments: one for TX/RX and one for the user space
> application. The memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is
> 8192 MB, and with 8 of those DIMMs in the system we have 64 GB of
> total memory. The compiler used is gcc version 5.4.0 20160609. The
> NIC is an Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW
> that is generating packets at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from RFC V2 in parentheses.
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      2.9(3.0)   9.4(9.3)
> txpush      2.5(2.2)   NA*
> l2fwd       1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)

This number does not look very exciting. I can get ~3 Mpps when using
testpmd in a guest with xdp_redirect.sh on the host between ixgbe and
TAP/vhost. I believe we can get even better performance without virt.
It would be interesting to compare this performance with e.g. testpmd
+ virtio_user (vhost_kernel) + XDP.

> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      2.1(2.2)   3.3(3.1)
> l2fwd       1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this RFC. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
>
> Changes from RFC V2:
>
> * Optimizations and simplifications to the ring structures inspired
>   by ptr_ring.h
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>   consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
>
> The structure of the patch set is as follows:
>
> Patches 1-2: Basic socket and umem plumbing
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
>
> Questions:
>
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
>
> To do:
>
> * Optimize performance
>
> * Kernel selftest
>
> Post-series plan:
>
> * Loadable kernel module support of AF_XDP would be nice. Unclear how
>   to achieve this, though, since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to the user space
>   socket.

I think we probably need this in the case of TUN XDP for virt guests
too.

Thanks

> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
>
> * And of course getting to zero-copy support in small increments.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (8):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xdp: introduce xdp_return_buff API
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (7):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application for AF_XDP sockets
>
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  46 ++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  34 +-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>  net/xdp/xdp_umem.h                  |  65 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  32 files changed, 2945 insertions(+), 35 deletions(-)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
>