From: Jason Wang
Subject: Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
Date: Tue, 24 Apr 2018 10:29:00 +0800
Message-ID: <3165e013-fab9-a0a2-2048-6d7aac0bd85e@redhat.com>
In-Reply-To: <20180423135619.7179-1-bjorn.topel@gmail.com>
References: <20180423135619.7179-1-bjorn.topel@gmail.com>
To: Björn Töpel, magnus.karlsson@intel.com, alexander.h.duyck@intel.com,
    alexander.duyck@gmail.com, john.fastabend@gmail.com, ast@fb.com,
    brouer@redhat.com, willemdebruijn.kernel@gmail.com, daniel@iogearbox.net,
    mst@redhat.com, netdev@vger.kernel.org
Cc: Björn Töpel, michael.lundkvist@ericsson.com, jesse.brandeburg@intel.com,
    anjali.singhai@intel.com, qi.z.zhang@intel.com

On 2018-04-23 21:56, Björn Töpel wrote:
> From: Björn Töpel
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
> RX using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and
> the TX queue. A socket can receive packets on the RX queue and it can
> send packets on the TX queue. These queues are registered and sized
> with the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3, these descriptor queues are separated
> from the packet buffers. An RX or TX descriptor points to a data
> buffer in a memory area called a UMEM. RX and TX can share the same
> UMEM so that a packet does not have to be copied between RX and TX.
> Moreover, if a packet needs to be kept for a while due to a possible
> retransmit, the descriptor that points to that packet can be changed
> to point to another one and reused right away. This again avoids
> copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists
> of a number of equally sized frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id. User space allocates memory for this UMEM
> using whatever means it finds most appropriate (malloc, mmap, huge
> pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The FILL queue is used by
> the application to send down frame ids for the kernel to fill in with
> RX packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The COMPLETION queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and that can now be used again by user space, for either
> TX or RX. Thus, the frame ids appearing in the COMPLETION queue are
> ids that were previously transmitted using the TX queue. In summary,
> the RX and FILL queues are used for the RX path and the TX and
> COMPLETION queues are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user space.
>
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would
> like to share the UMEM with as well as its own newly created XSK
> socket. The new process will then receive frame id references in its
> own RX queue that point to this shared UMEM. Note that since the
> queue structures are single-consumer / single-producer (for
> performance reasons), the new process has to create its own socket
> with associated RX and TX queues, since it cannot share these with
> the other process. This is also the reason that there is only one set
> of FILL and COMPLETION queues per UMEM. It is the responsibility of a
> single process to handle the UMEM. If multiple-producer /
> multiple-consumer queues are implemented in the future, this
> requirement could be relaxed.
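
To make the setup flow concrete for anyone reading along without the
patches applied, user-space initialization would look roughly like the
sketch below. This is only my reading of the new uapi in
include/uapi/linux/if_xdp.h from this series: the AF_XDP/SOL_XDP
fallback values, the xdp_umem_reg and sockaddr_xdp field names and the
fill/completion setsockopt names are assumptions on my side and may
differ in this revision.

    #include <stdint.h>
    #include <net/if.h>
    #include <sys/socket.h>
    #include <linux/if_xdp.h>   /* new uapi header added by this series */

    #ifndef AF_XDP
    #define AF_XDP  44          /* assumed value, see include/linux/socket.h */
    #endif
    #ifndef SOL_XDP
    #define SOL_XDP 283         /* assumed value */
    #endif

    #define NUM_FRAMES 1024
    #define FRAME_SIZE 2048
    #define NUM_DESCS   512

    /* umem_area: NUM_FRAMES * FRAME_SIZE bytes allocated by the caller
     * (malloc, mmap, huge pages, ...), as described in the cover letter. */
    static int xsk_create(void *umem_area, const char *ifname, int queue_id)
    {
            int fd = socket(AF_XDP, SOCK_RAW, 0);

            /* Register the UMEM: equally sized frames addressed by frame id. */
            struct xdp_umem_reg ureg = {
                    .addr = (uint64_t)(uintptr_t)umem_area,
                    .len  = (uint64_t)NUM_FRAMES * FRAME_SIZE,
                    .frame_size = FRAME_SIZE,      /* field names may differ */
                    .frame_headroom = 0,
            };
            setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &ureg, sizeof(ureg));

            /* Size the four rings: FILL/COMPLETION belong to the UMEM,
             * RX/TX to the socket. Each ring is afterwards mmap()ed at its
             * page offset to get at the descriptors. */
            int descs = NUM_DESCS;
            setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &descs, sizeof(descs));
            setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &descs, sizeof(descs));
            setsockopt(fd, SOL_XDP, XDP_RX_RING, &descs, sizeof(descs));
            setsockopt(fd, SOL_XDP, XDP_TX_RING, &descs, sizeof(descs));

            /* Bind to a device/queue pair; traffic only flows after this.
             * A process sharing an existing UMEM would skip the UMEM and
             * fill/completion setup above, set a shared-umem flag in
             * sxdp_flags and pass the first socket's fd in
             * sxdp_shared_umem_fd instead. */
            struct sockaddr_xdp sxdp = {
                    .sxdp_family = AF_XDP,
                    .sxdp_ifindex = if_nametoindex(ifname),
                    .sxdp_queue_id = queue_id,
            };
            bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

            return fd;          /* error handling omitted for brevity */
    }
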
> How are packets then distributed between these two XSKs? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map, and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently
> mandatory to have an XDP program loaded (and one XSK in the XSKMAP)
> to be able to get any traffic to user space through the XSK.
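
To make the XSKMAP part concrete: the XDP program that steers packets
into the socket can be as small as the sketch below, which redirects
every packet on a given receive queue to whatever XSK user space has
placed at that queue's index. This is my own sketch along the lines of
what samples/bpf/xdpsock_kern.c presumably does, not a copy of it; the
map-definition and SEC() conventions are the ones used by the existing
samples/bpf programs.

    #include <linux/bpf.h>
    #include "bpf_helpers.h"  /* samples/bpf helper header: SEC(), bpf_map_def */

    /* One slot per hardware queue; user space stores its XSK fd at the
     * index matching the queue it bound to. */
    struct bpf_map_def SEC("maps") xsks_map = {
            .type        = BPF_MAP_TYPE_XSKMAP,
            .key_size    = sizeof(int),
            .value_size  = sizeof(int),
            .max_entries = 64,
    };

    SEC("xdp_sock")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
            /* Redirect to the XSK bound to this rx queue, if any. If the
             * slot is empty or the socket is bound elsewhere, the packet
             * is dropped, as described in the cover letter. */
            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
    }

    char _license[] SEC("license") = "GPL";

User space would then attach this program to the device and store the
XSK file descriptor at the right index (16 in the ethtool example
below) with bpf_map_update_elem().
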
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If
> the driver does not have support for XDP, or if XDP_SKB is explicitly
> chosen when loading the XDP program, XDP_SKB mode is employed. It
> uses SKBs together with the generic XDP support and copies the data
> out to user space; a fallback mode that works for any network device.
> On the other hand, if the driver has support for XDP, it will be used
> by the AF_XDP code to provide better performance, but there is still
> a copy of the data into user space.
>
> There is an xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, the queue we will enable AF_XDP on. Here, we use
> ethtool for this:
>
>    ethtool -N p3p2 rx-flow-hash udp4 fn
>    ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>        action 16
>
> Running the rxdrop benchmark in XDP_DRV mode can then be done using:
>
>    samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two
> Broadwell E5 2660 @ 2.0 GHz with hyperthreading turned off. Each
> socket has 14 cores which gives a total of 28, but only two cores are
> used in these experiments: one for TX/RX and one for the user space
> application. The memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is
> 8192 MB, and with 8 of those DIMMs in the system we have 64 GB of
> total memory. The compiler used is gcc version 5.4.0 20160609. The
> NIC is an Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW
> that is generating packets at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from RFC V2 in parentheses.
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      2.9(3.0)   9.4(9.3)
> txpush      2.5(2.2)   NA*
> l2fwd       1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)

This number does not look very exciting. I can get ~3 Mpps when using
testpmd in a guest with xdp_redirect.sh on the host between ixgbe and
TAP/vhost. I believe we can get even better performance without virt.
It would be interesting to compare this performance with e.g. testpmd
+ virtio_user (vhost_kernel) + XDP.

> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      2.1(2.2)   3.3(3.1)
> l2fwd       1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this RFC. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
>
> Changes from RFC V2:
>
> * Optimizations and simplifications to the ring structures inspired
>   by ptr_ring.h
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>   consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
>
> The structure of the patch set is as follows:
>
> Patches 1-2: Basic socket and umem plumbing
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
>
> Questions:
>
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
>
> To do:
>
> * Optimize performance
>
> * Kernel selftest
>
> Post-series plan:
>
> * Loadable kernel module support of AF_XDP would be nice. Unclear how
>   to achieve this, though, since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to the user space
>   socket.

I think we probably need this in the case of TUN XDP for virt guests
too.

Thanks

> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
>
> * And of course getting to zero-copy support in small increments.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (8):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xdp: introduce xdp_return_buff API
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (7):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application for AF_XDP sockets
>
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  46 ++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  34 +-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>  net/xdp/xdp_umem.h                  |  65 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  32 files changed, 2945 insertions(+), 35 deletions(-)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
>