From mboxrd@z Thu Jan 1 00:00:00 1970
From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?=
Subject: [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
Date: Tue, 15 May 2018 21:06:03 +0200
Message-ID: <20180515190615.23099-1-bjorn.topel@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?=, michael.lundkvist@ericsson.com,
 jesse.brandeburg@intel.com, anjali.singhai@intel.com,
 qi.z.zhang@intel.com, intel-wired-lan@lists.osuosl.org
To: bjorn.topel@gmail.com, magnus.karlsson@gmail.com,
 magnus.karlsson@intel.com, alexander.h.duyck@intel.com,
 alexander.duyck@gmail.com, john.fastabend@gmail.com, ast@fb.com,
 brouer@redhat.com, willemdebruijn.kernel@gmail.com,
 daniel@iogearbox.net, mst@redhat.com, netdev@vger.kernel.org

From: Björn Töpel

This RFC introduces zero-copy (ZC) support for AF_XDP. Programs using
AF_XDP sockets will now receive RX packets without any copies and can
also transmit packets without incurring any copies. No modifications
to the application are needed, but the NIC driver needs to be modified
to support ZC. If ZC is not supported by the driver, the modes
introduced in the AF_XDP patch set will be used. Using ZC in our micro
benchmarks results in significantly improved performance, as can be
seen in the performance section later in this cover letter.

Note that we did not post this as a proper patch set, as suggested by
Alexei, for mainly one reason: the i40e modifications need to be fully
and properly implemented (we need support for dynamically creating and
removing queues in the driver), split up into multiple patches, then
reviewed and QA'ed by the Intel NIC team before they can become a
proper patch set. We simply did not have time to finish all of this in
this merge window.

Alexei had two concerns in conjunction with adding ZC support to
AF_XDP: show that the user interface holds and can deliver good
performance for ZC, and that the driver interfaces for ZC are good. We
think that this patch set shows that we have addressed the first
issue: performance is good and there is no change to the uapi. But
please take a look at the code and see if you like the ZC interfaces;
that was the second concern.

Note that for an untrusted application, HW packet steering to a
specific queue pair (the one associated with the application) is a
requirement when using ZC, as the application would otherwise be able
to see other user space processes' packets. If the HW cannot support
the required packet steering, you need to use the XDP_SKB mode or the
XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
patch set can be used to do load balancing in that case.
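For reference, this is roughly what such an XSKMAP-based load
balancing program can look like. A minimal sketch along the lines of
the xdpsock sample; the map size and section names are illustrative:

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    /* One AF_XDP socket per hardware queue; the map key is the
     * queue id.
     */
    struct bpf_map_def SEC("maps") xsks_map = {
            .type = BPF_MAP_TYPE_XSKMAP,
            .key_size = sizeof(int),
            .value_size = sizeof(int),
            .max_entries = 64,
    };

    SEC("xdp_sock")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
            /* Redirect the frame to the AF_XDP socket bound to the
             * queue the packet arrived on.
             */
            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
    }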
For benchmarking, you can use the xdpsock application from the AF_XDP
patch set without any modifications. Say that you would like your UDP
traffic from port 4242 to end up in queue 16, the queue we will enable
AF_XDP on. Here, we use ethtool for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode with zero-copy can then
be done using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is 8192 MB, and with
8 of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The NIC is an
Intel I40E 40 Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are
without retpoline so that we can compare against previous numbers.

AF_XDP performance, 64 byte packets. Results from the AF_XDP V3 patch
set are also reported for ease of reference.

Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
rxdrop       2.9*       9.6*      21.5
txpush       2.6*       -         21.6
l2fwd        1.9*       2.5*      15.0

* From the AF_XDP V3 patch set and cover letter.

AF_XDP performance, 1500 byte packets:

Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
rxdrop       2.1        3.3        3.3
l2fwd        1.4        1.8        3.1

So why do we not get higher values for RX, similar to the 34 Mpps we
had in AF_PACKET V4? We ran an experiment with the rxdrop benchmark
without using the xdp_do_redirect/flush infrastructure and without an
XDP program (all traffic on a queue goes to one socket); instead, the
driver acts directly on the AF_XDP socket. With this we got 36.9 Mpps,
a significant improvement without any change to the uapi. So not
forcing users to have an XDP program if they do not need one might be
a good idea. This measurement is actually higher than what we got with
AF_PACKET V4.

XDP performance on our system as a baseline:

64 byte packets:

 XDP stats       CPU     pps         issue-pps
 XDP-RX CPU      16      32.3M       0

1500 byte packets:

 XDP stats       CPU     pps         issue-pps
 XDP-RX CPU      16      3.3M        0

The structure of the patch set is as follows:

Patch 1: Removes rebind support. Complicated to support for ZC, so it
  will not be supported for AF_XDP in any mode at this point. Will be
  submitted as a follow-up patch to the AF_XDP patch set.
Patches 2-4: Plumbing for AF_XDP ZC support (see the driver-interface
  sketch after this list)
Patches 5-6: AF_XDP ZC for RX
Patches 7-8: AF_XDP ZC for TX
Patch 9: Minor performance fix for the sample application. ZC will
  work with nearly as good performance without this.
Patches 10-12: ZC support for i40e. Should be broken out into smaller
  pieces as pre-patches.
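To make the plumbing in patches 2-4 a bit more concrete, the sketch
below shows how a driver can hook up the new XDP_SETUP_XSK_UMEM
command. Note that mydrv_bpf() and mydrv_xsk_umem_setup() are
hypothetical names, and the exact netdev_bpf member layout is an
assumption based on this series:

    /* Sketch of a driver ndo_bpf handler for AF_XDP ZC; a NULL umem
     * is assumed to mean "disable zero-copy on this queue pair".
     * mydrv_xsk_umem_setup() is a hypothetical driver helper.
     */
    static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
    {
            switch (bpf->command) {
            case XDP_SETUP_XSK_UMEM:
                    return mydrv_xsk_umem_setup(netdev_priv(dev),
                                                bpf->xsk.umem,
                                                bpf->xsk.queue_id);
            default:
                    return -EINVAL;
            }
    }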
We based this patch set on bpf-next commit f2467c2dbc01
("selftests/bpf: make sure build-id is on").

To do for this RFC to become a patch set:

* Implement dynamic creation and deletion of queues in the i40e driver
* Properly split up the i40e changes
* Have the Intel NIC team review the i40e changes, from at least an
  architecture point of view
* Implement a fairer scheduling policy for multiple XSKs that share an
  umem for TX. This can be combined with a batching API for
  xsk_umem_consume_tx.

We are planning on joining the iovisor call on Wednesday if you would
like to have a chat with us about this.

Thanks: Björn and Magnus

Björn Töpel (8):
  xsk: remove rebind support
  xsk: moved struct xdp_umem definition
  xsk: introduce xdp_umem_frame
  net: xdp: added bpf_netdev_command XDP_SETUP_XSK_UMEM
  xdp: add MEM_TYPE_ZERO_COPY
  xsk: add zero-copy support for Rx
  i40e: added queue pair disable/enable functions
  i40e: implement AF_XDP zero-copy support for Rx

Magnus Karlsson (4):
  net: added netdevice operation for Tx
  xsk: wire up Tx zero-copy functions
  samples/bpf: minor *_nb_free performance fix
  i40e: implement Tx zero-copy

 drivers/net/ethernet/intel/i40e/i40e.h      |  20 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 458 +++++++++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 635 +++++++++++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  36 +-
 include/linux/netdevice.h                   |  13 +
 include/net/xdp.h                           |  10 +
 include/net/xdp_sock.h                      |  45 +-
 net/core/xdp.c                              |  47 +-
 net/xdp/xdp_umem.c                          | 112 ++++-
 net/xdp/xdp_umem.h                          |  42 +-
 net/xdp/xdp_umem_props.h                    |  23 -
 net/xdp/xsk.c                               | 162 +++++--
 net/xdp/xsk_queue.h                         |  35 +-
 samples/bpf/xdpsock_user.c                  |   8 +-
 14 files changed, 1458 insertions(+), 188 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

--
2.14.1