From: Björn Töpel
Subject: Re: [PATCH net] xsk: remove cheap_dma optimization
To: Christoph Hellwig, Daniel Borkmann
Cc: Björn Töpel, netdev@vger.kernel.org, davem@davemloft.net,
    konrad.wilk@oracle.com, iommu@lists.linux-foundation.org,
    linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
    maximmi@mellanox.com, magnus.karlsson@intel.com,
    jonathan.lemon@gmail.com
Date: Sun, 28 Jun 2020 19:16:33 +0200
Message-ID: <88d27e1b-dbda-301c-64ba-2391092e3236@intel.com>
In-Reply-To: <20200627070406.GB11854@lst.de>
References: <20200626134358.90122-1-bjorn.topel@gmail.com>
 <20200627070406.GB11854@lst.de>

On 2020-06-27 09:04, Christoph Hellwig wrote:
> On Sat, Jun 27, 2020 at 01:00:19AM +0200, Daniel Borkmann wrote:
>> Given there is roughly a ~5 weeks window at max where this removal could
>> still be applied in the worst case, could we come up with a fix / proposal
>> first that moves this into the DMA mapping core? If there is something that
>> can be agreed upon by all parties, then we could avoid re-adding the 9%
>> slowdown. :/
>
> I'd rather turn it upside down - this abuse of the internals blocks work
> that has basically just missed the previous window and I'm not going
> to wait weeks to sort out the API misuse. But we can add optimizations
> back later if we find a sane way.
I'm not super excited about the performance loss, but I do get
Christoph's frustration about this hack gutting the DMA API and making
it harder for the DMA people to get work done. Let's try to solve this
properly, using proper DMA APIs.

> That being said I really can't see how this would make so much of a
> difference. What architecture and what dma_ops are you using for
> those measurements? What is the workload?

The 9% is for an AF_XDP (a fast raw Ethernet socket; think AF_PACKET,
but faster) benchmark: receive the packet from the NIC, and drop it.
The DMA syncs stand out in perf top:

  28.63%  [kernel]                   [k] i40e_clean_rx_irq_zc
  17.12%  [kernel]                   [k] xp_alloc
   8.80%  [kernel]                   [k] __xsk_rcv_zc
   7.69%  [kernel]                   [k] xdp_do_redirect
   5.35%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
   4.77%  [kernel]                   [k] xsk_rcv.part.0
   4.07%  [kernel]                   [k] __xsk_map_redirect
   3.80%  [kernel]                   [k] dma_direct_sync_single_for_cpu
   3.03%  [kernel]                   [k] dma_direct_sync_single_for_device
   2.76%  [kernel]                   [k] i40e_alloc_rx_buffers_zc
   1.83%  [kernel]                   [k] xsk_flush
  ...

For this benchmark the dma_ops are NULL (dma_is_direct() == true), and
the main issue is that SWIOTLB is now unconditionally enabled [1] for
x86, so for each sync we have to check via is_swiotlb_buffer() whether
the address is an SWIOTLB buffer, which involves some costly
indirection. That is pretty much what my hack avoided: since AF_XDP
has long-term DMA mappings, we did all the checks up front at map time
and just set a flag. Avoiding the whole "is this address SWIOTLB?"
check in dma_direct_sync_single_for_{cpu,device}() on a per-packet
basis would help a lot. (A rough sketch of that idea is at the end of
this mail.)

Somewhat related to the DMA API: it would have performance benefits
for AF_XDP if the DMA range of the mapped memory were linear, e.g. via
IOMMU utilization. I've started hacking on this a little bit, but it
would be nice if such an API were part of the mapping core.

  Input:  array of pages
  Output: array of dma addrs (and obviously dev, flags and such)

  For non-IOMMU:       len(array of pages) == len(array of dma addrs)
  For best-case IOMMU: len(array of dma addrs) == 1 (large linear space)

(See the second sketch below for the rough shape of this.)

But that's for later. :-)


Björn

[1] commit 09230cbc1bab ("swiotlb: move the SWIOTLB config symbol to
    lib/Kconfig")
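
P.S. To make the "check once at map time, set a flag" idea a bit more
concrete, here is a rough, untested sketch. The dma_need_sync() helper
and the pool->dma_need_sync flag are made-up names for illustration,
not existing kernel APIs; the point is just that the SWIOTLB/coherency
lookup gets hoisted out of the per-packet path into the DMA core:

  /* Hypothetical DMA-core helper, sketch only. */
  #include <linux/dma-direct.h>       /* dma_to_phys() */
  #include <linux/dma-mapping.h>
  #include <linux/dma-noncoherent.h>  /* dev_is_dma_coherent() */
  #include <linux/swiotlb.h>          /* is_swiotlb_buffer() */

  /*
   * Would dma_sync_single_for_{cpu,device}() ever do real work for
   * this mapping? Meant to be evaluated once, at map time, which is
   * fine for AF_XDP since its mappings are long-term.
   */
  bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
  {
          if (get_dma_ops(dev))
                  return true;  /* non-direct dma_ops: be conservative */
          if (!dev_is_dma_coherent(dev))
                  return true;  /* non-coherent device: always sync */
          return is_swiotlb_buffer(dma_to_phys(dev, dma_addr));
  }

  /* Caller side, once at buffer-pool/umem mapping time: */
  pool->dma_need_sync = dma_need_sync(dev, dma_addr);

  /* Per-packet hot path: the SWIOTLB lookup is gone. */
  if (pool->dma_need_sync)
          dma_sync_single_for_cpu(dev, dma_addr, len, DMA_FROM_DEVICE);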
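
P.P.S. And for the pages-in/dma-addrs-out mapping, something like the
below is roughly the shape I have in mind. Again a sketch with a
made-up name (dma_map_pages_compact() does not exist); an IOMMU
backend would try to cover all pages with one contiguous IOVA range
and return 1, while the dma-direct fallback shown here just returns
one address per page:

  #include <linux/dma-mapping.h>

  /*
   * Map @npages pages for DMA, making the resulting DMA range as
   * linear as the backend allows. Fills @dma_addrs and returns the
   * number of entries used (npages for dma-direct, ideally 1 with an
   * IOMMU), or a negative errno.
   */
  int dma_map_pages_compact(struct device *dev, struct page **pages,
                            unsigned int npages,
                            enum dma_data_direction dir,
                            unsigned long attrs, dma_addr_t *dma_addrs)
  {
          unsigned int i;

          /* dma-direct fallback: no linearization possible. */
          for (i = 0; i < npages; i++) {
                  dma_addrs[i] = dma_map_page_attrs(dev, pages[i], 0,
                                                    PAGE_SIZE, dir,
                                                    attrs);
                  if (dma_mapping_error(dev, dma_addrs[i]))
                          goto err_unmap;
          }
          return npages;

  err_unmap:
          while (i--)
                  dma_unmap_page_attrs(dev, dma_addrs[i], PAGE_SIZE,
                                       dir, attrs);
          return -ENOMEM;
  }

In the best case (one returned entry) the address of page i is simply
dma_addrs[0] + i * PAGE_SIZE, which is what would make the AF_XDP umem
address lookup cheap.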