From: Maxim Uvarov
Date: Thu, 25 Apr 2019 11:01:22 +0300
Subject: Re: RFC: zero copy recv()
To: Eric Dumazet
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Ilias Apalodimas

On Wed, 24 Apr 2019 at 18:59, Eric Dumazet wrote:
>
>
>
> On 04/23/2019 11:23 PM, Maxim Uvarov wrote:
> > Hello,
> >
> > On different conferences I see that people are trying to accelerate
> > network with putting packet processing with protocol level completely
> > to user space. It might be DPDK, ODP or AF_XDP plus some network
> > stack on top of it. Then people are trying to test this solution with
> > some existence applications. And in better way do not modify
> > application binaries and just LD_PRELOAD sockets syscalls (recv(),
> > sendto() and etc). Current recv() expects that application allocates
> > memory and call will "copy" packet to that memory. Copy per packet is
> > slow. Can we consider about implementing zero copy API calls
> > friendly? Can this change be accepted to kernel?
>

Hello Eric, thanks for responding.

> Generic zero copy is hard.
>

Yes, that is true.

> As soon as you have multiple consumers in different domains for the data,
> you need some kind of multiplexing, typically using hardware capabilities.
>
> For TCP, we implemented zero copy last year, which works quite well
> on x86 if your network uses MTU of 4096+headers.
>
> tools/testing/selftests/net/tcp_mmap.c reaches line rate (100Gbit) on
> a single TCP flow, if using a NIC able to perform header split.
>

That is great work. But aren't there context switches on
getsockopt(TCP_ZEROCOPY_RECEIVE) and read() per packet? I played with
AF_XDP, where one core can be isolated to poll the umem pool memory
while another core does the softirq processing. Polling the umem is
really fast: about 96 ns on a 2.5 GHz x86 laptop, with no context
switches on the umem polling core.

But in general, if the getsockopt()+read() pair in the tcp_mmap.c code
were replaced by a single zero copy call, something like recvmsg_zc(),
then it could be LD_PRELOADed. The mmap() could also be moved into
socket creation to simplify the API. Does that look reasonable?

> But the model is not to run a legacy application with some LD_PRELOAD
> hack/magic, sorry.
>

More likely, legacy applications will want to use zero copy
networking. Once the API is stable they will support it, especially if
it can be adopted with minimal changes to the application. Then it
will be quite easy either to use an LD_PRELOAD hack or to change the
application to use some other IP stack.

Maxim.
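
To make the API shape under discussion concrete, here is a minimal
user-space sketch of the difference between a copying receive and a
zero copy receive. It is a toy model only: it is not kernel code, not
the AF_XDP umem interface, and not the TCP_ZEROCOPY_RECEIVE path. The
names pkt_ring, ring_push(), ring_recv_copy(), ring_recv_zc() and
ring_release() are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4      /* power of two, so index wrap-around stays consistent */
#define SLOT_SIZE  2048

/* Toy packet pool standing in for a shared buffer area (hypothetical). */
struct pkt_ring {
	unsigned char data[RING_SLOTS][SLOT_SIZE];
	size_t len[RING_SLOTS];
	unsigned int head;    /* next slot the producer fills */
	unsigned int tail;    /* next slot the consumer reads */
};

/* Producer: store one packet. Returns 0, or -1 if the ring is full
 * or the packet does not fit in a slot. */
static int ring_push(struct pkt_ring *r, const void *pkt, size_t len)
{
	unsigned int slot;

	if (len > SLOT_SIZE || r->head - r->tail == RING_SLOTS)
		return -1;
	slot = r->head % RING_SLOTS;
	memcpy(r->data[slot], pkt, len);
	r->len[slot] = len;
	r->head++;
	return 0;
}

/* recv()-style: the caller owns buf, so every packet costs one memcpy().
 * Returns the number of bytes copied, or -1 if the ring is empty. */
static long ring_recv_copy(struct pkt_ring *r, void *buf, size_t buflen)
{
	unsigned int slot;
	size_t n;

	if (r->head == r->tail)
		return -1;
	slot = r->tail % RING_SLOTS;
	n = r->len[slot] < buflen ? r->len[slot] : buflen;
	memcpy(buf, r->data[slot], n);
	r->tail++;
	return (long)n;
}

/* Zero copy style: hand out a pointer into the pool instead of copying.
 * The slot stays owned by the caller until ring_release() is called.
 * Returns the packet length, or -1 if the ring is empty. */
static long ring_recv_zc(struct pkt_ring *r, const void **pkt)
{
	unsigned int slot;

	if (r->head == r->tail)
		return -1;
	slot = r->tail % RING_SLOTS;
	*pkt = r->data[slot];
	return (long)r->len[slot];
}

/* Give the slot obtained from ring_recv_zc() back to the producer. */
static void ring_release(struct pkt_ring *r)
{
	assert(r->head != r->tail);   /* releasing with nothing held is a bug */
	r->tail++;
}
```

The point of the sketch is the ownership change: with the zero copy
variant the application no longer supplies the buffer, so it has to
return it explicitly. That is the part an LD_PRELOADed recv() cannot
hide from an unmodified application, which still expects the data to
land in its own buffer.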