Re: RFC: zero copy recv()

From: Eric Dumazet <eric.dumazet@gmail.com>
To: Maxim Uvarov <maxim.uvarov@linaro.org>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>
Subject: Re: RFC: zero copy recv()
Date: Thu, 25 Apr 2019 10:50:03 -0700
Message-ID: <77665188-27f2-6567-9e0c-62c66d98f436@gmail.com>
In-Reply-To: <CAD8XO3b0m5Qn1Ey3gu3HPmcOanN-yjCYBJZEUEu754X=5jAtOA@mail.gmail.com>

On 4/25/19 1:01 AM, Maxim Uvarov wrote:
> On Wed, 24 Apr 2019 at 18:59, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 04/23/2019 11:23 PM, Maxim Uvarov wrote:
>>> Hello,
>>>
>>> At various conferences I see people trying to accelerate
>>> networking by moving packet processing, including the protocol level,
>>> completely to user space. It might be DPDK, ODP or AF_XDP plus some
>>> network stack on top of it. Then people try to test this solution
>>> with some existing applications, preferably without modifying the
>>> application binaries, by using LD_PRELOAD to intercept socket calls
>>> (recv(), sendto(), etc.). The current recv() expects the application
>>> to allocate memory, and the call "copies" the packet into that
>>> memory. A copy per packet is slow. Can we consider implementing
>>> zero-copy-friendly API calls? Could such a change be accepted into
>>> the kernel?
>>
> 
> Hello Eric, thanks for responding.
> 
>> Generic zero copy is hard.
>>
> 
> yes that is true.
> 
>> As soon as you have multiple consumers in different domains for the data,
>> you need some kind of multiplexing, typically using hardware capabilities.
>>
>> For TCP, we implemented zero copy last year, which works quite well
>> on x86 if your network uses MTU of 4096+headers.
>>
>> tools/testing/selftests/net/tcp_mmap.c  reaches line rate (100Gbit) on
>> a single TCP flow, if using a NIC able to perform header split.
>>
> 
> That is great work. But aren't there context switches on
> getsockopt(TCP_ZEROCOPY_RECEIVE) and read() per packet?

No, since in many cases you actually know how many bytes you expect to receive.

SO_RCVLOWAT can be used by the application to tell the kernel:

- Please send me an EPOLLIN only when you have at least XXXXXX bytes available in the receive queue.
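
For illustration, a minimal sketch of that pattern in C. The helper names
set_rcvlowat() and wait_readable() are made up for this example (they are
not taken from tcp_mmap.c), and a real receiver would keep one long-lived
epoll instance rather than creating one per wait as done here for brevity.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to treat the socket as readable
 * only once at least `want` bytes are queued.
 * Returns 0 on success, -1 on error (errno set by setsockopt()). */
static int set_rcvlowat(int fd, int want)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &want, sizeof(want));
}

/* Hypothetical helper: wait until fd reports EPOLLIN or timeout_ms
 * elapses. Returns 1 if an event was reported, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	struct epoll_event out;
	int ep, n;

	ep = epoll_create1(0);
	if (ep < 0)
		return -1;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(ep);
		return -1;
	}
	n = epoll_wait(ep, &out, 1, timeout_ms);
	close(ep);
	return n;
}
```

With a connected TCP socket, calling set_rcvlowat(fd, chunk) once and then
wait_readable(fd, -1) before each receive should give roughly one wakeup
per chunk instead of one per packet, so a following
getsockopt(TCP_ZEROCOPY_RECEIVE) or read() can consume many packets in a
single call.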

> 
> I played with AF_XDP, where one core can be isolated to poll the
> umem pool memory while another core does the softirq processing.
> Polling the umem is really fast, about 96 ns on a 2.5 GHz x86 laptop,
> with no context switches on the umem-polling core.

Sure, but again this is very far from being 'generic', let's say if you want to reuse the TCP stack...

> 
> But in general, for the tcp_mmap.c code, if getsockopt()+read() were
> replaced by one zero-copy call, something like recvmsg_zc(), then it
> could be LD_PRELOADed.
> mmap() could also be moved into socket creation to simplify the API.
> Does that look reasonable?

Honestly I prefer not having to play games like that.

There are many subtle issues there, really.

> 
>> But the model is not to run a legacy application with some LD_PRELOAD
>> hack/magic, sorry.
>>
> More likely, legacy applications will want to use zero copy
> networking. Once the API is stable they will support it, especially
> if the API can be used with minimal changes to the apps.
> Then it will be quite easy to use an LD_PRELOAD hack or change the
> application to use some other IP stack.
> 
> Maxim.
>