From: David Ahern <dsahern@gmail.com>
To: Boris Pismenny <borispismenny@gmail.com>,
	Boris Pismenny <borisp@mellanox.com>,
	kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com,
	hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org,
	viro@zeniv.linux.org.uk, edumazet@google.com
Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org,
	netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com,
	yorayz@nvidia.com, Ben Ben-Ishay <benishay@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Yoray Zack <yorayz@mellanox.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
Date: Thu, 17 Dec 2020 17:44:22 -0700	[thread overview]
Message-ID: <1b6314aa-fa22-0847-67e6-280f73281110@gmail.com> (raw)
In-Reply-To: <999f935c-310b-39e0-6f77-6f39192cabc2@gmail.com>

On 12/17/20 12:06 PM, Boris Pismenny wrote:
> 
> 
> On 15/12/2020 7:19, David Ahern wrote:
>> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>>> as zerocopy for the following reasons:
>>>>> (1) The former places buffers *exactly* where the user requests
>>>>> regardless of the order of response arrivals, while the latter places packets
>>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>>> can be implemented using data placement, but not vice versa.
>>>>
>>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>>> starting point where seq N == sgl element 0, position 0. Presumably
>>>> there is a hardware cursor to track where you are in filling the SGL as
>>>> packets are processed. You abort on OOO, so it seems like a fairly
>>>> straightforward problem.
>>>>
>>>
>>> We do not abort on OOO. Moreover, we can keep going as long as
>>> PDU headers are not reordered.
>>
>> Meaning packets received OOO which contain all or part of a PDU header
>> are aborted, but pure data packets can arrive out-of-order?
>>
>> Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
>> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
>> benchmarks?
>>
> 
> Yes for TLS, where similar ideas are used, but not yet for NVMe-TCP.
> In the worst case we measured (5% OOO), we got the same performance
> as pure software TLS under these conditions. We will strive for the
> same with nvme-tcp, but we will be able to test it on nvme-tcp only
> once we have hardware. For now, we are using a mix of emulation and
> simulation to test and benchmark.

Interesting. So you don't have hardware today for this feature.
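
For my own understanding, the model I took away from your description is
roughly the sketch below. This is an illustrative userspace mock-up with
made-up names, not the code in this series: a data segment can be placed
even when it arrives out of order, because its destination is fully
determined by the command's SGL plus the segment's offset within the PDU;
only a reordered PDU header breaks that mapping.

/*
 * Illustrative only -- not the code in this series, and the names are
 * made up.  memcpy() stands in for the device's DMA write.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct sg_entry {
	void *addr;
	uint32_t len;
};

struct ddp_pdu_state {
	uint16_t command_id;      /* taken from the in-order PDU header */
	uint32_t data_start_seq;  /* TCP seq of the first payload byte */
	uint32_t data_len;        /* payload length announced by the header */
	struct sg_entry *sgl;     /* buffers registered for command_id */
	unsigned int nents;
};

/* Place one TCP segment's payload; return false to fall back to a copy. */
static bool ddp_place(struct ddp_pdu_state *pdu, uint32_t seq,
		      const void *payload, uint32_t len)
{
	uint32_t off = seq - pdu->data_start_seq; /* OOO data is still placeable */
	unsigned int i;

	if (off + len > pdu->data_len)
		return false;                     /* outside this PDU: fall back */

	for (i = 0; i < pdu->nents && len; i++) {
		struct sg_entry *sge = &pdu->sgl[i];
		uint32_t chunk;

		if (off >= sge->len) {
			off -= sge->len;
			continue;
		}
		chunk = sge->len - off < len ? sge->len - off : len;
		memcpy((char *)sge->addr + off, payload, chunk);
		payload = (const char *)payload + chunk;
		len -= chunk;
		off = 0;
	}
	return len == 0;
}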

> 
>>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>>> allows other use cases to build on this as opposed to the checks you
>>>> want for a special case. For example, a split at the protocol headers /
>>>> payload boundaries would be a generic solution where kernel managed
>>>> protocols get data in one buffer and socket data is put into a given
>>>> SGL. I am guessing that you have to be already doing this to put PDU
>>>> payloads into an SGL and other headers into other memory to make a
>>>> complete packet, so this is not too far off from what you are already doing.
>>>>
>>>
>>> Splitting at protocol header boundaries and placing data at socket defined
>>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>>> responses. Here is an example:
>>>
>>> the host submits the following requests:
>>> +--------+--------+--------+
>>> | Read 1 | Read 2 | Read 3 |
>>> +--------+--------+--------+
>>>
>>> the target responds with the following responses:
>>> +--------+--------+--------+
>>> | Resp 2 | Resp 3 | Resp 1 |
>>> +--------+--------+--------+
>>
>> Does 'Resp N' == 'PDU + data' like this:
>>
>>  +---------+--------+---------+--------+---------+--------+
>>  | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
>>  +---------+--------+---------+--------+---------+--------+
>>
>> or is it 1 PDU hdr + all of the responses?
>>
> 
> Yes, 'RespN = PDU header + PDU data', segmented by TCP however it
> chooses. The PDU header's command_id field correlates

I thought so, and verified it in the NVMe over Fabrics spec. Not sure why
you brought up the order of responses; that should be irrelevant to the
design.
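
For reference, the framing I have in mind is roughly the sketch below;
field names are illustrative approximations of the spec, and the
endianness/packing details are omitted:

#include <stdint.h>

/*
 * Simplified, illustrative view of a data-in response PDU.  The
 * command_id is what lets the receiver find the right destination
 * buffers regardless of the order in which the target sends responses.
 */
struct resp_pdu_hdr_sketch {
	uint8_t  type;         /* PDU type, e.g. data from controller to host */
	uint8_t  flags;
	uint8_t  hlen;         /* header length */
	uint8_t  pdo;          /* payload data offset within the PDU */
	uint32_t plen;         /* total PDU length (header + payload) */
	uint16_t command_id;   /* correlates this response with the request */
	uint32_t data_offset;  /* where this payload sits within the command */
	uint32_t data_length;  /* payload bytes carried by this PDU */
};
/* ...followed by data_length bytes of payload, segmented by TCP. */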

> the request and the response. We use that correlation in hardware to
> identify the buffers where data needs to be scattered. In other words,
> hardware holds a map between command_id and block layer buffers SGL.

Understood. Your design forces a command_id-to-SGL correlation rather than
handing the h/w generic or shaped SGLs for writing the socket data.
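
i.e., conceptually something like the lookup below, keyed by command_id.
This is an illustrative sketch with hypothetical names; in your design the
table is device state rather than a host-side array:

#include <stddef.h>
#include <stdint.h>

struct sg_entry {
	void *addr;
	uint32_t len;
};

struct sgl {
	struct sg_entry *ents;
	unsigned int nents;
};

#define MAX_OUTSTANDING_CMDS 1024

static struct sgl *cmd_to_sgl[MAX_OUTSTANDING_CMDS];

/* Register the block-layer SGL for a command before its response arrives. */
static int ddp_setup(uint16_t command_id, struct sgl *sgl)
{
	if (command_id >= MAX_OUTSTANDING_CMDS)
		return -1;
	cmd_to_sgl[command_id] = sgl;   /* destination known before data arrives */
	return 0;
}

/* Look the SGL up by command_id when a response PDU header is parsed. */
static struct sgl *ddp_lookup(uint16_t command_id)
{
	return command_id < MAX_OUTSTANDING_CMDS ? cmd_to_sgl[command_id] : NULL;
}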

> 
>>>
>>> I think that the interface we created (tcp_ddp) is sufficiently generic
>>> for the task at hand, which is offloading protocols that can re-order
>>> their responses, a non-trivial task that we claim is important.
>>>
>>> We designed it to support other protocols and not just nvme-tcp,
>>> which is merely an example. For instance, I think that supporting iSCSI
>>> would be natural, and that other protocols will fit nicely.
>>
>> It would be good to add documentation that describes the design, its
>> assumptions and its limitations. tls has several under
>> Documentation/networking. e.g., one important limitation to note is that
>> this design only works for TCP sockets owned by kernel modules.
>>
> 
> You are right; I'll add documentation for tcp_ddp. It is true that this
> works only for kernel TCP sockets, but maybe future work will extend it.

Little to nothing about this design can work for userspace sockets or
setups like Jonathan's netgpu. For example, the memory address compares
used to decide whether to copy the data will not work with userspace
buffers, which is why I suggested something more generic like marking the
skb as "h/w offload"; but that does not work for you, since your skbs
carry a mixed bag of fragments.
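
To spell out the address compare I am referring to, the receive path
conceptually reduces to something like the sketch below (illustrative
only):

#include <string.h>

/*
 * If the NIC already wrote a fragment directly into the destination
 * buffer, source and destination refer to the same memory and the copy
 * can be skipped; otherwise the software path copies as usual.
 * Comparing addresses like this only works when both sides are
 * kernel-known buffers, which is why the scheme as posted is limited to
 * kernel socket consumers.
 */
static void consume_fragment(void *dst, const void *frag, size_t len)
{
	if (dst == frag)
		return;                 /* already placed by the NIC: skip the copy */

	memcpy(dst, frag, len);         /* software fallback */
}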

