netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Ahern <dsahern@gmail.com>
To: Boris Pismenny <borispismenny@gmail.com>,
	Boris Pismenny <borisp@mellanox.com>,
	kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com,
	hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org,
	viro@zeniv.linux.org.uk, edumazet@google.com
Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org,
	netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com,
	yorayz@nvidia.com, Ben Ben-Ishay <benishay@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Yoray Zack <yorayz@mellanox.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
Date: Thu, 17 Dec 2020 17:44:22 -0700	[thread overview]
Message-ID: <1b6314aa-fa22-0847-67e6-280f73281110@gmail.com> (raw)
In-Reply-To: <999f935c-310b-39e0-6f77-6f39192cabc2@gmail.com>

On 12/17/20 12:06 PM, Boris Pismenny wrote:
> 
> 
> On 15/12/2020 7:19, David Ahern wrote:
>> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>>> as zerocopy for the following reasons:
>>>>> (1) The former places buffers *exactly* where the user requests
>>>>> regardless of the order of response arrivals, while the latter places packets
>>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>>> can be implemented using data placement, but not vice versa.
>>>>
>>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>>> starting point where seq N == sgl element 0, position 0. Presumably
>>>> there is a hardware cursor to track where you are in filling the SGL as
>>>> packets are processed. You abort on OOO, so it seems like a fairly
>>>> straightfoward problem.
>>>>
>>>
>>> We do not abort on OOO. Moreover, we can keep going as long as
>>> PDU headers are not reordered.
>>
>> Meaning packets received OOO which contain all or part of a PDU header
>> are aborted, but pure data packets can arrive out-of-order?
>>
>> Did you measure the affect of OOO packets? e.g., randomly drop 1 in 1000
>> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
>> benchmarks?
>>
> 
> Yes for TLS where similar ideas are used, but not for NVMe-TCP, yet.
> At the worst case we measured (5% OOO), and we got the same performance
> as pure software TLS under these conditions. We will strive to have the
> same for nvme-tcp. We would be able to test this on nvme-tcp only when we
> have hardware. For now, we are using a mix of emulation and simulation to
> test and benchmark.

Interesting. So you don't have hardware today for this feature.

> 
>>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>>> allows other use cases to build on this as opposed to the checks you
>>>> want for a special case. For example, a split at the protocol headers /
>>>> payload boundaries would be a generic solution where kernel managed
>>>> protocols get data in one buffer and socket data is put into a given
>>>> SGL. I am guessing that you have to be already doing this to put PDU
>>>> payloads into an SGL and other headers into other memory to make a
>>>> complete packet, so this is not too far off from what you are already doing.
>>>>
>>>
>>> Splitting at protocol header boundaries and placing data at socket defined
>>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>>> responses. Here is an example:
>>>
>>> the host submits the following requests:
>>> +--------+--------+--------+
>>> | Read 1 | Read 2 | Read 3 |
>>> +--------+--------+--------+
>>>
>>> the target responds with the following responses:
>>> +--------+--------+--------+
>>> | Resp 2 | Resp 3 | Resp 1 |
>>> +--------+--------+--------+
>>
>> Does 'Resp N' == 'PDU + data' like this:
>>
>>  +---------+--------+---------+--------+---------+--------+
>>  | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
>>  +---------+--------+---------+--------+---------+--------+
>>
>> or is it 1 PDU hdr + all of the responses?
>>
> 
> Yes, 'RespN = PDU header + PDU data' segmented by TCP whichever way it
> chooses to do so. The PDU header's command_id field correlates between

I thought so and verified in the NVME over Fabrics spec. Not sure why
you brought up order of responses; that should be irrelevant to the design.

> the request and the response. We use that correlation in hardware to
> identify the buffers where data needs to be scattered. In other words,
> hardware holds a map between command_id and block layer buffers SGL.

Understood. Your design is forcing a command id to sgl correlation vs
handing h/w generic or shaped SGLs for writing the socket data.

> 
>>>
>>> I think that the interface we created (tcp_ddp) is sufficiently generic
>>> for the task at hand, which is offloading protocols that can re-order
>>> their responses, a non-trivial task that we claim is important.
>>>
>>> We designed it to support other protocols and not just nvme-tcp,
>>> which is merely an example. For instance, I think that supporting iSCSI
>>> would be natural, and that other protocols will fit nicely.
>>
>> It would be good to add documentation that describes the design, its
>> assumptions and its limitations. tls has several under
>> Documentation/networking. e.g., one important limitation to note is that
>> this design only works for TCP sockets owned by kernel modules.
>>
> 
> You are right. I'll do so for tcp_ddp. You are right that it works only
> for kernel TCP sockets, but maybe future work will extend it.

Little to nothing about this design can work for userspace sockets or
setups like Jonathan's netgpu. e.g., the memory address compares for
whether to copy the data will not work with userspace buffers which is
why I suggested something more generic like marking the skb as "h/w
offload" but that does not work for you since your skbs have a mixed bag
of fragments.


  reply	other threads:[~2020-12-18  0:46 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
2020-12-08  0:39   ` David Ahern
2020-12-08 14:30     ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
2020-12-08  0:42   ` David Ahern
2020-12-08 14:36     ` Boris Pismenny
2020-12-09  0:38       ` David Ahern
2020-12-09  8:15         ` Boris Pismenny
2020-12-10  4:26           ` David Ahern
2020-12-11  2:01             ` Jakub Kicinski
2020-12-11  2:43               ` David Ahern
2020-12-11 18:45                 ` Jakub Kicinski
2020-12-11 18:58                   ` Eric Dumazet
2020-12-11 19:59                   ` David Ahern
2020-12-11 23:05                     ` Jonathan Lemon
2020-12-13 18:34                   ` Boris Pismenny
2020-12-13 18:21             ` Boris Pismenny
2020-12-15  5:19               ` David Ahern
2020-12-17 19:06                 ` Boris Pismenny
2020-12-18  0:44                   ` David Ahern [this message]
2020-12-09  0:57   ` David Ahern
2020-12-09  1:11     ` David Ahern
2020-12-09  8:28       ` Boris Pismenny
2020-12-09  8:25     ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 03/15] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock Boris Pismenny
2020-12-09  1:06   ` David Ahern
2020-12-09  7:41     ` Boris Pismenny
2020-12-10  3:39       ` David Ahern
2020-12-11 18:43         ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path Boris Pismenny
2020-12-10 17:15   ` Shai Malin
2020-12-14  6:38     ` Boris Pismenny
2020-12-15 13:33       ` Shai Malin
2020-12-17 18:51         ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 06/15] nvme-tcp: Add DDP data-path Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
2020-12-15 14:07   ` Shai Malin
2020-12-07 21:06 ` [PATCH v1 net-next 08/15] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 09/15] net/mlx5: Header file changes for nvme-tcp offload Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 10/15] net/mlx5: Add 128B CQE for NVMEoTCP offload Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 11/15] net/mlx5e: TCP flow steering for nvme-tcp Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 12/15] net/mlx5e: NVMEoTCP DDP offload control path Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
2020-12-18  0:57   ` David Ahern
2020-12-07 21:06 ` [PATCH v1 net-next 14/15] net/mlx5e: NVMEoTCP statistics Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 15/15] net/mlx5e: NVMEoTCP workaround CRC after resync Boris Pismenny
2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
2021-01-14  4:47   ` David Ahern
2021-01-14 19:21     ` Boris Pismenny
2021-01-14 19:17   ` Boris Pismenny
2021-01-14 21:07     ` Sagi Grimberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1b6314aa-fa22-0847-67e6-280f73281110@gmail.com \
    --to=dsahern@gmail.com \
    --cc=axboe@fb.com \
    --cc=benishay@mellanox.com \
    --cc=benishay@nvidia.com \
    --cc=boris.pismenny@gmail.com \
    --cc=borisp@mellanox.com \
    --cc=borisp@nvidia.com \
    --cc=borispismenny@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=netdev@vger.kernel.org \
    --cc=ogerlitz@mellanox.com \
    --cc=ogerlitz@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=sagi@grimberg.me \
    --cc=viro@zeniv.linux.org.uk \
    --cc=yorayz@mellanox.com \
    --cc=yorayz@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).