From: Shai Malin <smalin@nvidia.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	Aurelien Aptel <aaptel@nvidia.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"hch@lst.de" <hch@lst.de>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"axboe@fb.com" <axboe@fb.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"kuba@kernel.org" <kuba@kernel.org>
Cc: "aurelien.aptel@gmail.com" <aurelien.aptel@gmail.com>,
	"malin1024@gmail.com" <malin1024@gmail.com>,
	Or Gerlitz <ogerlitz@nvidia.com>, Yoray Zack <yorayz@nvidia.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: RE: [PATCH v11 00/25] nvme-tcp receive offloads
Date: Wed, 1 Mar 2023 19:13:58 +0000	[thread overview]
Message-ID: <DM6PR12MB3564B3D0D489B7F4325E69D9BCAD9@DM6PR12MB3564.namprd12.prod.outlook.com> (raw)
In-Reply-To: <72746760-f045-d7bc-1557-255720d7638d@grimberg.me>

Hi Sagi,

On Thu, 23 Feb 2023, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hey Aurelien and Co,
> 
> I've spent some time today looking at the last iteration of this,
> What I cannot understand, is how will this ever be used outside
> of the kernel nvme-tcp host driver?
> 
> It seems that the interface is designed to fit only a kernel
> consumer, and a very specific one.

As part of this series, we are only covering the kernel nvme-tcp host driver.
The ULP layer we are introducing was also designed for other kernel drivers,
such as the kernel nvme-tcp target and iSCSI.

> 
> Have you considered using a more standard interfaces to use this
> such that spdk or an io_uring based initiator can use it?

The main problem, which I will explain in more detail below (under the digest,
teardown and resync flows), is that a more standard interface that hides what
the HW needs would require duplicating the NVMeTCP logic of tracking PDUs in
the TCP stream. This seems wrong to us.

I can add that we are also working on an end-to-end user-space design for SPDK,
and we don't see how the two designs could share the same socket APIs without
hurting the performance gains of both.

> 
> To me it appears that:
> - ddp limits can be obtained via getsockopt
> - sk_add/sk_del can be done via setsockopt
> - offloaded DDGST crc can be obtained via something like
>    msghdr.msg_control

Hiding it from the NVMeTCP driver would require us to duplicate the NVMeTCP
logic of tracking PDUs in the TCP stream.
In the common case, when there are no DDGST errors, the NVMeTCP driver has
nothing to do. But on errors (or simply an out-of-order packet in the middle
of a PDU), the DDGST has to be calculated in SW, and it seems wrong to us to
duplicate that calculation outside of the NVMeTCP driver.
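To make the SW fallback concrete, here is a minimal host-side sketch in
Python (purely illustrative; the function names are ours, not the driver's)
of the decision the driver cannot avoid making: trust the digest when the
NIC could track the PDU and verified it, or recompute CRC32C in software
when it could not:

```python
# CRC-32C (Castagnoli), the polynomial NVMe/TCP uses for DDGST.
# Bitwise reference implementation -- a real driver would use the
# kernel's crc32c() from the crypto library, not this loop.
def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def verify_ddgst(pdu_data: bytes, wire_digest: int, hw_verified: bool) -> bool:
    """If the NIC tracked the PDU and already verified its digest, no SW
    work is needed.  Otherwise (e.g. an out-of-order packet landed in the
    middle of the PDU) the driver must compute the digest itself."""
    if hw_verified:
        return True
    return crc32c(pdu_data) == wire_digest
```

The point of the sketch is the branch: whichever layer owns `hw_verified`
must also own the PDU tracking that decides it, which is why pushing this
behind a generic socket API duplicates the NVMeTCP state machine.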

> - Perhaps for setting up the offload per IO, recvmsg would be the
>    vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
>    all the details of what the HW needs (the command_id would be set
>    somewhere in the msghdr).

Our design includes the following steps per IO:
- ddp_setup (register the command id and buffer to the HW)
- The existing flow which sends the command
- The existing flow of read_sock()
- ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)

Using recvmsg() would only replace read_sock(), but that part of the NVMeTCP
driver is not impacted by the offload design.
ddp_setup() is needed in order to register the command id and the buffer, and
this must happen before, or as part of, sending the command, and prior to the
receive flow.
In addition, without duplicating the tracking of PDUs in the TCP stream,
it is not possible to hide the teardown flow from the NVMeTCP driver.
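The per-IO sequence above can be sketched as follows (hypothetical names and
a toy queue object; the real series wires ddp_setup()/ddp_teardown() through
netdev ops, this only illustrates the ordering constraint that setup precedes
the command and teardown precedes the NVMe completion):

```python
class OffloadedQueue:
    """Toy model of a DDP-offloaded queue; records the order of steps."""
    def __init__(self):
        self.log = []

    def ddp_setup(self, cmd_id: int, buf: bytearray):
        # Register command id + destination buffer with the HW *before*
        # the command goes out, so the NIC can place data directly.
        self.log.append("ddp_setup")

    def send_cmd(self, cmd_id: int):
        # Existing command-send flow, unchanged by the offload.
        self.log.append("send_cmd")

    def recv_data(self, cmd_id: int):
        # Existing read_sock() flow: by the time the driver parses the
        # PDU, the payload is already in the registered buffer.
        self.log.append("recv_data")

    def ddp_teardown(self, cmd_id: int):
        # Per-IO HW invalidation, before posting the NVMe completion.
        self.log.append("ddp_teardown")

def read_io(q: OffloadedQueue, cmd_id: int, buf: bytearray):
    q.ddp_setup(cmd_id, buf)
    q.send_cmd(cmd_id)
    q.recv_data(cmd_id)
    q.ddp_teardown(cmd_id)
```

A recvmsg()-based interface would sit inside recv_data() only, which is why
it cannot express the setup and teardown steps on its own.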

> - And all of the resync flow would be something that a separate
>    ulp socket provider would take care of. Similar to how TLS presents
>    itself to a tcp application. So the application does not need to be
>    aware of it.

The resync flow requires awareness of TCP sequence numbers and NVMe PDU
boundaries. If we hide it from the NVMeTCP driver, we would again have to
duplicate the NVMe PDU tracking code.
TLS is a stream protocol and maps cleanly onto TCP socket operations.
NVMeTCP, on the other hand, is a request-response protocol; on the data path,
the two are not comparable.
While it could be done by contorting recvmsg() with new flags, adding input
fields to struct msghdr, and changing the behaviour of the uAPI, that would
add far more friction than a separate ULP DDP API made specifically for this
purpose.
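The PDU tracking that resync depends on is simple to state but has to live
somewhere: given where the current PDU starts in TCP sequence space and each
PDU's length, the driver can answer the NIC's question "where is the next PDU
header at or after sequence X?". A minimal sketch (illustrative only; a real
driver tracks one boundary incrementally rather than a list of lengths):

```python
def next_pdu_boundary(start_seq: int, pdu_lens: list, target_seq: int) -> int:
    """Walk PDU boundaries in TCP sequence space and return the first
    PDU-header offset at or after target_seq -- the answer the driver
    would hand the NIC so it can re-lock onto the stream after loss
    or reordering."""
    seq = start_seq
    for pdu_len in pdu_lens:
        if seq >= target_seq:
            return seq
        seq += pdu_len
    return seq  # boundary right after the last known PDU
```

This state is exactly the PDU parsing the NVMeTCP driver already does, which
is why a TLS-style transparent ULP provider would end up duplicating it.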

> 
> I'm not sure that such interface could cover everything that is needed,
> but what I'm trying to convey, is that the current interface limits the
> usability for almost anything else. Please correct me if I'm wrong.
> Is this designed to also cater anything else outside of the kernel
> nvme-tcp host driver?

As part of this series, we are targeting the kernel nvme-tcp host driver, 
and later we are planning to add support for the kernel nvme-tcp target driver.

The ULP layer was designed to be generic for other request-response based 
protocols.

> 
> > Compatibility
> > =============
> > * The offload works with bare-metal or SRIOV.
> > * The HW can support up to 64K connections per device (assuming no
> >    other HW accelerations are used). In this series, we will introduce
> >    the support for up to 4k connections, and we have plans to increase it.
> > * SW TLS could not work together with the NVMeTCP offload as the HW
> >    will need to track the NVMeTCP headers in the TCP stream.
> 
> Can't say I like that.

Ideally both TLS offload and NVMeTCP offload would be supported together;
not being able to combine them is a HW limit of our current generation.

> 
> > * The ConnectX HW supports HW TLS, but in ConnectX-7 those features
> >    could not co-exist (and it is not part of this series).
> > * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
> >    don’t see the need for this feature yet.
> > * NVMe poll queues are not in the scope of this series.
> 
> bonding/teaming?

We are planning to add bonding/teaming support as part of the incremental
features. It will follow the existing design of the mlx HW TLS offload.

> 
> >
> > Future Work
> > ===========
> > * NVMeTCP transmit offload.
> > * NVMeTCP host offloads incremental features.
> > * NVMeTCP target offload.
> 
> Which target? which host?

Kernel nvme-tcp host driver and kernel nvme-tcp target driver.


Thread overview: 34+ messages
2023-02-03 13:26 [PATCH v11 00/25] nvme-tcp receive offloads Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 01/25] net: Introduce direct data placement tcp offload Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 02/25] net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS} Aurelien Aptel
2023-02-04  3:42   ` Jakub Kicinski
2023-02-03 13:26 ` [PATCH v11 03/25] net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats Aurelien Aptel
2023-02-04  3:46   ` Jakub Kicinski
2023-02-03 13:26 ` [PATCH v11 04/25] Documentation: document netlink ULP_DDP_GET/SET messages Aurelien Aptel
2023-02-04  3:41   ` Jakub Kicinski
2023-02-06  9:48     ` Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 05/25] iov_iter: skip copy if src == dst for direct data placement Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 06/25] net/tls,core: export get_netdev_for_sock Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 07/25] nvme-tcp: Add DDP offload control path Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 08/25] nvme-tcp: Add DDP data-path Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 09/25] nvme-tcp: RX DDGST offload Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 10/25] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 11/25] nvme-tcp: Add modparam to control the ULP offload enablement Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 12/25] Documentation: add ULP DDP offload documentation Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 13/25] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 14/25] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 15/25] net/mlx5e: Have mdev pointer directly on the icosq structure Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 16/25] net/mlx5e: Refactor doorbell function to allow avoiding a completion Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 17/25] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 18/25] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 19/25] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 20/25] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 21/25] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 22/25] net/mlx5e: NVMEoTCP, ddp setup and resync Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 23/25] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 24/25] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 25/25] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel
2023-02-23 15:33 ` [PATCH v11 00/25] nvme-tcp receive offloads Sagi Grimberg
2023-03-01 19:13   ` Shai Malin [this message]
2023-03-07 10:11     ` Sagi Grimberg
2023-03-22 10:19       ` Shai Malin
