From: Sagi Grimberg <sagi@grimberg.me>
To: Shai Malin <smalin@nvidia.com>,
	Aurelien Aptel <aaptel@nvidia.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"hch@lst.de" <hch@lst.de>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"axboe@fb.com" <axboe@fb.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"kuba@kernel.org" <kuba@kernel.org>
Cc: "aurelien.aptel@gmail.com" <aurelien.aptel@gmail.com>,
	"malin1024@gmail.com" <malin1024@gmail.com>,
	Or Gerlitz <ogerlitz@nvidia.com>, Yoray Zack <yorayz@nvidia.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: Re: [PATCH v11 00/25] nvme-tcp receive offloads
Date: Tue, 7 Mar 2023 12:11:43 +0200
Message-ID: <81c57b4b-4568-baca-bf23-94b6e94873a0@grimberg.me>
In-Reply-To: <DM6PR12MB3564B3D0D489B7F4325E69D9BCAD9@DM6PR12MB3564.namprd12.prod.outlook.com>


> Hi Sagi,

Hey Shai,

> On Thu, 23 Feb 2023, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hey Aurelien and Co,
>>
>> I've spent some time today looking at the last iteration of this,
>> What I cannot understand, is how will this ever be used outside
>> of the kernel nvme-tcp host driver?
>>
>> It seems that the interface is designed to fit only a kernel
>> consumer, and a very specific one.
> 
> As part of this series, we are only covering the kernel nvme-tcp host driver.
> The ULP layer we are introducing was designed also for other kernel drivers
> such as the kernel nvme-tcp target and iSCSI.

Yes, I can see how this can fit iscsi, but my question was more for
userspace.

>> Have you considered using a more standard interface for this,
>> such that spdk or an io_uring based initiator can use it?
> 
> The main problem, which I will explain in more detail (under the digest,
> teardown and resync flows), is that using a more standard interface that
> hides what the HW needs would require duplicating the NVMeTCP logic of
> tracking the PDUs in the TCP stream – this seems wrong to us.

Hmm. It's not really the entire logic, it's only the PDU/data
boundaries. All data coming to the host is structured, with a standard
C2HData PDU.
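
For reference, following those boundaries only takes the common header
that every NVMe/TCP PDU starts with, i.e. struct nvme_tcp_hdr from
include/linux/nvme-tcp.h:

	struct nvme_tcp_hdr {
		__u8	type;	/* e.g. nvme_tcp_c2h_data */
		__u8	flags;
		__u8	hlen;	/* pdu header length */
		__u8	pdo;	/* pdu data offset */
		__le32	plen;	/* total pdu length: header + data + digests */
	};

plen tells you where the next PDU starts, so the parser is little more
than "read the common header, skip plen bytes".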

If we assume for the sake of the discussion that such a parser exists,
what would the interface to the ULP look like?

> I can add that we are also working on an end-to-end user-space design for SPDK
> and we don’t see how the two designs could use the same socket APIs without
> impacting the performance gain of both cases.

Can you share a bit more about how this will be done for SPDK? SPDK runs
on normal sockets today, so I'd love to know how you plan to introduce
it there.

> 
>>
>> To me it appears that:
>> - ddp limits can be obtained via getsockopt
>> - sk_add/sk_del can be done via setsockopt
>> - offloaded DDGST crc can be obtained via something like
>>     msghdr.msg_control
> 
> If we wish to hide it from the NVMeTCP driver, it will require us to duplicate
> the NVMeTCP logic of tracking the PDUs in the TCP stream.
> In the positive case, when there are no DDGST errors, nothing needs to be
> done in the NVMeTCP driver, but in the case of errors (or just an
> out-of-order packet in the middle of the PDU), the DDGST will need to be
> calculated in SW, and it seems wrong to us to duplicate that outside of the
> NVMeTCP driver.

I didn't suggest that the recalc would be hidden; if the calculation
fails, the ULP can do it.

Something like:
	ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg));
	if (ddp_ddgst->valid) {
		/* great, the HW already verified the digest */
	} else {
		/* fall back to a SW digest calculation in the ULP */
		recalc_ddgst();
	}
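
And the limits / sk_add / sk_del part could in principle be plain
socket options. Roughly, as a sketch only (SOL_ULP_DDP and the option
names here are made up, just to show the shape):

	struct ulp_ddp_limits limits;
	socklen_t len = sizeof(limits);
	int enable = 1;

	/* query what the device can offload (made-up option) */
	getsockopt(fd, SOL_ULP_DDP, ULP_DDP_LIMITS, &limits, &len);

	/* enable the offload on this socket (made-up option) */
	setsockopt(fd, SOL_ULP_DDP, ULP_DDP_SK_ADD, &enable, sizeof(enable));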

> 
>> - Perhaps for setting up the offload per IO, recvmsg would be the
>>     vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
>>     all the details of what the HW needs (the command_id would be set
>>     somewhere in the msghdr).
> 
> Our design includes the following steps per IO:
> - ddp_setup (register the command id and buffer to the HW)
> - The existing flow which sends the command
> - The existing flow of read_sock()
> - ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)
> 
> Using recvmsg will only replace read_sock(), but this part of the NVMeTCP
> driver is not impacted by the offload design.
> The ddp_setup is needed in order to set the command id and the buffer, and this
> needs to be done before or as part of sending the command, and prior to the
> receive flow.

I agree, I didn't suggest replacing the entire read_sock flow
(although that can change to recvmsg as well).

I was thinking more along the lines of MSG_PEEK, which doesn't actually
read anything, where the driver, prior to sending the command, would do:

	// do a DDP recvmsg to setup direct data placement
	msg = { .msg_flags = MSG_DDP_SETUP };
	iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
	sock_recvmsg(queue->sock, &msg, msg.msg_flags);

	...

	// now send the cmd to the controller
	nvme_tcp_setup_cmd_pdu
	nvme_tcp_try_send_cmd_pdu
	...

That is what I was thinking about.

> In addition, without duplicating the tracking of the PDUs in the TCP stream,
> it is not possible to hide the teardown flow from the NVMeTCP driver.

The teardown flow is a bummer because it delays the nvme completion.
Is it OK to dma_unmap and then schedule the ddp teardown asynchronously,
without delaying the nvme completion?
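
I.e. roughly (a sketch only, the work item and field names are made up):

	/* complete the request right away... */
	dma_unmap_sg(dev, req->sg_table.sgl, req->nents, DMA_FROM_DEVICE);
	nvme_complete_rq(blk_mq_rq_from_pdu(req));

	/* ...and let the HW context teardown finish in the background */
	queue_work(nvme_wq, &req->ddp_teardown_work);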

How does TLS offload handle the async teardown?

> 
>> - And all of the resync flow would be something that a separate
>>     ulp socket provider would take care of. Similar to how TLS presents
>>     itself to a tcp application. So the application does not need to be
>>     aware of it.
> 
> The resync flow requires awareness of TCP sequence numbers and NVMe
> PDUs boundaries. If we hide it from the NVMeTCP driver we would have
> to again duplicate NVMe PDU tracking code.

It's really a pretty trivial part of the PDU tracking: just knowing when
a C2HData PDU starts...
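
Roughly (a sketch, the names are made up):

	/* recv parsing just hit the start of a pdu at tcp sequence 'seq';
	 * if the HW lost sync and asked for a resync, tell it where the
	 * pdu begins
	 */
	if (unlikely(queue->ddp_resync_requested))
		ddp_resync(queue, seq);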

> TLS is a stream protocol and maps cleanly to TCP socket operations.
> NVMeTCP on the other hand is a request-response protocol.
> On the data path, it's not comparable.

Is it? It looks similar to me in the sense that the input stream needs to
know when a msg starts and ends. The only difference is that the ULP
needs to provide the context to the offload in a timely manner.

> While it could be done by contorting recvmsg() with new flags, adding input
> fields to the msghdr struct and changing the behaviour of uAPI, it will add
> a lot more friction than a separate ULP DDP API specifically made for this
> purpose.

This may be true. I was merely asking if this was a path that was
genuinely explored and ruled out. To me, it sounds like it can
work, and would also allow userspace to gain from it automatically.
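
E.g. a userspace initiator could then do roughly the same dance, again
assuming the made-up MSG_DDP_SETUP flag and DDGST cmsg from above:

	struct iovec iov = { .iov_base = data_buf, .iov_len = data_len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	/* register the placement buffer before sending the command pdu */
	recvmsg(fd, &msg, MSG_DDP_SETUP);

	/* ...send the command, read the response as usual; the DDGST
	 * verdict comes back as a cmsg on the completing recvmsg()
	 */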

>> I'm not sure that such an interface could cover everything that is needed,
>> but what I'm trying to convey is that the current interface limits the
>> usability for almost anything else. Please correct me if I'm wrong.
>> Is this designed to also cater to anything else outside of the kernel
>> nvme-tcp host driver?
> 
> As part of this series, we are targeting the kernel nvme-tcp host driver,
> and later we are planning to add support for the kernel nvme-tcp target driver.
> 
> The ULP layer was designed to be generic for other request-response based
> protocols.

I understand.

> 
>>
>>> Compatibility
>>> =============
>>> * The offload works with bare-metal or SRIOV.
>>> * The HW can support up to 64K connections per device (assuming no
>>>     other HW accelerations are used). In this series, we will introduce
>>>     support for up to 4K connections, and we have plans to increase it.
>>> * SW TLS cannot work together with the NVMeTCP offload, as the HW
>>>     needs to track the NVMeTCP headers in the TCP stream.
>>
>> Can't say I like that.
> 
> The answer should be to support both TLS offload and NVMeTCP offload;
> not being able to combine them is a HW limit of our current generation.

I can't complain about this because we don't have TLS supported in 
nvme-tcp. But _can_ they work together on future HW?

> 
>>
>>> * The ConnectX HW supports HW TLS, but in ConnectX-7 those features
>>>     cannot co-exist (and it is not part of this series).
>>> * The ConnectX-7 NVMeTCP offload HW can support tunneling, but we
>>>     don’t see the need for this feature yet.
>>> * NVMe poll queues are not in the scope of this series.
>>
>> bonding/teaming?
> 
> We are planning to add it as part of the "incremental feature".
> It will follow the existing design of the mlx HW TLS.

OK.

>>> Future Work
>>> ===========
>>> * NVMeTCP transmit offload.
>>> * NVMeTCP host offloads incremental features.
>>> * NVMeTCP target offload.
>>
>> Which target? which host?
> 
> Kernel nvme-tcp host driver and kernel nvme-tcp target driver.

Look, I did spend time looking into this patchset several times
in the past, and also provided feedback. The reason I'm questioning
the interface now is that I'm wondering if it can be less intrusive
and use the common socket interface instead of inventing a new one.
