From: Sagi Grimberg <sagi@grimberg.me>
To: Shai Malin <smalin@nvidia.com>,
	Aurelien Aptel <aaptel@nvidia.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"hch@lst.de" <hch@lst.de>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"axboe@fb.com" <axboe@fb.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"kuba@kernel.org" <kuba@kernel.org>
Cc: "aurelien.aptel@gmail.com" <aurelien.aptel@gmail.com>,
	"malin1024@gmail.com" <malin1024@gmail.com>,
	Or Gerlitz <ogerlitz@nvidia.com>, Yoray Zack <yorayz@nvidia.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: Re: [PATCH v11 00/25] nvme-tcp receive offloads
Date: Tue, 7 Mar 2023 12:11:43 +0200
Message-ID: <81c57b4b-4568-baca-bf23-94b6e94873a0@grimberg.me>
In-Reply-To: <DM6PR12MB3564B3D0D489B7F4325E69D9BCAD9@DM6PR12MB3564.namprd12.prod.outlook.com>


> Hi Sagi,

Hey Shai,

> On Thu, 23 Feb 2023, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hey Aurelien and Co,
>>
>> I've spent some time today looking at the last iteration of this,
>> What I cannot understand, is how will this ever be used outside
>> of the kernel nvme-tcp host driver?
>>
>> It seems that the interface is designed to fit only a kernel
>> consumer, and a very specific one.
> 
> As part of this series, we are only covering the kernel nvme-tcp host driver.
> The ULP layer we are introducing was designed also for other kernel drivers
> such as the kernel nvme-tcp target and iSCSI.

Yes, I can see how this can fit iscsi, but my question was more for
userspace.

>> Have you considered using a more standard interface for this,
>> such that spdk or an io_uring based initiator can use it?
> 
> The main problem, which I will explain in more detail (under the digest,
> teardown and resync flows), is that using a more standard interface that
> hides what the HW needs would require duplicating the NVMeTCP logic of
> tracking the PDUs in the TCP stream – this seems wrong to us.

Hmm. It's not really the entire logic, it's only the PDU/data
boundaries. All data coming to the host is structured, with a standard
C2HData PDU.
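
For reference, following those boundaries only takes the common header
that every NVMe/TCP PDU starts with, i.e. struct nvme_tcp_hdr from
include/linux/nvme-tcp.h:

	struct nvme_tcp_hdr {
		__u8	type;	/* e.g. nvme_tcp_c2h_data */
		__u8	flags;
		__u8	hlen;	/* pdu header length */
		__u8	pdo;	/* pdu data offset */
		__le32	plen;	/* total pdu length: header + data + digests */
	};

plen tells you where the next PDU starts, so the parser is little more
than "read the common header, skip plen bytes".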

If we assume for the sake of the discussion that such a parser exists,
what would the interface to the ULP look like?

> I can add that we are also working on an end-to-end user-space design for SPDK
> and we don’t see how the two designs could use the same socket APIs without
> impacting the performance gain of both cases.

Can you share a bit more about how this will be done for SPDK? SPDK runs
on normal sockets today, so I'd love to know how you plan to introduce
it there.

> 
>>
>> To me it appears that:
>> - ddp limits can be obtained via getsockopt
>> - sk_add/sk_del can be done via setsockopt
>> - offloaded DDGST crc can be obtained via something like
>>     msghdr.msg_control
> 
> If we wish to hide it from the NVMeTCP driver, it will require us to duplicate
> the NVMeTCP logic of tracking the PDUs in the TCP stream.
> In the positive case, when there are no DDGST errors, nothing needs to be
> done in the NVMeTCP driver, but in the case of errors (or just an
> out-of-order packet in the middle of the PDU), the DDGST will need to be
> calculated in SW, and it seems wrong to us to duplicate that outside of the
> NVMeTCP driver.

I didn't suggest that the recalc would be hidden; if the calculation
fails, the ULP can do it.

Something like:
	ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg));
	if (ddp_ddgst->valid) {
		/* great, the HW already verified the digest */
	} else {
		/* fall back to a SW digest calculation in the ULP */
		recalc_ddgst();
	}
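
And the limits / sk_add / sk_del part could in principle be plain
socket options. Roughly, as a sketch only (SOL_ULP_DDP and the option
names here are made up, just to show the shape):

	struct ulp_ddp_limits limits;
	socklen_t len = sizeof(limits);
	int enable = 1;

	/* query what the device can offload (made-up option) */
	getsockopt(fd, SOL_ULP_DDP, ULP_DDP_LIMITS, &limits, &len);

	/* enable the offload on this socket (made-up option) */
	setsockopt(fd, SOL_ULP_DDP, ULP_DDP_SK_ADD, &enable, sizeof(enable));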

> 
>> - Perhaps for setting up the offload per IO, recvmsg would be the
>>     vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
>>     all the details of what the HW needs (the command_id would be set
>>     somewhere in the msghdr).
> 
> Our design includes the following steps per IO:
> - ddp_setup (register the command id and buffer to the HW)
> - The existing flow which sends the command
> - The existing flow of read_sock()
> - ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)
> 
> Using recvmsg will only replace read_sock(), but this part of the NVMeTCP
> driver is not impacted by the offload design.
> The ddp_setup is needed in order to set the command id and the buffer, and this
> needs to be done before or as part of sending the command, and prior to the
> receive flow.

I agree, I didn't suggest replacing the entire read_sock flow
(although that can change to recvmsg as well).

I was thinking more along the lines of MSG_PEEK, which doesn't actually
read anything, where the driver, prior to sending the command, would do:

	// do a DDP recvmsg to setup direct data placement
	msg = { .msg_flags = MSG_DDP_SETUP };
	iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
	sock_recvmsg(queue->sock, &msg, msg.msg_flags);

	...

	// now send the cmd to the controller
	nvme_tcp_setup_cmd_pdu
	nvme_tcp_try_send_cmd_pdu
	...

That is what I was thinking about.

> In addition, without duplicating the tracking of the PDUs in the TCP stream,
> it is not possible to hide the teardown flow from the NVMeTCP driver.

The teardown flow is a bummer because it delays the nvme completion.
Is it OK to dma_unmap and then schedule the ddp teardown asynchronously,
without delaying the nvme completion?
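
I.e. roughly (a sketch only, the work item and field names are made up):

	/* complete the request right away... */
	dma_unmap_sg(dev, req->sg_table.sgl, req->nents, DMA_FROM_DEVICE);
	nvme_complete_rq(blk_mq_rq_from_pdu(req));

	/* ...and let the HW context teardown finish in the background */
	queue_work(nvme_wq, &req->ddp_teardown_work);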

How does TLS offload handle the async teardown?

> 
>> - And all of the resync flow would be something that a separate
>>     ulp socket provider would take care of. Similar to how TLS presents
>>     itself to a tcp application. So the application does not need to be
>>     aware of it.
> 
> The resync flow requires awareness of TCP sequence numbers and NVMe
> PDUs boundaries. If we hide it from the NVMeTCP driver we would have
> to again duplicate NVMe PDU tracking code.

It's really a pretty trivial part of the PDU tracking: just knowing when
a C2HData PDU starts...
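
Roughly (a sketch, the names are made up):

	/* recv parsing just hit the start of a pdu at tcp sequence 'seq';
	 * if the HW lost sync and asked for a resync, tell it where the
	 * pdu begins
	 */
	if (unlikely(queue->ddp_resync_requested))
		ddp_resync(queue, seq);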

> TLS is a stream protocol and maps cleanly to TCP socket operations.
> NVMeTCP on the other hand is a request-response protocol.
> On the data path, it's not comparable.

Is it? It looks similar to me in the sense that the input stream needs to
know when a msg starts and ends. The only difference is that the ULP
needs to provide the context to the offload in a timely manner.

> While it could be done by contorting recvmsg() with new flags, adding input
> fields to the msghdr struct and changing the behaviour of uAPI, it will add
> a lot more friction than a separate ULP DDP API specifically made for this
> purpose.

This may be true. I was merely asking if this was a path that was
genuinely explored and ruled out. To me, it sounds like it can
work, and would also allow userspace to gain from it automatically.
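
E.g. a userspace initiator could then do roughly the same dance, again
assuming the made-up MSG_DDP_SETUP flag and DDGST cmsg from above:

	struct iovec iov = { .iov_base = data_buf, .iov_len = data_len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	/* register the placement buffer before sending the command pdu */
	recvmsg(fd, &msg, MSG_DDP_SETUP);

	/* ...send the command, read the response as usual; the DDGST
	 * verdict comes back as a cmsg on the completing recvmsg()
	 */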

>> I'm not sure that such an interface could cover everything that is needed,
>> but what I'm trying to convey is that the current interface limits the
>> usability for almost anything else. Please correct me if I'm wrong.
>> Is this designed to also cater to anything else outside of the kernel
>> nvme-tcp host driver?
> 
> As part of this series, we are targeting the kernel nvme-tcp host driver,
> and later we are planning to add support for the kernel nvme-tcp target driver.
> 
> The ULP layer was designed to be generic for other request-response based
> protocols.

I understand.

> 
>>
>>> Compatibility
>>> =============
>>> * The offload works with bare-metal or SRIOV.
>>> * The HW can support up to 64K connections per device (assuming no
>>>     other HW accelerations are used). In this series, we will introduce
>>>     support for up to 4K connections, and we have plans to increase it.
>>> * SW TLS cannot work together with the NVMeTCP offload, as the HW
>>>     needs to track the NVMeTCP headers in the TCP stream.
>>
>> Can't say I like that.
> 
> The answer should be to support both TLS offload and NVMeTCP offload;
> not being able to combine them is a HW limit of our current generation.

I can't complain about this because we don't have TLS supported in 
nvme-tcp. But _can_ they work together on future HW?

> 
>>
>>> * The ConnectX HW supports HW TLS, but in ConnectX-7 those features
>>>     cannot co-exist (and it is not part of this series).
>>> * The ConnectX-7 NVMeTCP offload HW can support tunneling, but we
>>>     don’t see the need for this feature yet.
>>> * NVMe poll queues are not in the scope of this series.
>>
>> bonding/teaming?
> 
> We are planning to add it as part of the "incremental feature".
> It will follow the existing design of the mlx HW TLS.

OK.

>>> Future Work
>>> ===========
>>> * NVMeTCP transmit offload.
>>> * NVMeTCP host offloads incremental features.
>>> * NVMeTCP target offload.
>>
>> Which target? which host?
> 
> Kernel nvme-tcp host driver and kernel nvme-tcp target driver.

Look, I did spend time looking into this patchset several times
in the past, and also provided feedback. The reason I'm questioning
the interface now is that I'm wondering if it can be less intrusive
and use the common socket interface instead of inventing a new one.
