From: Shai Malin <smalin@nvidia.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	Aurelien Aptel <aaptel@nvidia.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"hch@lst.de" <hch@lst.de>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"axboe@fb.com" <axboe@fb.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"kuba@kernel.org" <kuba@kernel.org>
Cc: "aurelien.aptel@gmail.com" <aurelien.aptel@gmail.com>,
	"malin1024@gmail.com" <malin1024@gmail.com>,
	Or Gerlitz <ogerlitz@nvidia.com>, Yoray Zack <yorayz@nvidia.com>,
	Boris Pismenny <borisp@nvidia.com>
Subject: RE: [PATCH v11 00/25] nvme-tcp receive offloads
Date: Wed, 1 Mar 2023 19:13:58 +0000	[thread overview]
Message-ID: <DM6PR12MB3564B3D0D489B7F4325E69D9BCAD9@DM6PR12MB3564.namprd12.prod.outlook.com> (raw)
In-Reply-To: <72746760-f045-d7bc-1557-255720d7638d@grimberg.me>

Hi Sagi,

On Thu, 23 Feb 2023, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hey Aurelien and Co,
> 
> I've spent some time today looking at the last iteration of this,
> What I cannot understand, is how will this ever be used outside
> of the kernel nvme-tcp host driver?
> 
> It seems that the interface is designed to fit only a kernel
> consumer, and a very specific one.

As part of this series, we are only covering the kernel nvme-tcp host driver.
The ULP layer we are introducing was also designed for other kernel drivers,
such as the kernel nvme-tcp target and iSCSI.

> 
> Have you considered using a more standard interfaces to use this
> such that spdk or an io_uring based initiator can use it?

The main problem, which I will explain in more detail below (under the digest,
teardown and resync flows), is that a more standard interface that hides what
the HW needs would require duplicating the NVMeTCP logic of tracking PDUs in
the TCP stream. This seems wrong to us.

I can add that we are also working on an end-to-end user-space design for SPDK,
and we don't see how the two designs could share the same socket APIs without
hurting the performance gains of both.

> 
> To me it appears that:
> - ddp limits can be obtained via getsockopt
> - sk_add/sk_del can be done via setsockopt
> - offloaded DDGST crc can be obtained via something like
>    msghdr.msg_control

Hiding it from the NVMeTCP driver would require us to duplicate the NVMeTCP
logic of tracking PDUs in the TCP stream.
In the common case, when there are no DDGST errors, the NVMeTCP driver has
nothing to do. But on errors (or simply an out-of-order packet in the middle
of a PDU), the DDGST has to be calculated in SW, and it seems wrong to us to
duplicate that calculation outside of the NVMeTCP driver.
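To make the SW fallback concrete, here is a minimal host-side sketch in
Python (purely illustrative; the function names are ours, not the driver's)
of the decision the driver cannot avoid making: trust the digest when the
NIC could track the PDU and verified it, or recompute CRC32C in software
when it could not:

```python
# CRC-32C (Castagnoli), the polynomial NVMe/TCP uses for DDGST.
# Bitwise reference implementation -- a real driver would use the
# kernel's crc32c() from the crypto library, not this loop.
def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def verify_ddgst(pdu_data: bytes, wire_digest: int, hw_verified: bool) -> bool:
    """If the NIC tracked the PDU and already verified its digest, no SW
    work is needed.  Otherwise (e.g. an out-of-order packet landed in the
    middle of the PDU) the driver must compute the digest itself."""
    if hw_verified:
        return True
    return crc32c(pdu_data) == wire_digest
```

The point of the sketch is the branch: whichever layer owns `hw_verified`
must also own the PDU tracking that decides it, which is why pushing this
behind a generic socket API duplicates the NVMeTCP state machine.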

> - Perhaps for setting up the offload per IO, recvmsg would be the
>    vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
>    all the details of what the HW needs (the command_id would be set
>    somewhere in the msghdr).

Our design includes the following steps per IO:
- ddp_setup (register the command id and buffer to the HW)
- The existing flow which sends the command
- The existing flow of read_sock()
- ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)

Using recvmsg() would only replace read_sock(), but that part of the NVMeTCP
driver is not impacted by the offload design.
ddp_setup() is needed in order to register the command id and the buffer, and
this must happen before, or as part of, sending the command, and prior to the
receive flow.
In addition, without duplicating the tracking of PDUs in the TCP stream,
it is not possible to hide the teardown flow from the NVMeTCP driver.
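The per-IO sequence above can be sketched as follows (hypothetical names and
a toy queue object; the real series wires ddp_setup()/ddp_teardown() through
netdev ops, this only illustrates the ordering constraint that setup precedes
the command and teardown precedes the NVMe completion):

```python
class OffloadedQueue:
    """Toy model of a DDP-offloaded queue; records the order of steps."""
    def __init__(self):
        self.log = []

    def ddp_setup(self, cmd_id: int, buf: bytearray):
        # Register command id + destination buffer with the HW *before*
        # the command goes out, so the NIC can place data directly.
        self.log.append("ddp_setup")

    def send_cmd(self, cmd_id: int):
        # Existing command-send flow, unchanged by the offload.
        self.log.append("send_cmd")

    def recv_data(self, cmd_id: int):
        # Existing read_sock() flow: by the time the driver parses the
        # PDU, the payload is already in the registered buffer.
        self.log.append("recv_data")

    def ddp_teardown(self, cmd_id: int):
        # Per-IO HW invalidation, before posting the NVMe completion.
        self.log.append("ddp_teardown")

def read_io(q: OffloadedQueue, cmd_id: int, buf: bytearray):
    q.ddp_setup(cmd_id, buf)
    q.send_cmd(cmd_id)
    q.recv_data(cmd_id)
    q.ddp_teardown(cmd_id)
```

A recvmsg()-based interface would sit inside recv_data() only, which is why
it cannot express the setup and teardown steps on its own.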

> - And all of the resync flow would be something that a separate
>    ulp socket provider would take care of. Similar to how TLS presents
>    itself to a tcp application. So the application does not need to be
>    aware of it.

The resync flow requires awareness of TCP sequence numbers and NVMe PDU
boundaries. If we hide it from the NVMeTCP driver, we would again have to
duplicate the NVMe PDU tracking code.
TLS is a stream protocol and maps cleanly onto TCP socket operations.
NVMeTCP, on the other hand, is a request-response protocol; on the data path,
the two are not comparable.
While it could be done by contorting recvmsg() with new flags, adding input
fields to struct msghdr, and changing the behaviour of the uAPI, that would
add far more friction than a separate ULP DDP API made specifically for this
purpose.
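The PDU tracking that resync depends on is simple to state but has to live
somewhere: given where the current PDU starts in TCP sequence space and each
PDU's length, the driver can answer the NIC's question "where is the next PDU
header at or after sequence X?". A minimal sketch (illustrative only; a real
driver tracks one boundary incrementally rather than a list of lengths):

```python
def next_pdu_boundary(start_seq: int, pdu_lens: list, target_seq: int) -> int:
    """Walk PDU boundaries in TCP sequence space and return the first
    PDU-header offset at or after target_seq -- the answer the driver
    would hand the NIC so it can re-lock onto the stream after loss
    or reordering."""
    seq = start_seq
    for pdu_len in pdu_lens:
        if seq >= target_seq:
            return seq
        seq += pdu_len
    return seq  # boundary right after the last known PDU
```

This state is exactly the PDU parsing the NVMeTCP driver already does, which
is why a TLS-style transparent ULP provider would end up duplicating it.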

> 
> I'm not sure that such interface could cover everything that is needed,
> but what I'm trying to convey, is that the current interface limits the
> usability for almost anything else. Please correct me if I'm wrong.
> Is this designed to also cater anything else outside of the kernel
> nvme-tcp host driver?

As part of this series, we are targeting the kernel nvme-tcp host driver, 
and later we are planning to add support for the kernel nvme-tcp target driver.

The ULP layer was designed to be generic for other request-response based 
protocols.

> 
> > Compatibility
> > =============
> > * The offload works with bare-metal or SRIOV.
> > * The HW can support up to 64K connections per device (assuming no
> >    other HW accelerations are used). In this series, we will introduce
> >    the support for up to 4k connections, and we have plans to increase it.
> > * SW TLS could not work together with the NVMeTCP offload as the HW
> >    will need to track the NVMeTCP headers in the TCP stream.
> 
> Can't say I like that.

Ideally both TLS offload and NVMeTCP offload would be supported together;
not being able to combine them is a HW limit of our current generation.

> 
> > * The ConnectX HW supports HW TLS, but in ConnectX-7 those features
> >    could not co-exist (and it is not part of this series).
> > * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
> >    don’t see the need for this feature yet.
> > * NVMe poll queues are not in the scope of this series.
> 
> bonding/teaming?

We are planning to add bonding/teaming support as part of the incremental
features. It will follow the existing design of the mlx HW TLS offload.

> 
> >
> > Future Work
> > ===========
> > * NVMeTCP transmit offload.
> > * NVMeTCP host offloads incremental features.
> > * NVMeTCP target offload.
> 
> Which target? which host?

Kernel nvme-tcp host driver and kernel nvme-tcp target driver.


Thread overview: 34+ messages
2023-02-03 13:26 [PATCH v11 00/25] nvme-tcp receive offloads Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 01/25] net: Introduce direct data placement tcp offload Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 02/25] net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS} Aurelien Aptel
2023-02-04  3:42   ` Jakub Kicinski
2023-02-03 13:26 ` [PATCH v11 03/25] net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats Aurelien Aptel
2023-02-04  3:46   ` Jakub Kicinski
2023-02-03 13:26 ` [PATCH v11 04/25] Documentation: document netlink ULP_DDP_GET/SET messages Aurelien Aptel
2023-02-04  3:41   ` Jakub Kicinski
2023-02-06  9:48     ` Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 05/25] iov_iter: skip copy if src == dst for direct data placement Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 06/25] net/tls,core: export get_netdev_for_sock Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 07/25] nvme-tcp: Add DDP offload control path Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 08/25] nvme-tcp: Add DDP data-path Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 09/25] nvme-tcp: RX DDGST offload Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 10/25] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 11/25] nvme-tcp: Add modparam to control the ULP offload enablement Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 12/25] Documentation: add ULP DDP offload documentation Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 13/25] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 14/25] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 15/25] net/mlx5e: Have mdev pointer directly on the icosq structure Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 16/25] net/mlx5e: Refactor doorbell function to allow avoiding a completion Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 17/25] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 18/25] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2023-02-03 13:26 ` [PATCH v11 19/25] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 20/25] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 21/25] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 22/25] net/mlx5e: NVMEoTCP, ddp setup and resync Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 23/25] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 24/25] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2023-02-03 13:27 ` [PATCH v11 25/25] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel
2023-02-23 15:33 ` [PATCH v11 00/25] nvme-tcp receive offloads Sagi Grimberg
2023-03-01 19:13   ` Shai Malin [this message]
2023-03-07 10:11     ` Sagi Grimberg
2023-03-22 10:19       ` Shai Malin
