* [PATCH v1 net-next 00/15] nvme-tcp receive offloads
@ 2020-12-07 21:06 Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
                   ` (15 more replies)
  0 siblings, 16 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz

This series adds support for nvme-tcp receive offloads
that do not mandate offloading the network stack to the device.
Instead, they work together with the TCP stack to offload:
1. the copy from the SKB to the block layer buffers
2. CRC verification of received PDUs

The series implements these as a generic offload infrastructure for storage
protocols, which we call TCP Direct Data Placement (TCP_DDP) and TCP DDP CRC,
respectively. We use this infrastructure to implement NVMe-TCP offload for copy
and CRC. Future implementations can reuse the same infrastructure for other
protocols such as iSCSI.

Note:
These offloads are similar in nature to the packet-based NIC TLS offloads,
which are already upstream (see net/tls/tls_device.c).
You can read more about TLS offload here:
https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Initialization and teardown:
=========================================
The offload for IO queues is initialized after the NVMe-TCP handshake
completes, by calling `nvme_tcp_offload_socket` with the TCP socket of
the nvme_tcp_queue. This operation configures all relevant contexts in
hardware. If it fails, the IO queue proceeds as usual with no offload.
If it succeeds, `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may be
called to perform copy offload, and CRC offload will be used.
This initialization does not change the normal operation of nvme-tcp in
any way besides adding the option to call the above-mentioned NDO
operations.
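
In code, the hook point is the queue start path; condensed from patch 5,
with a failed nvme_tcp_offload_socket() deliberately ignored so that the
queue simply runs unoffloaded:

    if (idx) {
            ret = nvmf_connect_io_queue(nctrl, idx, false);
            /* best effort: sets NVME_TCP_Q_OFFLOADS only on success */
            nvme_tcp_offload_socket(&ctrl->queues[idx]);
    } else {
            ret = nvmf_connect_admin_queue(nctrl);
            /* admin queue: only query and apply the device limits */
            nvme_tcp_offload_limits(&ctrl->queues[idx]);
    }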

For the admin queue, nvme-tcp does not initialize the offload.
Instead, nvme-tcp calls the driver to configure limits for the controller,
such as max_hw_sectors and max_segments; these must be limited to
accommodate potential HW resource limits, and to improve performance.
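
In code (condensed from patch 5), the limit reported by the driver folds
into the controller caps as:

    /* one 4KB page per SGL entry bounds the largest DDP-able request */
    queue->ctrl->ctrl.max_segments   = limits.max_ddp_sgl_len;
    queue->ctrl->ctrl.max_hw_sectors =
            limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9);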

If an error occurs and the IO queue must be closed or reconnected, the
offload is torn down and initialized again. Additionally, we handle netdev
down events via the existing error recovery flow.
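
A rough sketch of that handling (illustrative only; the actual handler is
added in patch 8):

    /* Illustrative: on NETDEV_GOING_DOWN, kick the existing nvme-tcp error
     * recovery for every controller offloaded to the disappearing device.
     */
    static int nvme_tcp_netdev_event(struct notifier_block *this,
                                     unsigned long event, void *ptr)
    {
            struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
            struct nvme_tcp_ctrl *ctrl;

            if (event != NETDEV_GOING_DOWN)
                    return NOTIFY_DONE;

            mutex_lock(&nvme_tcp_ctrl_mutex);
            list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
                    if (ndev == ctrl->offloading_netdev)
                            nvme_tcp_error_recovery(&ctrl->ctrl);
            mutex_unlock(&nvme_tcp_ctrl_mutex);
            return NOTIFY_DONE;
    }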

Copy offload works as follows:
=========================================
The nvme-tcp layer calls the NIC driver to map block layer buffers to the
command id (ccid) using `nvme_tcp_setup_ddp` before sending the read request.
When the response is received, the NIC HW writes the PDU payload directly
into the designated buffer and builds an SKB that points into the destination
buffer; this SKB represents the entire packet received on the wire, but it
points to the block layer buffers. When nvme-tcp attempts to copy data from
this SKB to the block layer buffer, it can skip the copy by checking in the
copying function (memcpy_to_page):
if (src == dst) -> skip copy
Finally, when the PDU has been processed to completion, the nvme-tcp layer
releases the NIC HW context by calling `nvme_tcp_teardown_ddp`, which
asynchronously unmaps the buffers from the NIC HW.

As the last change touches a performance-sensitive function, we are careful
to place it behind a static_key, which is enabled only when this
functionality is actually used for nvme-tcp copy offload.
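
Patch 1 carries the guarded check itself; the key is only flipped while
offloaded queues exist, roughly as follows (an illustrative sketch, the
helper names here are made up and the exact call sites live in the
nvme-tcp patches):

    /* Illustrative: count offloaded queues into the static key so that
     * memcpy_to_page() keeps a plain memcpy() whenever DDP is not in use.
     */
    static void nvme_tcp_skip_copy_enable(struct nvme_tcp_queue *queue)
    {
            if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
                    static_branch_inc(&skip_copy_enabled);
    }

    static void nvme_tcp_skip_copy_disable(struct nvme_tcp_queue *queue)
    {
            if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
                    static_branch_dec(&skip_copy_enabled);
    }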

Asynchronous completion:
=========================================
The NIC must release its mapping between command IDs and the target buffers.
This mapping is released when NVMe-TCP calls the NIC
driver (`nvme_tcp_teardown_ddp`).
As completing IOs is performance critical, we introduce asynchronous
completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC driver, which will
later call NVMe-TCP back to complete the IO (`nvme_tcp_ddp_teardown_done`).
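
Condensed from patch 6, the CQE handler stashes the completion and defers
it for offloaded requests:

    req = blk_mq_rq_to_pdu(rq);
    if (req->offloaded) {
            /* stash the CQE and let the NIC driver finish asynchronously */
            req->status = cqe->status;
            req->result = cqe->result;
            nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
    } else if (!nvme_try_complete_req(rq, cqe->status, cqe->result)) {
            nvme_complete_rq(rq);
    }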

An alternative approach is to move all the functions related to copying from
SKBs to the block layer buffers into the nvme-tcp code - about 200 LOC.

CRC offload works as follows:
=========================================
After offload is initialized, we use the SKB's ddp_crc bit to indicate that:
"there was no problem with the verification of all CRC fields in this packet's
payload". The bit is set to zero if there was an error, or if HW skipped
offload for some reason. If *any* SKB in a PDU has (ddp_crc != 1), then
software must compute and verify the CRC. We perform this check, and the
accompanying software fallback, at the end of processing a received PDU.
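
Condensed from patch 7, the validity bit accumulates per SKB and the
fallback runs once per PDU:

    /* per SKB, while receiving the PDU payload and digest */
    if (queue->ddgst_valid)
            queue->ddgst_valid = skb->ddp_crc;

    /* once the whole PDU (including the received digest) is in */
    offload_fail = !queue->ddgst_valid;
    offload_en = test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags);
    if (!offload_en || offload_fail) {
            if (offload_en && offload_fail)  /* software fallback */
                    nvme_tcp_ddp_ddgst_recalc(queue->rcv_hash, rq);
            nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
            /* compare queue->recv_ddgst with queue->exp_ddgst as before */
    }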

SKB changes:
=========================================
The CRC offload requires an additional bit in the SKB, which is used to
prevent coalescing of SKBs with different CRC offload values. This bit
is similar in concept to the "decrypted" bit.

Performance:
=========================================
The expected performance gain from this offload varies with the block size.
We perform a CPU-cycles breakdown of the copy/CRC operations in nvme-tcp
fio random-read workloads:
for 4K blocks we see up to 11% improvement for a 100% read fio workload,
while for 128K blocks we see up to 52%. If we run nvme-tcp and skip these
operations entirely, we observe a gain of about 1.1x and 2x, respectively.

Resynchronization:
=========================================
The resynchronization flow is performed to reset the hardware tracking of
NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
the NIC driver regarding a possible location of a PDU header, followed by
a response from the nvme-tcp driver.

This flow is rare, and it should happen only after packet loss or
reordering events that involve nvme-tcp PDU headers.
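
Condensed from patches 2 and 5, the two halves of the exchange are:

    /* NIC driver -> nvme-tcp: is TCP sequence 'seq' the start of a PDU? */
    bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
    {
            struct nvme_tcp_queue *queue = sk->sk_user_data;

            atomic64_set(&queue->resync_req, ((u64)seq << 32) | flags);
            return true;
    }

    /* nvme-tcp -> NIC driver: answered from the PDU receive path once the
     * requested sequence number is reached (patch 5 uses atomic64_cmpxchg()
     * so a newer request is never lost)
     */
    netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk, pdu_seq);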

The patches are organized as follows:
=========================================
Patch 1         the iov_iter change to skip copy if (src == dst)
Patches 2-3     the infrastructure for all TCP DDP
                and TCP DDP CRC offloads, respectively.
Patch 4         exposes the get_netdev_for_sock function from TLS
Patch 5         NVMe-TCP changes to call the NIC driver on queue init/teardown
Patch 6         NVMe-TCP changes to call the NIC driver on IO operation
                setup/teardown, and to support async completions.
Patch 7         NVMe-TCP changes to support CRC offload on receive.
                Also, this patch moves the CRC calculation to the end of
                the PDU in case the offload requires a software fallback.
Patch 8         NVMe-TCP handling of netdev events: stop the offload if
                the netdev is going down
Patches 9-15    implement support for NVMe-TCP copy and CRC offload in
                the mlx5 NIC driver as the first user

Testing:
=========================================
This series was tested using fio with various configurations of IO sizes,
depths, MTUs, and with both the SPDK and kernel NVMe-TCP targets.

Future work:
=========================================
A follow-up series will introduce support for transmit side CRC. Then,
we will work on adding support for TLS in NVMe-TCP and combining the
two offloads.

Changes since RFC v1:
=========================================
* Split mlx5 driver patches to several commits
* Fix nvme-tcp handling of recovery flows. In particular, move queue offload
  init/teardown to the start/stop functions.

Ben Ben-ishay (5):
  net/mlx5: Header file changes for nvme-tcp offload
  net/mlx5: Add 128B CQE for NVMEoTCP offload
  net/mlx5e: NVMEoTCP DDP offload control path
  net/mlx5e: NVMEoTCP, data-path for DDP offload
  net/mlx5e: NVMEoTCP statistics

Boris Pismenny (7):
  iov_iter: Skip copy in memcpy_to_page if src==dst
  net: Introduce direct data placement tcp offload
  net: Introduce crc offload for tcp ddp ulp
  net/tls: expose get_netdev_for_sock
  nvme-tcp: Add DDP offload control path
  nvme-tcp: Add DDP data-path
  net/mlx5e: TCP flow steering for nvme-tcp

Or Gerlitz (1):
  nvme-tcp: Deal with netdevice DOWN events

Yoray Zack (2):
  nvme-tcp : Recalculate crc in the end of the capsule
  net/mlx5e: NVMEoTCP workaround CRC after resync

 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   31 +-
 .../net/ethernet/mellanox/mlx5/core/en/fs.h   |    4 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    1 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   13 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |    1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |    1 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |    9 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |   10 +
 .../mellanox/mlx5/core/en_accel/fs_tcp.h      |    2 +-
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 1002 +++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    |  119 ++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        |  270 +++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |   26 +
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |   80 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   39 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   72 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |   37 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |   24 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   16 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |    6 +
 drivers/nvme/host/tcp.c                       |  443 +++++++-
 include/linux/mlx5/device.h                   |   43 +-
 include/linux/mlx5/mlx5_ifc.h                 |  104 +-
 include/linux/mlx5/qp.h                       |    1 +
 include/linux/netdev_features.h               |    4 +
 include/linux/netdevice.h                     |    5 +
 include/linux/skbuff.h                        |    5 +
 include/linux/uio.h                           |    2 +
 include/net/inet_connection_sock.h            |    4 +
 include/net/sock.h                            |   17 +
 include/net/tcp_ddp.h                         |  129 +++
 lib/iov_iter.c                                |   11 +-
 net/Kconfig                                   |   17 +
 net/core/skbuff.c                             |    9 +-
 net/ethtool/common.c                          |    2 +
 net/ipv4/tcp_input.c                          |    7 +
 net/ipv4/tcp_ipv4.c                           |    3 +
 net/ipv4/tcp_offload.c                        |    3 +
 net/tls/tls_device.c                          |   20 +-
 41 files changed, 2551 insertions(+), 54 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
 create mode 100644 include/net/tcp_ddp.h

-- 
2.24.1


* [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-08  0:39   ` David Ahern
  2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

When using direct data placement, the NIC writes some of the payload
directly to the destination buffer and constructs the SKB such that it
points to this data. As a result, the skb_copy_datagram_iter call will
attempt to copy data when it is not necessary.

This patch adds a check to avoid this copy, and a static_key to enable
it when TCP direct data placement is possible.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 11 ++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 72d88566694e..05573d848ff5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -282,4 +282,6 @@ int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
 			    int (*f)(struct kvec *vec, void *context),
 			    void *context);
 
+extern struct static_key_false skip_copy_enabled;
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..206edb051135 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -15,6 +15,9 @@
 
 #define PIPE_PARANOIA /* for now */
 
+DEFINE_STATIC_KEY_FALSE(skip_copy_enabled);
+EXPORT_SYMBOL_GPL(skip_copy_enabled);
+
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
 	size_t wanted = n;				\
@@ -476,7 +479,13 @@ static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t
 static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
 {
 	char *to = kmap_atomic(page);
-	memcpy(to + offset, from, len);
+
+	if (static_branch_unlikely(&skip_copy_enabled)) {
+		if (to + offset != from)
+			memcpy(to + offset, from, len);
+	} else {
+		memcpy(to + offset, from, len);
+	}
 	kunmap_atomic(to);
 }
 
-- 
2.24.1


* [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-08  0:42   ` David Ahern
  2020-12-09  0:57   ` David Ahern
  2020-12-07 21:06 ` [PATCH v1 net-next 03/15] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

This commit introduces direct data placement offload for TCP.
This capability is accompanied by new net_device operations that
configure hardware contexts. There is a context per socket, and a
context per DDP operation. Additionally, a resynchronization routine is
used to assist the hardware in handling TCP OOO, and to continue the
offload. Furthermore, we let the offloading driver advertise its max HW
sectors/segments limits.

Using this interface, the NIC hardware will scatter TCP payload directly
to the BIO pages according to the command_id.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.

Thus, the SKB represents the data on the wire, while it is pointing
to data that is already placed in the destination buffer.
As a result, data from page frags should not be copied out to
the linear part.

As SKBs that use DDP are already very memory efficient, we modify
skb_condense to avoid copying data from fragments to the linear
part of SKBs that belong to a socket that uses DDP offload.

A follow-up patch will use this interface for DDP in NVMe-TCP.
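
For a device driver, wiring into this interface amounts to providing the
ops table and advertising the feature bit; an illustrative sketch (the
names below are made up, the real implementation is in the mlx5e patches
of this series):

    static const struct tcp_ddp_dev_ops my_nic_tcp_ddp_ops = {
            .tcp_ddp_limits   = my_nic_tcp_ddp_limits,
            .tcp_ddp_sk_add   = my_nic_tcp_ddp_sk_add,
            .tcp_ddp_sk_del   = my_nic_tcp_ddp_sk_del,
            .tcp_ddp_setup    = my_nic_tcp_ddp_setup,
            .tcp_ddp_teardown = my_nic_tcp_ddp_teardown,
            .tcp_ddp_resync   = my_nic_tcp_ddp_resync,
    };

    static void my_nic_set_tcp_ddp(struct net_device *netdev)
    {
            netdev->tcp_ddp_ops  = &my_nic_tcp_ddp_ops;
            netdev->hw_features |= NETIF_F_HW_TCP_DDP;
            netdev->features    |= NETIF_F_HW_TCP_DDP;
    }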

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/netdev_features.h    |   2 +
 include/linux/netdevice.h          |   5 ++
 include/net/inet_connection_sock.h |   4 +
 include/net/tcp_ddp.h              | 129 +++++++++++++++++++++++++++++
 net/Kconfig                        |   9 ++
 net/core/skbuff.c                  |   9 +-
 net/ethtool/common.c               |   1 +
 7 files changed, 158 insertions(+), 1 deletion(-)
 create mode 100644 include/net/tcp_ddp.h

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 934de56644e7..fb35dcac03d2 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -84,6 +84,7 @@ enum {
 	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
 
 	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
+	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -157,6 +158,7 @@ enum {
 #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
 #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
 #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
+#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a07c8e431f45..755766976408 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -934,6 +934,7 @@ struct dev_ifalias {
 
 struct devlink;
 struct tlsdev_ops;
+struct tcp_ddp_dev_ops;
 
 struct netdev_name_node {
 	struct hlist_node hlist;
@@ -1930,6 +1931,10 @@ struct net_device {
 	const struct tlsdev_ops *tlsdev_ops;
 #endif
 
+#ifdef CONFIG_TCP_DDP
+	const struct tcp_ddp_dev_ops *tcp_ddp_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..a08b85b53aa8 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
  * @icsk_clean_acked	   Clean acked data hook
+ * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
+ * @icsk_ulp_ddp_data	   ULP direct data placement private data
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -94,6 +96,8 @@ struct inet_connection_sock {
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void __rcu		  *icsk_ulp_data;
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
+	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
+	void __rcu		  *icsk_ulp_ddp_data;
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:5,
diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
new file mode 100644
index 000000000000..df3264be4600
--- /dev/null
+++ b/include/net/tcp_ddp.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * tcp_ddp.h
+ *	Author:	Boris Pismenny <borisp@mellanox.com>
+ *	Copyright (C) 2020 Mellanox Technologies.
+ */
+#ifndef _TCP_DDP_H
+#define _TCP_DDP_H
+
+#include <linux/netdevice.h>
+#include <net/inet_connection_sock.h>
+#include <net/sock.h>
+
+/* limits returned by the offload driver, zero means don't care */
+struct tcp_ddp_limits {
+	int	 max_ddp_sgl_len;
+};
+
+enum tcp_ddp_type {
+	TCP_DDP_NVME = 1,
+};
+
+/**
+ * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
+ * config implementations must use this as the first member.
+ * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
+ */
+struct tcp_ddp_config {
+	enum tcp_ddp_type    type;
+	unsigned char        buf[];
+};
+
+/**
+ * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
+ *
+ * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
+ * @cpda:       controller pdu data alignment (dwords, 0's based)
+ * @dgst:       digest types enabled.
+ *              The netdev will offload crc if ddp_crc is supported.
+ * @queue_size: number of nvme-tcp IO queue elements
+ * @queue_id:   queue identifier
+ * @io_cpu:     cpu core running the IO thread for this queue
+ */
+struct nvme_tcp_ddp_config {
+	struct tcp_ddp_config   cfg;
+
+	u16			pfv;
+	u8			cpda;
+	u8			dgst;
+	int			queue_size;
+	int			queue_id;
+	int			io_cpu;
+};
+
+/**
+ * struct tcp_ddp_io - tcp ddp configuration for an IO request.
+ *
+ * @command_id:  identifier on the wire associated with these buffers
+ * @nents:       number of entries in the sg_table
+ * @sg_table:    describing the buffers for this IO request
+ * @first_sgl:   first SGL in sg_table
+ */
+struct tcp_ddp_io {
+	u32			command_id;
+	int			nents;
+	struct sg_table		sg_table;
+	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
+};
+
+/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
+ *
+ * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
+ *                     the device driver can use this to limit the resources allocated per queue.
+ * @tcp_ddp_sk_add:    add offload for the queue represented by the socket+config pair.
+ *                     this function is used to configure either copy, crc or both offloads.
+ * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
+ * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
+ * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
+ *                     tcp_ddp_io.
+ * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
+ */
+struct tcp_ddp_dev_ops {
+	int (*tcp_ddp_limits)(struct net_device *netdev,
+			      struct tcp_ddp_limits *limits);
+	int (*tcp_ddp_sk_add)(struct net_device *netdev,
+			      struct sock *sk,
+			      struct tcp_ddp_config *config);
+	void (*tcp_ddp_sk_del)(struct net_device *netdev,
+			       struct sock *sk);
+	int (*tcp_ddp_setup)(struct net_device *netdev,
+			     struct sock *sk,
+			     struct tcp_ddp_io *io);
+	int (*tcp_ddp_teardown)(struct net_device *netdev,
+				struct sock *sk,
+				struct tcp_ddp_io *io,
+				void *ddp_ctx);
+	void (*tcp_ddp_resync)(struct net_device *netdev,
+			       struct sock *sk, u32 seq);
+};
+
+#define TCP_DDP_RESYNC_REQ (1 << 0)
+
+/**
+ * struct tcp_ddp_ulp_ops - Interface to register upper layer Direct Data Placement (DDP) TCP offload
+ */
+struct tcp_ddp_ulp_ops {
+	/* NIC requests ulp to indicate if @seq is the start of a message */
+	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
+	/* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
+	void (*ddp_teardown_done)(void *ddp_ctx);
+};
+
+/**
+ * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
+ * use this as the first member.
+ */
+struct tcp_ddp_ctx {
+	enum tcp_ddp_type    type;
+	unsigned char        buf[];
+};
+
+static inline struct tcp_ddp_ctx *tcp_ddp_get_ctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return (__force struct tcp_ddp_ctx *)icsk->icsk_ulp_ddp_data;
+}
+
+#endif //_TCP_DDP_H
diff --git a/net/Kconfig b/net/Kconfig
index f4c32d982af6..3876861cdc90 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -457,6 +457,15 @@ config ETHTOOL_NETLINK
 	  netlink. It provides better extensibility and some new features,
 	  e.g. notification messages.
 
+config TCP_DDP
+	bool "TCP direct data placement offload"
+	default n
+	help
+	  Direct Data Placement (DDP) offload for TCP enables ULP, such as
+	  NVMe-TCP/iSCSI, to request the NIC to place TCP payload data
+	  of a command response directly into kernel pages.
+
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ed61eed1195d..75354fb8fe94 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@
 #include <net/xfrm.h>
 #include <net/mpls.h>
 #include <net/mptcp.h>
+#include <net/tcp_ddp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -6135,9 +6136,15 @@ EXPORT_SYMBOL(pskb_extract);
  */
 void skb_condense(struct sk_buff *skb)
 {
+	bool is_ddp = false;
+
+#ifdef CONFIG_TCP_DDP
+	is_ddp = skb->sk && inet_csk(skb->sk) &&
+		 inet_csk(skb->sk)->icsk_ulp_ddp_data;
+#endif
 	if (skb->data_len) {
 		if (skb->data_len > skb->end - skb->tail ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) || is_ddp)
 			return;
 
 		/* Nice, we can free page frag(s) right now */
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 24036e3055a1..a2ff7a4a6bbf 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -68,6 +68,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_HW_TLS_RX_BIT] =	 "tls-hw-rx-offload",
 	[NETIF_F_GRO_FRAGLIST_BIT] =	 "rx-gro-list",
 	[NETIF_F_HW_MACSEC_BIT] =	 "macsec-hw-offload",
+	[NETIF_F_HW_TCP_DDP_BIT] =	 "tcp-ddp-offload",
 };
 
 const char
-- 
2.24.1


* [PATCH v1 net-next 03/15] net: Introduce crc offload for tcp ddp ulp
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock Boris Pismenny
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

This commit introduces support for CRC offload for direct data placement
ULPs on the receive side. Both DDP and CRC share a common API to
initialize the offload for a TCP socket, but otherwise the two can
be executed independently.

On the receive side, CRC offload requires a new SKB bit that
indicates that no CRC error was encountered while processing this packet.
If all packets of a ULP message have this bit set, then the CRC
verification for the message can be skipped, as hardware already checked
it.

The following patches will set and use this bit to perform NVMe-TCP
CRC offload.

A subsequent series will add NVMe-TCP transmit-side CRC support.
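
On the receive path, the driver side of the new bit amounts to marking the
SKB only when the HW verified every CRC it covers; an illustrative sketch
(the CQE flag name is made up, the real code is in the mlx5e patches):

    #ifdef CONFIG_TCP_DDP_CRC
            /* illustrative: set only if HW verified all CRCs in this payload */
            skb->ddp_crc = !!(cqe_flags & MY_NIC_CQE_NVMEOTCP_CRC_OK);
    #endif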

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 include/linux/netdev_features.h | 2 ++
 include/linux/skbuff.h          | 5 +++++
 net/Kconfig                     | 8 ++++++++
 net/ethtool/common.c            | 1 +
 net/ipv4/tcp_input.c            | 7 +++++++
 net/ipv4/tcp_ipv4.c             | 3 +++
 net/ipv4/tcp_offload.c          | 3 +++
 7 files changed, 29 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index fb35dcac03d2..dc79709586cd 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -85,6 +85,7 @@ enum {
 
 	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
 	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
+	NETIF_F_HW_TCP_DDP_CRC_RX_BIT,	/* TCP DDP CRC RX offload */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -159,6 +160,7 @@ enum {
 #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
 #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
 #define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
+#define NETIF_F_HW_TCP_DDP_CRC_RX	__NETIF_F(HW_TCP_DDP_CRC_RX)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0a1239819fd2..c7daf93788e8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -683,6 +683,7 @@ typedef unsigned char *sk_buff_data_t;
  *		CHECKSUM_UNNECESSARY (max 3)
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
+ *	@ddp_crc: NIC is responsible for PDU's CRC computation and verification
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
  *	@secmark: security marking
@@ -858,6 +859,10 @@ struct sk_buff {
 #ifdef CONFIG_TLS_DEVICE
 	__u8			decrypted:1;
 #endif
+#ifdef CONFIG_TCP_DDP_CRC
+	__u8                    ddp_crc:1;
+#endif
+
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/net/Kconfig b/net/Kconfig
index 3876861cdc90..80ed9f038968 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -465,6 +465,14 @@ config TCP_DDP
 	  NVMe-TCP/iSCSI, to request the NIC to place TCP payload data
 	  of a command response directly into kernel pages.
 
+config TCP_DDP_CRC
+	bool "TCP direct data placement CRC offload"
+	default n
+	help
+	  Direct Data Placement (DDP) CRC32C offload for TCP enables ULP, such as
+	  NVMe-TCP/iSCSI, to request the NIC to calculate/verify the data digest
+	  of commands as they go through the NIC. Thus avoiding the costly
+	  per-byte overhead.
 
 endif   # if NET
 
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index a2ff7a4a6bbf..cc6858105449 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -69,6 +69,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_GRO_FRAGLIST_BIT] =	 "rx-gro-list",
 	[NETIF_F_HW_MACSEC_BIT] =	 "macsec-hw-offload",
 	[NETIF_F_HW_TCP_DDP_BIT] =	 "tcp-ddp-offload",
+	[NETIF_F_HW_TCP_DDP_CRC_RX_BIT] =	 "tcp-ddp-crc-rx-offload",
 };
 
 const char
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fb3a7750f623..daa0680b2bc1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5128,6 +5128,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 #ifdef CONFIG_TLS_DEVICE
 		nskb->decrypted = skb->decrypted;
+#endif
+#ifdef CONFIG_TCP_DDP_CRC
+		nskb->ddp_crc = skb->ddp_crc;
 #endif
 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
 		if (list)
@@ -5161,6 +5164,10 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 #ifdef CONFIG_TLS_DEVICE
 				if (skb->decrypted != nskb->decrypted)
 					goto end;
+#endif
+#ifdef CONFIG_TCP_DDP_CRC
+				if (skb->ddp_crc != nskb->ddp_crc)
+					goto end;
 #endif
 			}
 		}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e4b31e70bd30..a12d016ce6c9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1807,6 +1807,9 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
 #ifdef CONFIG_TLS_DEVICE
 	    tail->decrypted != skb->decrypted ||
+#endif
+#ifdef CONFIG_TCP_DDP_CRC
+	    tail->ddp_crc != skb->ddp_crc ||
 #endif
 	    thtail->doff != th->doff ||
 	    memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index e09147ac9a99..39f5f0bcf181 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -262,6 +262,9 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 #ifdef CONFIG_TLS_DEVICE
 	flush |= p->decrypted ^ skb->decrypted;
 #endif
+#ifdef CONFIG_TCP_DDP_CRC
+	flush |= p->ddp_crc ^ skb->ddp_crc;
+#endif
 
 	if (flush || skb_gro_receive(p, skb)) {
 		mss = 1;
-- 
2.24.1


* [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (2 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 03/15] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-09  1:06   ` David Ahern
  2020-12-07 21:06 ` [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path Boris Pismenny
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz

get_netdev_for_sock is a utility that is used to obtain
the net_device structure from a connected socket.

Later patches will use this for nvme-tcp DDP and DDP CRC offloads.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 include/net/sock.h   | 17 +++++++++++++++++
 net/tls/tls_device.c | 20 ++------------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 093b51719c69..a8f7393ea433 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2711,4 +2711,21 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs);
 
 int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len);
 
+/* Assume that the socket is already connected */
+static inline struct net_device *get_netdev_for_sock(struct sock *sk, bool hold)
+{
+	struct dst_entry *dst = sk_dst_get(sk);
+	struct net_device *netdev = NULL;
+
+	if (likely(dst)) {
+		netdev = dst->dev;
+		if (hold)
+			dev_hold(netdev);
+	}
+
+	dst_release(dst);
+
+	return netdev;
+}
+
 #endif	/* _SOCK_H */
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index a3ab2d3d4e4e..8c3bc8705efb 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -106,22 +106,6 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
 	spin_unlock_irqrestore(&tls_device_lock, flags);
 }
 
-/* We assume that the socket is already connected */
-static struct net_device *get_netdev_for_sock(struct sock *sk)
-{
-	struct dst_entry *dst = sk_dst_get(sk);
-	struct net_device *netdev = NULL;
-
-	if (likely(dst)) {
-		netdev = dst->dev;
-		dev_hold(netdev);
-	}
-
-	dst_release(dst);
-
-	return netdev;
-}
-
 static void destroy_record(struct tls_record_info *record)
 {
 	int i;
@@ -1104,7 +1088,7 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
 	if (skb)
 		TCP_SKB_CB(skb)->eor = 1;
 
-	netdev = get_netdev_for_sock(sk);
+	netdev = get_netdev_for_sock(sk, true);
 	if (!netdev) {
 		pr_err_ratelimited("%s: netdev not found\n", __func__);
 		rc = -EINVAL;
@@ -1180,7 +1164,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	if (ctx->crypto_recv.info.version != TLS_1_2_VERSION)
 		return -EOPNOTSUPP;
 
-	netdev = get_netdev_for_sock(sk);
+	netdev = get_netdev_for_sock(sk, true);
 	if (!netdev) {
 		pr_err_ratelimited("%s: netdev not found\n", __func__);
 		return -EINVAL;
-- 
2.24.1


* [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (3 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-10 17:15   ` Shai Malin
  2020-12-07 21:06 ` [PATCH v1 net-next 06/15] nvme-tcp: Add DDP data-path Boris Pismenny
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

This commit introduces direct data placement offload to NVMe-TCP.
There is a context per queue, which is established after the handshake
using the tcp_ddp_sk_add/del NDOs.

Additionally, a resynchronization routine is used to assist
hardware recovery from TCP OOO, and to continue the offload.
Resynchronization operates as follows:
1. TCP OOO causes the NIC HW to stop the offload.
2. The NIC HW identifies a PDU header at some TCP sequence number,
and asks NVMe-TCP to confirm it.
This request is delivered from the NIC driver to NVMe-TCP by first
finding the socket for the packet that triggered the request, and
then finding the nvme_tcp_queue that is used by this routine.
Finally, the request is recorded in the nvme_tcp_queue.
3. When NVMe-TCP observes the requested TCP sequence, it compares
it with the PDU header TCP sequence, and reports the result to the
NIC driver (tcp_ddp_resync), which updates the HW
and resumes the offload when all is successful.

Furthermore, we let the offloading driver advertise its max HW
sectors/segments via tcp_ddp_limits.

A follow-up patch introduces the data-path changes required for this
offload.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 197 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 195 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c0c33320fe65..ef96e4a02bbd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -14,6 +14,7 @@
 #include <linux/blk-mq.h>
 #include <crypto/hash.h>
 #include <net/busy_poll.h>
+#include <net/tcp_ddp.h>
 
 #include "nvme.h"
 #include "fabrics.h"
@@ -62,6 +63,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_OFFLOADS     = 3,
 };
 
 enum nvme_tcp_recv_state {
@@ -110,6 +112,8 @@ struct nvme_tcp_queue {
 	void (*state_change)(struct sock *);
 	void (*data_ready)(struct sock *);
 	void (*write_space)(struct sock *);
+
+	atomic64_t  resync_req;
 };
 
 struct nvme_tcp_ctrl {
@@ -128,6 +132,8 @@ struct nvme_tcp_ctrl {
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
 	u32			io_queues[HCTX_MAX_TYPES];
+
+	struct net_device       *offloading_netdev;
 };
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
@@ -222,6 +228,180 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 	return nvme_tcp_pdu_data_left(req) <= len;
 }
 
+#ifdef CONFIG_TCP_DDP
+
+bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
+	.resync_request		= nvme_tcp_resync_request,
+};
+
+static
+int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
+{
+	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
+	struct nvme_tcp_ddp_config config = {};
+	int ret;
+
+	if (!netdev) {
+		dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
+		return -ENODEV;
+	}
+
+	if (!(netdev->features & NETIF_F_HW_TCP_DDP)) {
+		dev_put(netdev);
+		return -EOPNOTSUPP;
+	}
+
+	config.cfg.type		= TCP_DDP_NVME;
+	config.pfv		= NVME_TCP_PFV_1_0;
+	config.cpda		= 0;
+	config.dgst		= queue->hdr_digest ?
+		NVME_TCP_HDR_DIGEST_ENABLE : 0;
+	config.dgst		|= queue->data_digest ?
+		NVME_TCP_DATA_DIGEST_ENABLE : 0;
+	config.queue_size	= queue->queue_size;
+	config.queue_id		= nvme_tcp_queue_id(queue);
+	config.io_cpu		= queue->io_cpu;
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev,
+						  queue->sock->sk,
+						  (struct tcp_ddp_config *)&config);
+	if (ret) {
+		dev_put(netdev);
+		return ret;
+	}
+
+	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
+	if (netdev->features & NETIF_F_HW_TCP_DDP)
+		set_bit(NVME_TCP_Q_OFFLOADS, &queue->flags);
+
+	return ret;
+}
+
+static
+void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
+{
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+
+	if (!netdev) {
+		dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
+		return;
+	}
+
+	netdev->tcp_ddp_ops->tcp_ddp_sk_del(netdev, queue->sock->sk);
+
+	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
+	dev_put(netdev); /* put the queue_init get_netdev_for_sock() */
+}
+
+static
+int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue)
+{
+	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
+	struct tcp_ddp_limits limits;
+	int ret = 0;
+
+	if (!netdev) {
+		dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n");
+		return -ENODEV;
+	}
+
+	if (netdev->features & NETIF_F_HW_TCP_DDP &&
+	    netdev->tcp_ddp_ops &&
+	    netdev->tcp_ddp_ops->tcp_ddp_limits)
+		ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits);
+	else
+		ret = -EOPNOTSUPP;
+
+	if (!ret) {
+		queue->ctrl->offloading_netdev = netdev;
+		dev_dbg_ratelimited(queue->ctrl->ctrl.device,
+				    "netdev %s offload limits: max_ddp_sgl_len %d\n",
+				    netdev->name, limits.max_ddp_sgl_len);
+		queue->ctrl->ctrl.max_segments = limits.max_ddp_sgl_len;
+		queue->ctrl->ctrl.max_hw_sectors =
+			limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
+	} else {
+		queue->ctrl->offloading_netdev = NULL;
+	}
+
+	dev_put(netdev);
+
+	return ret;
+}
+
+static
+void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
+			      unsigned int pdu_seq)
+{
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	u64 resync_val;
+	u32 resync_seq;
+
+	resync_val = atomic64_read(&queue->resync_req);
+	/* Lower 32 bit flags. Check validity of the request */
+	if ((resync_val & TCP_DDP_RESYNC_REQ) == 0)
+		return;
+
+	/* Obtain and check requested sequence number: is this PDU header before the request? */
+	resync_seq = resync_val >> 32;
+	if (before(pdu_seq, resync_seq))
+		return;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return;
+	}
+
+	/**
+	 * The atomic operation guarantees that we don't miss any NIC driver
+	 * resync requests submitted after the above checks.
+	 */
+	if (atomic64_cmpxchg(&queue->resync_req, resync_val,
+			     resync_val & ~TCP_DDP_RESYNC_REQ))
+		netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk, pdu_seq);
+}
+
+bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
+{
+	struct nvme_tcp_queue *queue = sk->sk_user_data;
+
+	atomic64_set(&queue->resync_req,
+		     (((uint64_t)seq << 32) | flags));
+
+	return true;
+}
+
+#else
+
+static
+int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
+{
+	return -EINVAL;
+}
+
+static
+void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
+{}
+
+static
+int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue)
+{
+	return -EINVAL;
+}
+
+static
+void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
+			      unsigned int pdu_seq)
+{}
+
+bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
+{
+	return false;
+}
+
+#endif
+
 static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
 		unsigned int dir)
 {
@@ -627,6 +807,11 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
 	int ret;
 
+	u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset;
+
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+		nvme_tcp_resync_response(queue, pdu_seq);
+
 	ret = skb_copy_bits(skb, *offset,
 		&pdu[queue->pdu_offset], rcv_len);
 	if (unlikely(ret))
@@ -1517,6 +1702,9 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 	nvme_tcp_restore_sock_calls(queue);
 	cancel_work_sync(&queue->io_work);
+
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+		nvme_tcp_unoffload_socket(queue);
 }
 
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
@@ -1534,10 +1722,13 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	int ret;
 
-	if (idx)
+	if (idx) {
 		ret = nvmf_connect_io_queue(nctrl, idx, false);
-	else
+		nvme_tcp_offload_socket(&ctrl->queues[idx]);
+	} else {
 		ret = nvmf_connect_admin_queue(nctrl);
+		nvme_tcp_offload_limits(&ctrl->queues[idx]);
+	}
 
 	if (!ret) {
 		set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags);
@@ -1640,6 +1831,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 {
 	int ret;
 
+	to_tcp_ctrl(ctrl)->offloading_netdev = NULL;
+
 	ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
 	if (ret)
 		return ret;
-- 
2.24.1


* [PATCH v1 net-next 06/15] nvme-tcp: Add DDP data-path
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (4 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

Introduce the NVMe-TCP DDP data-path offload.
Using this interface, the NIC hardware will scatter TCP payload directly
to the BIO pages according to the command_id in the PDU.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.

The data-path interface contains two routines: tcp_ddp_setup/teardown.
The setup provides the mapping from command_id to the request buffers,
while the teardown removes this mapping.

For efficiency, we introduce an asynchronous nvme completion, which is
split between NVMe-TCP and the NIC driver as follows:
NVMe-TCP performs the NVMe-specific completion, while the NIC driver
performs the generic blk-mq completion.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 119 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index ef96e4a02bbd..534fd5c00f33 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -57,6 +57,11 @@ struct nvme_tcp_request {
 	size_t			offset;
 	size_t			data_sent;
 	enum nvme_tcp_send_state state;
+
+	bool			offloaded;
+	struct tcp_ddp_io	ddp;
+	__le16			status;
+	union nvme_result	result;
 };
 
 enum nvme_tcp_queue_flags {
@@ -231,10 +236,74 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 #ifdef CONFIG_TCP_DDP
 
 bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
 const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
 	.resync_request		= nvme_tcp_resync_request,
+	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
 };
 
+static
+int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+			  u16 command_id,
+			  struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	int ret;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
+						    &req->ddp, rq);
+	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+	req->offloaded = false;
+	return ret;
+}
+
+void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{
+	struct request *rq = ddp_ctx;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (!nvme_try_complete_req(rq, cpu_to_le16(req->status << 1), req->result))
+		nvme_complete_rq(rq);
+}
+
+static
+int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+		       u16 command_id,
+		       struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	int ret;
+
+	req->offloaded = false;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	req->ddp.command_id = command_id;
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
+				     req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
+	if (ret)
+		return -ENOMEM;
+	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
+						 queue->sock->sk,
+						 &req->ddp);
+	if (!ret)
+		req->offloaded = true;
+	return ret;
+}
+
 static
 int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 {
@@ -374,6 +443,25 @@ bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
 
 #else
 
+static
+int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+		       u16 command_id,
+		       struct request *rq)
+{
+	return -EINVAL;
+}
+
+static
+int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+			  u16 command_id,
+			  struct request *rq)
+{
+	return -EINVAL;
+}
+
+void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{}
+
 static
 int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 {
@@ -651,6 +739,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		struct nvme_completion *cqe)
 {
+	struct nvme_tcp_request *req;
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
@@ -662,8 +751,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		return -EINVAL;
 	}
 
-	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
-		nvme_complete_rq(rq);
+	req = blk_mq_rq_to_pdu(rq);
+	if (req->offloaded) {
+		req->status = cqe->status;
+		req->result = cqe->result;
+		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
+	} else {
+		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
+			nvme_complete_rq(rq);
+	}
 	queue->nr_cqe++;
 
 	return 0;
@@ -857,9 +953,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 static inline void nvme_tcp_end_request(struct request *rq, u16 status)
 {
 	union nvme_result res = {};
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 
-	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
-		nvme_complete_rq(rq);
+	if (req->offloaded) {
+		req->status = cpu_to_le16(status << 1);
+		req->result = res;
+		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
+	} else {
+		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
+			nvme_complete_rq(rq);
+	}
 }
 
 static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
@@ -1135,6 +1240,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	bool inline_data = nvme_tcp_has_inline_data(req);
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	int len = sizeof(*pdu) + hdgst - req->offset;
+	struct request *rq = blk_mq_rq_from_pdu(req);
 	int flags = MSG_DONTWAIT;
 	int ret;
 
@@ -1143,6 +1249,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	else
 		flags |= MSG_EOR;
 
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
+	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
+		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
+
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
@@ -2445,6 +2555,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	req->data_len = blk_rq_nr_phys_segments(rq) ?
 				blk_rq_payload_bytes(rq) : 0;
 	req->curr_bio = rq->bio;
+	req->offloaded = false;
 
 	if (rq_data_dir(rq) == WRITE &&
 	    req->data_len <= nvme_tcp_inline_data_size(queue))
-- 
2.24.1


* [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (5 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 06/15] nvme-tcp: Add DDP data-path Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-15 14:07   ` Shai Malin
  2020-12-07 21:06 ` [PATCH v1 net-next 08/15] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Yoray Zack, Ben Ben-Ishay, Or Gerlitz

From: Yoray Zack <yorayz@mellanox.com>

This patch handles CRC offload of the nvme capsule: check whether the
ddp_crc bit is set on all of the capsule's SKBs, and if not, recalculate
the CRC in SW and check it.

This patch reworks the receive-side CRC calculation to always
run at the end, so as to keep a single flow for both offload
and non-offload. This change simplifies the code, but it may degrade
performance for non-offload CRC calculation.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 111 ++++++++++++++++++++++++++++++++--------
 1 file changed, 91 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 534fd5c00f33..3c10c8876036 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -69,6 +69,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
 	NVME_TCP_Q_OFFLOADS     = 3,
+	NVME_TCP_Q_OFF_CRC_RX   = 4,
 };
 
 enum nvme_tcp_recv_state {
@@ -95,6 +96,7 @@ struct nvme_tcp_queue {
 	size_t			data_remaining;
 	size_t			ddgst_remaining;
 	unsigned int		nr_cqe;
+	bool			ddgst_valid;
 
 	/* send state */
 	struct nvme_tcp_request *request;
@@ -233,6 +235,57 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 	return nvme_tcp_pdu_data_left(req) <= len;
 }
 
+static inline bool nvme_tcp_ddp_ddgst_ok(struct nvme_tcp_queue *queue)
+{
+	return queue->ddgst_valid;
+}
+
+static inline void nvme_tcp_ddp_ddgst_update(struct nvme_tcp_queue *queue,
+						struct sk_buff *skb)
+{
+	if (queue->ddgst_valid)
+#ifdef CONFIG_TCP_DDP_CRC
+		queue->ddgst_valid = skb->ddp_crc;
+#else
+		queue->ddgst_valid = false;
+#endif
+}
+
+
+static int nvme_tcp_req_map_sg(struct nvme_tcp_request *req, struct request *rq)
+{
+	int ret;
+
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
+				     req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
+	if (ret)
+		return -ENOMEM;
+	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
+	return 0;
+}
+
+static void nvme_tcp_ddp_ddgst_recalc(struct ahash_request *hash,
+				      struct request *rq)
+{
+	struct nvme_tcp_request *req;
+
+	if (!rq)
+		return;
+
+	req = blk_mq_rq_to_pdu(rq);
+
+	if (!req->offloaded && nvme_tcp_req_map_sg(req, rq))
+		return;
+
+	crypto_ahash_init(hash);
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ahash_request_set_crypt(hash, req->ddp.sg_table.sgl, NULL,
+				le32_to_cpu(req->data_len));
+	crypto_ahash_update(hash);
+}
+
+
 #ifdef CONFIG_TCP_DDP
 
 bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
@@ -289,12 +342,9 @@ int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
 	}
 
 	req->ddp.command_id = command_id;
-	req->ddp.sg_table.sgl = req->ddp.first_sgl;
-	ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
-				     req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
+	ret = nvme_tcp_req_map_sg(req, rq);
 	if (ret)
 		return -ENOMEM;
-	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
 
 	ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
 						 queue->sock->sk,
@@ -316,7 +366,7 @@ int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 		return -ENODEV;
 	}
 
-	if (!(netdev->features & NETIF_F_HW_TCP_DDP)) {
+	if (!(netdev->features & (NETIF_F_HW_TCP_DDP | NETIF_F_HW_TCP_DDP_CRC_RX))) {
 		dev_put(netdev);
 		return -EOPNOTSUPP;
 	}
@@ -344,6 +394,9 @@ int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 	if (netdev->features & NETIF_F_HW_TCP_DDP)
 		set_bit(NVME_TCP_Q_OFFLOADS, &queue->flags);
 
+	if (netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX)
+		set_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags);
+
 	return ret;
 }
 
@@ -375,7 +428,7 @@ int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue)
 		return -ENODEV;
 	}
 
-	if (netdev->features & NETIF_F_HW_TCP_DDP &&
+	if ((netdev->features & (NETIF_F_HW_TCP_DDP | NETIF_F_HW_TCP_DDP_CRC_RX)) &&
 	    netdev->tcp_ddp_ops &&
 	    netdev->tcp_ddp_ops->tcp_ddp_limits)
 		ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits);
@@ -725,6 +778,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
 	queue->pdu_offset = 0;
 	queue->data_remaining = -1;
 	queue->ddgst_remaining = 0;
+	queue->ddgst_valid = true;
 }
 
 static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
@@ -905,7 +959,7 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 
 	u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset;
 
-	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) || test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags))
 		nvme_tcp_resync_response(queue, pdu_seq);
 
 	ret = skb_copy_bits(skb, *offset,
@@ -974,6 +1028,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct nvme_tcp_request *req;
 	struct request *rq;
 
+	if (queue->data_digest && test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags))
+		nvme_tcp_ddp_ddgst_update(queue, skb);
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
@@ -1011,7 +1067,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 		recv_len = min_t(size_t, recv_len,
 				iov_iter_count(&req->iter));
 
-		if (queue->data_digest)
+		if (queue->data_digest && !test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags))
 			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
 				&req->iter, recv_len, queue->rcv_hash);
 		else
@@ -1031,7 +1087,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 
 	if (!queue->data_remaining) {
 		if (queue->data_digest) {
-			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
 			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
 		} else {
 			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
@@ -1052,8 +1107,12 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	char *ddgst = (char *)&queue->recv_ddgst;
 	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
 	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
+	bool offload_fail, offload_en;
+	struct request *rq = NULL;
 	int ret;
 
+	if (test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags))
+		nvme_tcp_ddp_ddgst_update(queue, skb);
 	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
 	if (unlikely(ret))
 		return ret;
@@ -1064,17 +1123,29 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	if (queue->ddgst_remaining)
 		return 0;
 
-	if (queue->recv_ddgst != queue->exp_ddgst) {
-		dev_err(queue->ctrl->ctrl.device,
-			"data digest error: recv %#x expected %#x\n",
-			le32_to_cpu(queue->recv_ddgst),
-			le32_to_cpu(queue->exp_ddgst));
-		return -EIO;
+	offload_fail = !nvme_tcp_ddp_ddgst_ok(queue);
+	offload_en = test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags);
+	if (!offload_en || offload_fail) {
+		if (offload_en && offload_fail) { // software-fallback
+			rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
+					      pdu->command_id);
+			nvme_tcp_ddp_ddgst_recalc(queue->rcv_hash, rq);
+		}
+
+		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+		if (queue->recv_ddgst != queue->exp_ddgst) {
+			dev_err(queue->ctrl->ctrl.device,
+				"data digest error: recv %#x expected %#x\n",
+				le32_to_cpu(queue->recv_ddgst),
+				le32_to_cpu(queue->exp_ddgst));
+			return -EIO;
+		}
 	}
 
 	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
-		struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
-						pdu->command_id);
+		if (!rq)
+			rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
+					      pdu->command_id);
 
 		nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
 		queue->nr_cqe++;
@@ -1813,8 +1884,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	nvme_tcp_restore_sock_calls(queue);
 	cancel_work_sync(&queue->io_work);
 
-	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) ||
+	    test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags))
 		nvme_tcp_unoffload_socket(queue);
+
 }
 
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
@@ -1941,8 +2014,6 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 {
 	int ret;
 
-	to_tcp_ctrl(ctrl)->offloading_netdev = NULL;
-
 	ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
 	if (ret)
 		return ret;
-- 
2.24.1



* [PATCH v1 net-next 08/15] nvme-tcp: Deal with netdevice DOWN events
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (6 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 09/15] net/mlx5: Header file changes for nvme-tcp offload Boris Pismenny
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Or Gerlitz, Ben Ben-Ishay, Yoray Zack

From: Or Gerlitz <ogerlitz@mellanox.com>

For ddp setup/teardown and resync, the offloading logic
uses HW resources in the NIC driver, such as the SQ and CQ.

These resources are destroyed when the netdevice goes down,
hence we must stop using them before the NIC driver
destroys them.

Use a netdevice notifier for this purpose -- offloaded
connections are stopped before the stack goes on to call
the NIC driver close ndo.

We use the existing error recovery flow, which has the
advantage of resuming the offload once the connection is
re-established.

This also buys us proper handling of the UNREGISTER event,
because our offloading starts in the UP state, and DOWN
always occurs between UP and UNREGISTER.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3c10c8876036..35c898fc833f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -145,6 +145,7 @@ struct nvme_tcp_ctrl {
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
 static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct notifier_block nvme_tcp_netdevice_nb;
 static struct workqueue_struct *nvme_tcp_wq;
 static const struct blk_mq_ops nvme_tcp_mq_ops;
 static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
@@ -2900,6 +2901,27 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 	return ERR_PTR(ret);
 }
 
+static int nvme_tcp_netdev_event(struct notifier_block *this,
+				 unsigned long event, void *ptr)
+{
+	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+	struct nvme_tcp_ctrl *ctrl;
+
+	switch (event) {
+	case NETDEV_GOING_DOWN:
+		mutex_lock(&nvme_tcp_ctrl_mutex);
+		list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+			if (ndev != ctrl->offloading_netdev)
+				continue;
+			nvme_tcp_error_recovery(&ctrl->ctrl);
+		}
+		mutex_unlock(&nvme_tcp_ctrl_mutex);
+		flush_workqueue(nvme_reset_wq);
+		/* we assume that the going down part of error recovery is over */
+	}
+	return NOTIFY_DONE;
+}
+
 static struct nvmf_transport_ops nvme_tcp_transport = {
 	.name		= "tcp",
 	.module		= THIS_MODULE,
@@ -2914,13 +2936,26 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 
 static int __init nvme_tcp_init_module(void)
 {
+	int ret;
+
 	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
 			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 	if (!nvme_tcp_wq)
 		return -ENOMEM;
 
+	nvme_tcp_netdevice_nb.notifier_call = nvme_tcp_netdev_event;
+	ret = register_netdevice_notifier(&nvme_tcp_netdevice_nb);
+	if (ret) {
+		pr_err("failed to register netdev notifier\n");
+		goto out_err_reg_notifier;
+	}
+
 	nvmf_register_transport(&nvme_tcp_transport);
 	return 0;
+
+out_err_reg_notifier:
+	destroy_workqueue(nvme_tcp_wq);
+	return ret;
 }
 
 static void __exit nvme_tcp_cleanup_module(void)
@@ -2928,6 +2963,7 @@ static void __exit nvme_tcp_cleanup_module(void)
 	struct nvme_tcp_ctrl *ctrl;
 
 	nvmf_unregister_transport(&nvme_tcp_transport);
+	unregister_netdevice_notifier(&nvme_tcp_netdevice_nb);
 
 	mutex_lock(&nvme_tcp_ctrl_mutex);
 	list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
-- 
2.24.1



* [PATCH v1 net-next 09/15] net/mlx5: Header file changes for nvme-tcp offload
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (7 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 08/15] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 10/15] net/mlx5: Add 128B CQE for NVMEoTCP offload Boris Pismenny
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Or Gerlitz, Yoray Zack

From: Ben Ben-ishay <benishay@nvidia.com>

Add the necessary infrastructure for NVMEoTCP offload:

- Add the nvmeotcp_zero_copy_en + nvmeotcp_crc_en bits to the TIR, to identify
  an NVMEoTCP offload flow, and an nvmeotcp_tag_buffer_table_id that will be
  used by the connected nvmeotcp_queues
- Add a new CQE field that will be used to pass scattered data information to SW
- Add a new capability to HCA_CAP that represents the NVMEoTCP offload ability
  (a short usage sketch follows below)
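
For illustration only, probing these new bits from a driver could look
roughly like the sketch below; the wrapper function is hypothetical and
assumes the usual mlx5 driver headers, only the field and macro names
come from this patch:

	/* illustrative helper, not part of this patch */
	static bool mlx5e_nvmeotcp_cap_supported(struct mlx5_core_dev *mdev)
	{
		if (!MLX5_CAP_GEN(mdev, nvmeotcp))
			return false;

		/* the per-feature bits live in the new nvmeotcp_cap page */
		return MLX5_CAP_DEV_NVMEOTCP(mdev, zerocopy) ||
		       MLX5_CAP_DEV_NVMEOTCP(mdev, crc_rx);
	}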

Signed-off-by: Ben Ben-ishay <benishay@nvidia.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/mlx5/device.h   |   8 +++
 include/linux/mlx5/mlx5_ifc.h | 104 +++++++++++++++++++++++++++++++++-
 include/linux/mlx5/qp.h       |   1 +
 3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 00057eae89ab..4433ff213d32 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -263,6 +263,7 @@ enum {
 enum {
 	MLX5_MKEY_MASK_LEN		= 1ull << 0,
 	MLX5_MKEY_MASK_PAGE_SIZE	= 1ull << 1,
+	MLX5_MKEY_MASK_XLT_OCT_SIZE     = 1ull << 2,
 	MLX5_MKEY_MASK_START_ADDR	= 1ull << 6,
 	MLX5_MKEY_MASK_PD		= 1ull << 7,
 	MLX5_MKEY_MASK_EN_RINVAL	= 1ull << 8,
@@ -1171,6 +1172,7 @@ enum mlx5_cap_type {
 	MLX5_CAP_VDPA_EMULATION = 0x13,
 	MLX5_CAP_DEV_EVENT = 0x14,
 	MLX5_CAP_IPSEC,
+	MLX5_CAP_DEV_NVMEOTCP = 0x19,
 	/* NUM OF CAP Types */
 	MLX5_CAP_NUM
 };
@@ -1391,6 +1393,12 @@ enum mlx5_qcam_feature_groups {
 #define MLX5_CAP_IPSEC(mdev, cap)\
 	MLX5_GET(ipsec_cap, (mdev)->caps.hca_cur[MLX5_CAP_IPSEC], cap)
 
+#define MLX5_CAP_DEV_NVMEOTCP(mdev, cap)\
+	MLX5_GET(nvmeotcp_cap, mdev->caps.hca_cur[MLX5_CAP_DEV_NVMEOTCP], cap)
+
+#define MLX5_CAP64_NVMEOTCP(mdev, cap)\
+	MLX5_GET64(nvmeotcp_cap, mdev->caps.hca_cur[MLX5_CAP_DEV_NVMEOTCP], cap)
+
 enum {
 	MLX5_CMD_STAT_OK			= 0x0,
 	MLX5_CMD_STAT_INT_ERR			= 0x1,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index a3510e81ab3b..e2b75c580c37 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1271,7 +1271,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         log_max_srq_sz[0x8];
 	u8         log_max_qp_sz[0x8];
 	u8         event_cap[0x1];
-	u8         reserved_at_91[0x7];
+	u8         reserved_at_91[0x5];
+	u8         nvmeotcp[0x1];
+	u8         reserved_at_97[0x1];
 	u8         prio_tag_required[0x1];
 	u8         reserved_at_99[0x2];
 	u8         log_max_qp[0x5];
@@ -3020,6 +3022,21 @@ struct mlx5_ifc_roce_addr_layout_bits {
 	u8         reserved_at_e0[0x20];
 };
 
+struct mlx5_ifc_nvmeotcp_cap_bits {
+	u8    zerocopy[0x1];
+	u8    crc_rx[0x1];
+	u8    crc_tx[0x1];
+	u8    reserved_at_3[0x15];
+	u8    version[0x8];
+
+	u8    reserved_at_20[0x13];
+	u8    log_max_nvmeotcp_tag_buffer_table[0x5];
+	u8    reserved_at_38[0x3];
+	u8    log_max_nvmeotcp_tag_buffer_size[0x5];
+
+	u8    reserved_at_40[0x7c0];
+};
+
 union mlx5_ifc_hca_cap_union_bits {
 	struct mlx5_ifc_cmd_hca_cap_bits cmd_hca_cap;
 	struct mlx5_ifc_odp_cap_bits odp_cap;
@@ -3036,6 +3053,7 @@ union mlx5_ifc_hca_cap_union_bits {
 	struct mlx5_ifc_tls_cap_bits tls_cap;
 	struct mlx5_ifc_device_mem_cap_bits device_mem_cap;
 	struct mlx5_ifc_virtio_emulation_cap_bits virtio_emulation_cap;
+	struct mlx5_ifc_nvmeotcp_cap_bits nvmeotcp_cap;
 	u8         reserved_at_0[0x8000];
 };
 
@@ -3230,7 +3248,9 @@ struct mlx5_ifc_tirc_bits {
 
 	u8         disp_type[0x4];
 	u8         tls_en[0x1];
-	u8         reserved_at_25[0x1b];
+	u8         nvmeotcp_zero_copy_en[0x1];
+	u8         nvmeotcp_crc_en[0x1];
+	u8         reserved_at_27[0x19];
 
 	u8         reserved_at_40[0x40];
 
@@ -3261,7 +3281,8 @@ struct mlx5_ifc_tirc_bits {
 
 	struct mlx5_ifc_rx_hash_field_select_bits rx_hash_field_selector_inner;
 
-	u8         reserved_at_2c0[0x4c0];
+	u8         nvmeotcp_tag_buffer_table_id[0x20];
+	u8         reserved_at_2e0[0x4a0];
 };
 
 enum {
@@ -10716,12 +10737,14 @@ enum {
 	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = BIT(0xc),
 	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
 	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE = BIT(0x21),
 };
 
 enum {
 	MLX5_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 0xc,
 	MLX5_GENERAL_OBJECT_TYPES_IPSEC = 0x13,
 	MLX5_GENERAL_OBJECT_TYPES_SAMPLER = 0x20,
+	MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE = 0x21
 };
 
 enum {
@@ -10823,6 +10846,20 @@ struct mlx5_ifc_create_sampler_obj_in_bits {
 	struct mlx5_ifc_sampler_obj_bits sampler_object;
 };
 
+struct mlx5_ifc_nvmeotcp_tag_buf_table_obj_bits {
+	u8    modify_field_select[0x40];
+
+	u8    reserved_at_20[0x20];
+
+	u8    reserved_at_40[0x1b];
+	u8    log_tag_buffer_table_size[0x5];
+};
+
+struct mlx5_ifc_create_nvmeotcp_tag_buf_table_in_bits {
+	struct mlx5_ifc_general_obj_in_cmd_hdr_bits general_obj_in_cmd_hdr;
+	struct mlx5_ifc_nvmeotcp_tag_buf_table_obj_bits nvmeotcp_tag_buf_table_obj;
+};
+
 enum {
 	MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_KEY_SIZE_128 = 0x0,
 	MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_KEY_SIZE_256 = 0x1,
@@ -10833,6 +10870,18 @@ enum {
 	MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_TYPE_IPSEC = 0x2,
 };
 
+enum {
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_XTS               = 0x0,
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_TLS               = 0x1,
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP           = 0x2,
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP_WITH_TLS  = 0x3,
+};
+
+enum {
+	MLX5_TRANSPORT_STATIC_PARAMS_TI_INITIATOR  = 0x0,
+	MLX5_TRANSPORT_STATIC_PARAMS_TI_TARGET     = 0x1,
+};
+
 struct mlx5_ifc_tls_static_params_bits {
 	u8         const_2[0x2];
 	u8         tls_version[0x4];
@@ -10873,4 +10922,53 @@ enum {
 	MLX5_MTT_PERM_RW	= MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE,
 };
 
+struct mlx5_ifc_nvmeotcp_progress_params_bits {
+	u8    valid[0x1];
+	u8    reserved_at_1[0x7];
+	u8    pd[0x18];
+
+	u8    next_pdu_tcp_sn[0x20];
+
+	u8    hw_resync_tcp_sn[0x20];
+
+	u8    pdu_tracker_state[0x2];
+	u8    offloading_state[0x2];
+	u8    reserved_at_64[0xc];
+	u8    cccid_ttag[0x10];
+};
+
+struct mlx5_ifc_transport_static_params_bits {
+	u8    const_2[0x2];
+	u8    tls_version[0x4];
+	u8    const_1[0x2];
+	u8    reserved_at_8[0x14];
+	u8    acc_type[0x4];
+
+	u8    reserved_at_20[0x20];
+
+	u8    initial_record_number[0x40];
+
+	u8    resync_tcp_sn[0x20];
+
+	u8    gcm_iv[0x20];
+
+	u8    implicit_iv[0x40];
+
+	u8    reserved_at_100[0x8];
+	u8    dek_index[0x18];
+
+	u8    reserved_at_120[0x15];
+	u8    ti[0x1];
+	u8    zero_copy_en[0x1];
+	u8    ddgst_offload_en[0x1];
+	u8    hdgst_offload_en[0x1];
+	u8    ddgst_en[0x1];
+	u8    hddgst_en[0x1];
+	u8    pda[0x5];
+
+	u8    nvme_resync_tcp_sn[0x20];
+
+	u8    reserved_at_160[0xa0];
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index d75ef8aa8fac..5fa8b82c9edb 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -220,6 +220,7 @@ struct mlx5_wqe_ctrl_seg {
 #define MLX5_WQE_CTRL_OPCODE_MASK 0xff
 #define MLX5_WQE_CTRL_WQE_INDEX_MASK 0x00ffff00
 #define MLX5_WQE_CTRL_WQE_INDEX_SHIFT 8
+#define MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT 8
 
 enum {
 	MLX5_ETH_WQE_L3_INNER_CSUM      = 1 << 4,
-- 
2.24.1



* [PATCH v1 net-next 10/15] net/mlx5: Add 128B CQE for NVMEoTCP offload
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (8 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 09/15] net/mlx5: Header file changes for nvme-tcp offload Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 11/15] net/mlx5e: TCP flow steering for nvme-tcp Boris Pismenny
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Or Gerlitz, Yoray Zack

From: Ben Ben-ishay <benishay@nvidia.com>

Add the NVMEoTCP offload definitions and access functions for 128B CQEs.
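
As a rough illustration (not part of this patch), an RX completion
handler could consume these bits as sketched below; the handler name is
hypothetical and setting skb->ddp_crc is an assumption based on the rest
of the series:

	/* illustrative sketch only */
	static void mlx5e_rx_handle_nvmeotcp_cqe(struct mlx5_cqe64 *cqe,
						 struct sk_buff *skb)
	{
		struct mlx5e_cqe128 *cqe128;

		if (cqe_is_nvmeotcp_crcvalid(cqe))
			skb->ddp_crc = 1;	/* digest verified by HW */

		if (!cqe_is_nvmeotcp_zc_or_resync(cqe))
			return;

		/* the first 64B of the 128B CQE carry either the placement
		 * info (ccid/ccoff/cclen) or the resync TCP sequence number
		 */
		cqe128 = container_of(cqe, struct mlx5e_cqe128, cqe64);
		if (cqe_is_nvmeotcp_resync(cqe))
			pr_debug("resync requested at sn %u\n",
				 be32_to_cpu(cqe128->resync_tcp_sn));
		else
			pr_debug("ddp: ccid %u ccoff %u cclen %u\n",
				 be16_to_cpu(cqe128->ccid),
				 be32_to_cpu(cqe128->ccoff),
				 be16_to_cpu(cqe128->cclen));
	}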

Signed-off-by: Ben Ben-ishay <benishay@nvidia.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/mlx5/device.h | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 4433ff213d32..ea4d158e8329 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -791,7 +791,7 @@ struct mlx5_err_cqe {
 
 struct mlx5_cqe64 {
 	u8		tls_outer_l3_tunneled;
-	u8		rsvd0;
+	u8		nvmetcp;
 	__be16		wqe_id;
 	u8		lro_tcppsh_abort_dupack;
 	u8		lro_min_ttl;
@@ -824,6 +824,19 @@ struct mlx5_cqe64 {
 	u8		op_own;
 };
 
+struct mlx5e_cqe128 {
+	__be16		cclen;
+	__be16		hlen;
+	union {
+		__be32		resync_tcp_sn;
+		__be32		ccoff;
+	};
+	__be16		ccid;
+	__be16		rsvd8;
+	u8		rsvd12[52];
+	struct mlx5_cqe64 cqe64;
+};
+
 struct mlx5_mini_cqe8 {
 	union {
 		__be32 rx_hash_result;
@@ -854,6 +867,26 @@ enum {
 
 #define MLX5_MINI_CQE_ARRAY_SIZE 8
 
+static inline bool cqe_is_nvmeotcp_resync(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 6) & 0x1);
+}
+
+static inline bool cqe_is_nvmeotcp_crcvalid(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 5) & 0x1);
+}
+
+static inline bool cqe_is_nvmeotcp_zc(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 4) & 0x1);
+}
+
+static inline bool cqe_is_nvmeotcp_zc_or_resync(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 4) & 0x5);
+}
+
 static inline u8 mlx5_get_cqe_format(struct mlx5_cqe64 *cqe)
 {
 	return (cqe->op_own >> 2) & 0x3;
-- 
2.24.1



* [PATCH v1 net-next 11/15] net/mlx5e: TCP flow steering for nvme-tcp
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (9 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 10/15] net/mlx5: Add 128B CQE for NVMEoTCP offload Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 12/15] net/mlx5e: NVMEoTCP DDP offload control path Boris Pismenny
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

Both the nvme-tcp and TLS offloads require TCP flow steering, so compile
it in when either of them is enabled. Additionally, use reference
counting so that the TCP flow steering tables are allocated by the first
user and freed only by the last one (see the usage sketch below).
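
A minimal usage sketch of the resulting lifetime rules (the call order
is illustrative and error handling is omitted; only the create/destroy
helpers come from this patch):

	/* first user (e.g. TLS RX offload) allocates the tables, ref = 1 */
	mlx5e_accel_fs_tcp_create(priv);

	/* second user (e.g. NVMEoTCP) only takes a reference, ref = 2 */
	mlx5e_accel_fs_tcp_create(priv);

	mlx5e_accel_fs_tcp_destroy(priv);	/* ref = 1, tables are kept  */
	mlx5e_accel_fs_tcp_destroy(priv);	/* ref = 0, tables are freed */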

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/fs.h        |  4 ++--
 .../net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c  | 10 ++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h  |  2 +-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
index a16297e7e2ac..a7fe3a6358ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
@@ -137,7 +137,7 @@ enum {
 	MLX5E_L2_FT_LEVEL,
 	MLX5E_TTC_FT_LEVEL,
 	MLX5E_INNER_TTC_FT_LEVEL,
-#ifdef CONFIG_MLX5_EN_TLS
+#if defined(CONFIG_MLX5_EN_TLS) || defined(CONFIG_MLX5_EN_NVMEOTCP)
 	MLX5E_ACCEL_FS_TCP_FT_LEVEL,
 #endif
 #ifdef CONFIG_MLX5_EN_ARFS
@@ -256,7 +256,7 @@ struct mlx5e_flow_steering {
 #ifdef CONFIG_MLX5_EN_ARFS
 	struct mlx5e_arfs_tables        arfs;
 #endif
-#ifdef CONFIG_MLX5_EN_TLS
+#if defined(CONFIG_MLX5_EN_TLS) || defined(CONFIG_MLX5_EN_NVMEOTCP)
 	struct mlx5e_accel_fs_tcp      *accel_tcp;
 #endif
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
index 97f1594cee11..feded6c8cca1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
@@ -14,6 +14,7 @@ enum accel_fs_tcp_type {
 struct mlx5e_accel_fs_tcp {
 	struct mlx5e_flow_table tables[ACCEL_FS_TCP_NUM_TYPES];
 	struct mlx5_flow_handle *default_rules[ACCEL_FS_TCP_NUM_TYPES];
+	refcount_t		ref_count;
 };
 
 static enum mlx5e_traffic_types fs_accel2tt(enum accel_fs_tcp_type i)
@@ -335,6 +336,7 @@ static int accel_fs_tcp_enable(struct mlx5e_priv *priv)
 			return err;
 		}
 	}
+	refcount_set(&priv->fs.accel_tcp->ref_count, 1);
 	return 0;
 }
 
@@ -358,6 +360,9 @@ void mlx5e_accel_fs_tcp_destroy(struct mlx5e_priv *priv)
 	if (!priv->fs.accel_tcp)
 		return;
 
+	if (!refcount_dec_and_test(&priv->fs.accel_tcp->ref_count))
+		return;
+
 	accel_fs_tcp_disable(priv);
 
 	for (i = 0; i < ACCEL_FS_TCP_NUM_TYPES; i++)
@@ -374,6 +379,11 @@ int mlx5e_accel_fs_tcp_create(struct mlx5e_priv *priv)
 	if (!MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version))
 		return -EOPNOTSUPP;
 
+	if (priv->fs.accel_tcp) {
+		refcount_inc(&priv->fs.accel_tcp->ref_count);
+		return 0;
+	}
+
 	priv->fs.accel_tcp = kzalloc(sizeof(*priv->fs.accel_tcp), GFP_KERNEL);
 	if (!priv->fs.accel_tcp)
 		return -ENOMEM;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h
index 589235824543..8aff9298183c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h
@@ -6,7 +6,7 @@
 
 #include "en.h"
 
-#ifdef CONFIG_MLX5_EN_TLS
+#if defined(CONFIG_MLX5_EN_TLS) || defined(CONFIG_MLX5_EN_NVMEOTCP)
 int mlx5e_accel_fs_tcp_create(struct mlx5e_priv *priv);
 void mlx5e_accel_fs_tcp_destroy(struct mlx5e_priv *priv);
 struct mlx5_flow_handle *mlx5e_accel_fs_add_sk(struct mlx5e_priv *priv,
-- 
2.24.1



* [PATCH v1 net-next 12/15] net/mlx5e: NVMEoTCP DDP offload control path
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (10 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 11/15] net/mlx5e: TCP flow steering for nvme-tcp Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

From: Ben Ben-ishay <benishay@nvidia.com>

This commit introduces direct data placement offload for NVMe-TCP.
There is a context per queue, which is established after the handshake
using the tcp_ddp_sk_add/del NDOs.

Additionally, a resynchronization routine is used to assist hardware
recovery from TCP OOO, and to continue the offload afterwards.
Resynchronization operates as follows (see the sketch below):
1. TCP OOO causes the NIC HW to stop the offload.
2. The NIC HW identifies a PDU header at some TCP sequence number and
asks NVMe-TCP to confirm it. This request is delivered from the NIC
driver to NVMe-TCP by first finding the socket for the packet that
triggered the request, and then finding the nvme_tcp_queue that serves
this socket. Finally, the request is recorded in the nvme_tcp_queue.
3. When NVMe-TCP observes the requested TCP sequence, it compares it
with the PDU header TCP sequence and reports the result to the NIC
driver (tcp_ddp_resync), which updates the HW and resumes the offload
once all is successful.

Furthermore, we let the offloading driver advertise its maximum HW
sectors/segments via tcp_ddp_limits.

A follow-up patch introduces the data-path changes required for this
offload.
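
To make the resync hand-off above concrete, here is a rough sketch of
the two call directions. The surrounding code and variable names are
illustrative; the tcp_ddp_resync dev op comes from this series, and the
ULP-side callback is assumed to be the resync_request member of
tcp_ddp_ulp_ops:

	/* NIC driver RX path (illustrative): HW lost sync and found a
	 * candidate PDU header at TCP sequence number 'seq' -- ask the
	 * ULP (nvme-tcp) to confirm it.
	 */
	const struct tcp_ddp_ulp_ops *ulp_ops =
		inet_csk(sk)->icsk_ulp_ddp_ops;

	if (ulp_ops && ulp_ops->resync_request)
		ulp_ops->resync_request(sk, seq, flags);

	/* nvme-tcp side (illustrative): once the PDU starting at the
	 * requested sequence number is observed, confirm it back to the
	 * driver so that HW can resume the offload.
	 */
	if (pdu_seq == resync_req_seq)
		netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk,
						    pdu_seq);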

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  30 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |   1 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  13 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |   9 +-
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 984 ++++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    | 116 +++
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |  80 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  39 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  25 +-
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |  16 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +
 13 files changed, 1327 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 485478979b1a..95c8c1980c96 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -202,3 +202,14 @@ config MLX5_SW_STEERING
 	default y
 	help
 	Build support for software-managed steering in the NIC.
+
+config MLX5_EN_NVMEOTCP
+	bool "NVMEoTCP acceleration"
+	depends on MLX5_CORE_EN
+	depends on TCP_DDP
+	depends on TCP_DDP_CRC
+	default y
+	help
+	Build support for NVMEoTCP acceleration in the NIC.
+	Note: Support for hardware with this capability needs to be selected
+	for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ac7793057658..053655a96db8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -87,3 +87,5 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_ste_v0.o \
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
+
+mlx5_core-$(CONFIG_MLX5_EN_NVMEOTCP) += en_accel/fs_tcp.o en_accel/nvmeotcp.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0da6ed47a571..8e257749018a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -152,6 +152,24 @@ struct page_pool;
 #define MLX5E_UMR_WQEBBS \
 	(DIV_ROUND_UP(MLX5E_UMR_WQE_INLINE_SZ, MLX5_SEND_WQE_BB))
 
+#define KLM_ALIGNMENT 4
+#define MLX5E_KLM_UMR_WQE_SZ(sgl_len)\
+	(sizeof(struct mlx5e_umr_wqe) +\
+	(sizeof(struct mlx5_klm) * (sgl_len)))
+
+#define MLX5E_KLM_UMR_WQEBBS(sgl_len)\
+	(DIV_ROUND_UP(MLX5E_KLM_UMR_WQE_SZ(sgl_len), MLX5_SEND_WQE_BB))
+
+#define MLX5E_KLM_UMR_DS_CNT(sgl_len)\
+	DIV_ROUND_UP(MLX5E_KLM_UMR_WQE_SZ(sgl_len), MLX5_SEND_WQE_DS)
+
+#define MLX5E_MAX_KLM_ENTRIES_PER_WQE(wqe_size)\
+	(((wqe_size) - sizeof(struct mlx5e_umr_wqe)) / sizeof(struct mlx5_klm))
+
+#define MLX5E_KLM_ENTRIES_PER_WQE(wqe_size)\
+	(MLX5E_MAX_KLM_ENTRIES_PER_WQE(wqe_size) -\
+			(MLX5E_MAX_KLM_ENTRIES_PER_WQE(wqe_size) % KLM_ALIGNMENT))
+
 #define MLX5E_MSG_LEVEL			NETIF_MSG_LINK
 
 #define mlx5e_dbg(mlevel, priv, format, ...)                    \
@@ -214,7 +232,10 @@ struct mlx5e_umr_wqe {
 	struct mlx5_wqe_ctrl_seg       ctrl;
 	struct mlx5_wqe_umr_ctrl_seg   uctrl;
 	struct mlx5_mkey_seg           mkc;
-	struct mlx5_mtt                inline_mtts[0];
+	union {
+		struct mlx5_mtt        inline_mtts[0];
+		struct mlx5_klm	       inline_klms[0];
+	};
 };
 
 extern const char mlx5e_self_tests[][ETH_GSTRING_LEN];
@@ -664,6 +685,10 @@ struct mlx5e_channel {
 	struct mlx5e_xdpsq         rq_xdpsq;
 	struct mlx5e_txqsq         sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_icosq         icosq;   /* internal control operations */
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	struct list_head	   list_nvmeotcpsq;   /* nvmeotcp umrs  */
+	spinlock_t                 nvmeotcp_icosq_lock;
+#endif
 	bool                       xdp;
 	struct napi_struct         napi;
 	struct device             *pdev;
@@ -856,6 +881,9 @@ struct mlx5e_priv {
 #endif
 #ifdef CONFIG_MLX5_EN_TLS
 	struct mlx5e_tls          *tls;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	struct mlx5e_nvmeotcp      *nvmeotcp;
 #endif
 	struct devlink_health_reporter *tx_reporter;
 	struct devlink_health_reporter *rx_reporter;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index 807147d97a0f..20e9e5e81ae7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -16,6 +16,7 @@ struct mlx5e_cq_param {
 	struct mlx5_wq_param       wq;
 	u16                        eq_ix;
 	u8                         cq_period_mode;
+	bool                       force_cqe128;
 };
 
 struct mlx5e_rq_param {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 7943eb30b837..eb929edabd6b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -34,6 +34,11 @@ enum mlx5e_icosq_wqe_type {
 	MLX5E_ICOSQ_WQE_SET_PSV_TLS,
 	MLX5E_ICOSQ_WQE_GET_PSV_TLS,
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	MLX5E_ICOSQ_WQE_UMR_NVME_TCP,
+	MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE,
+	MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP,
+#endif
 };
 
 /* General */
@@ -175,6 +180,14 @@ struct mlx5e_icosq_wqe_info {
 		struct {
 			struct mlx5e_ktls_rx_resync_buf *buf;
 		} tls_get_params;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+		struct {
+			struct mlx5e_nvmeotcp_queue *queue;
+		} nvmeotcp_q;
+		struct {
+			struct nvmeotcp_queue_entry *entry;
+		} nvmeotcp_qe;
 #endif
 	};
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index fb89b24deb2b..98728f7404ec 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -39,6 +39,7 @@
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls.h"
 #include "en_accel/tls_rxtx.h"
+#include "en_accel/nvmeotcp.h"
 #include "en.h"
 #include "en/txrx.h"
 
@@ -196,11 +197,17 @@ static inline void mlx5e_accel_tx_finish(struct mlx5e_txqsq *sq,
 
 static inline int mlx5e_accel_init_rx(struct mlx5e_priv *priv)
 {
-	return mlx5e_ktls_init_rx(priv);
+	int tls, nvmeotcp;
+
+	tls = mlx5e_ktls_init_rx(priv);
+	nvmeotcp = mlx5e_nvmeotcp_init_rx(priv);
+
+	return tls && nvmeotcp;
 }
 
 static inline void mlx5e_accel_cleanup_rx(struct mlx5e_priv *priv)
 {
+	mlx5e_nvmeotcp_cleanup_rx(priv);
 	mlx5e_ktls_cleanup_rx(priv);
 }
 #endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
new file mode 100644
index 000000000000..843e653699e9
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
@@ -0,0 +1,984 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+
+#include <linux/netdevice.h>
+#include <linux/idr.h>
+#include <linux/nvme-tcp.h>
+#include "en_accel/nvmeotcp.h"
+#include "en_accel/nvmeotcp_utils.h"
+#include "en_accel/fs_tcp.h"
+#include "en/txrx.h"
+
+#define MAX_NVMEOTCP_QUEUES	(512)
+#define MIN_NVMEOTCP_QUEUES	(1)
+
+static const struct rhashtable_params rhash_queues = {
+	.key_len = sizeof(int),
+	.key_offset = offsetof(struct mlx5e_nvmeotcp_queue, id),
+	.head_offset = offsetof(struct mlx5e_nvmeotcp_queue, hash),
+	.automatic_shrinking = true,
+	.min_size = 1,
+	.max_size = MAX_NVMEOTCP_QUEUES,
+};
+
+#define MLX5_NVME_TCP_MAX_SEGMENTS 128
+
+static u32 mlx5e_get_max_sgl(struct mlx5_core_dev *mdev)
+{
+	return min_t(u32,
+		     MLX5_NVME_TCP_MAX_SEGMENTS,
+		     1 << MLX5_CAP_GEN(mdev, log_max_klm_list_size));
+}
+
+static void mlx5e_nvmeotcp_destroy_tir(struct mlx5e_priv *priv, int tirn)
+{
+	mlx5_core_destroy_tir(priv->mdev, tirn);
+}
+
+static inline u32
+mlx5e_get_channel_ix_from_io_cpu(struct mlx5e_priv *priv, u32 io_cpu)
+{
+	int num_channels = priv->channels.params.num_channels;
+	u32 channel_ix = io_cpu;
+
+	if (channel_ix >= num_channels)
+		channel_ix = channel_ix % num_channels;
+
+	return channel_ix;
+}
+
+static int mlx5e_nvmeotcp_create_tir(struct mlx5e_priv *priv,
+				     struct sock *sk,
+				     struct nvme_tcp_ddp_config *config,
+				     struct mlx5e_nvmeotcp_queue *queue,
+				     bool zerocopy, bool crc_rx)
+{
+	u32 rqtn = priv->direct_tir[queue->channel_ix].rqt.rqtn;
+	int err, inlen;
+	void *tirc;
+	u32 tirn;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_tir_in);
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+	tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
+	MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
+	MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_RX_HASH_FN_INVERTED_XOR8);
+	MLX5_SET(tirc, tirc, indirect_table, rqtn);
+	MLX5_SET(tirc, tirc, transport_domain, priv->mdev->mlx5e_res.td.tdn);
+	if (zerocopy) {
+		MLX5_SET(tirc, tirc, nvmeotcp_zero_copy_en, 1);
+		MLX5_SET(tirc, tirc, nvmeotcp_tag_buffer_table_id,
+			 queue->tag_buf_table_id);
+	}
+
+	if (crc_rx)
+		MLX5_SET(tirc, tirc, nvmeotcp_crc_en, 1);
+
+	MLX5_SET(tirc, tirc, self_lb_block,
+		 MLX5_TIRC_SELF_LB_BLOCK_BLOCK_UNICAST |
+		 MLX5_TIRC_SELF_LB_BLOCK_BLOCK_MULTICAST);
+	err = mlx5_core_create_tir(priv->mdev, in, &tirn);
+
+	if (!err)
+		queue->tirn = tirn;
+
+	kvfree(in);
+	return err;
+}
+
+static
+int mlx5e_create_nvmeotcp_tag_buf_table(struct mlx5_core_dev *mdev,
+					struct mlx5e_nvmeotcp_queue *queue,
+					u8 log_table_size)
+{
+	u32 in[MLX5_ST_SZ_DW(create_nvmeotcp_tag_buf_table_in)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)];
+	u64 general_obj_types;
+	void *obj;
+	int err;
+
+	obj = MLX5_ADDR_OF(create_nvmeotcp_tag_buf_table_in, in,
+			   nvmeotcp_tag_buf_table_obj);
+
+	general_obj_types = MLX5_CAP_GEN_64(mdev, general_obj_types);
+	if (!(general_obj_types &
+	      MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE))
+		return -EINVAL;
+
+	MLX5_SET(general_obj_in_cmd_hdr, in, opcode,
+		 MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type,
+		 MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE);
+	MLX5_SET(nvmeotcp_tag_buf_table_obj, obj,
+		 log_tag_buffer_table_size, log_table_size);
+
+	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+	if (!err)
+		queue->tag_buf_table_id = MLX5_GET(general_obj_out_cmd_hdr,
+						   out, obj_id);
+	return err;
+}
+
+static
+void mlx5_destroy_nvmeotcp_tag_buf_table(struct mlx5_core_dev *mdev, u32 uid)
+{
+	u32 in[MLX5_ST_SZ_DW(general_obj_in_cmd_hdr)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)];
+
+	MLX5_SET(general_obj_in_cmd_hdr, in, opcode,
+		 MLX5_CMD_OP_DESTROY_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type,
+		 MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, uid);
+
+	mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_TIR_PARAMS 0x2
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS 0x2
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_UMR 0x0
+
+#define STATIC_PARAMS_DS_CNT \
+	DIV_ROUND_UP(MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ, MLX5_SEND_WQE_DS)
+
+#define PROGRESS_PARAMS_DS_CNT \
+	DIV_ROUND_UP(MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ, MLX5_SEND_WQE_DS)
+
+enum wqe_type {
+	KLM_UMR = 0,
+	BSF_KLM_UMR = 1,
+	SET_PSV_UMR = 2,
+	BSF_UMR = 3,
+	KLM_INV_UMR = 4,
+};
+
+static void
+fill_nvmeotcp_klm_wqe(struct mlx5e_nvmeotcp_queue *queue,
+		      struct mlx5e_umr_wqe *wqe, u16 ccid, u32 klm_entries,
+		      u16 klm_offset, enum wqe_type klm_type)
+{
+	struct scatterlist *sgl_mkey;
+	u32 lkey, i;
+
+	if (klm_type == BSF_KLM_UMR) {
+		for (i = 0; i < klm_entries; i++) {
+			lkey = queue->ccid_table[i + klm_offset].klm_mkey.key;
+			wqe->inline_klms[i].bcount = cpu_to_be32(1);
+			wqe->inline_klms[i].key	   = cpu_to_be32(lkey);
+			wqe->inline_klms[i].va	   = 0;
+		}
+	} else {
+		lkey = queue->priv->mdev->mlx5e_res.mkey.key;
+		for (i = 0; i < klm_entries; i++) {
+			sgl_mkey = &queue->ccid_table[ccid].sgl[i + klm_offset];
+			wqe->inline_klms[i].bcount = cpu_to_be32(sgl_mkey->length);
+			wqe->inline_klms[i].key	   = cpu_to_be32(lkey);
+			wqe->inline_klms[i].va	   = cpu_to_be64(sgl_mkey->dma_address);
+		}
+	}
+}
+
+static void
+build_nvmeotcp_klm_umr(struct mlx5e_nvmeotcp_queue *queue,
+		       struct mlx5e_umr_wqe *wqe, u16 ccid, int klm_entries,
+		       u32 klm_offset, u32 len, enum wqe_type klm_type)
+{
+	u32 id = (klm_type == KLM_UMR) ? queue->ccid_table[ccid].klm_mkey.key :
+		(queue->tirn << MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT);
+	u8 opc_mod = (klm_type == KLM_UMR) ? MLX5_CTRL_SEGMENT_OPC_MOD_UMR_UMR :
+		MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS;
+	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
+	struct mlx5_mkey_seg	       *mkc = &wqe->mkc;
+
+	u32 sqn = queue->sq->icosq.sqn;
+	u16 pc = queue->sq->icosq.pc;
+
+	cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+					     MLX5_OPCODE_UMR | (opc_mod) << 24);
+	cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				   MLX5E_KLM_UMR_DS_CNT(ALIGN(klm_entries, KLM_ALIGNMENT)));
+	cseg->general_id = cpu_to_be32(id);
+
+	if (!klm_entries) { /* this is invalidate */
+		ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
+		ucseg->flags = MLX5_UMR_INLINE;
+		mkc->status = MLX5_MKEY_STATUS_FREE;
+		return;
+	}
+
+	if (klm_type == KLM_UMR && !klm_offset) {
+		ucseg->mkey_mask |= cpu_to_be64(MLX5_MKEY_MASK_XLT_OCT_SIZE);
+		mkc->xlt_oct_size = cpu_to_be32(ALIGN(len, KLM_ALIGNMENT));
+	}
+
+	ucseg->flags = MLX5_UMR_INLINE | MLX5_UMR_TRANSLATION_OFFSET_EN;
+	ucseg->xlt_octowords = cpu_to_be16(ALIGN(klm_entries, KLM_ALIGNMENT));
+	ucseg->xlt_offset = cpu_to_be16(klm_offset);
+	fill_nvmeotcp_klm_wqe(queue, wqe, ccid, klm_entries, klm_offset, klm_type);
+}
+
+static void
+fill_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue,
+			      struct mlx5_seg_nvmeotcp_progress_params *params,
+			      u32 seq)
+{
+	void *ctx = params->ctx;
+
+	MLX5_SET(nvmeotcp_progress_params, ctx,
+		 next_pdu_tcp_sn, seq);
+	MLX5_SET(nvmeotcp_progress_params, ctx, valid, 1);
+	MLX5_SET(nvmeotcp_progress_params, ctx, pdu_tracker_state,
+		 MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_START);
+}
+
+void
+build_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue,
+			       struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe,
+			       u32 seq)
+{
+	struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+	u32 sqn = queue->sq->icosq.sqn;
+	u16 pc = queue->sq->icosq.pc;
+	u8 opc_mod;
+
+	memset(wqe, 0, MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ);
+	opc_mod = MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_PROGRESS_PARAMS;
+	cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+					     MLX5_OPCODE_SET_PSV | (opc_mod << 24));
+	cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				   PROGRESS_PARAMS_DS_CNT);
+	cseg->general_id = cpu_to_be32(queue->tirn <<
+				       MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT);
+	fill_nvmeotcp_progress_params(queue, &wqe->params, seq);
+}
+
+static void
+fill_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue,
+			    struct mlx5_seg_nvmeotcp_static_params *params,
+			    u32 resync_seq, bool zero_copy_en,
+			    bool ddgst_offload_en)
+{
+	void *ctx = params->ctx;
+
+	MLX5_SET(transport_static_params, ctx, const_1, 1);
+	MLX5_SET(transport_static_params, ctx, const_2, 2);
+	MLX5_SET(transport_static_params, ctx, acc_type,
+		 MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP);
+	MLX5_SET(transport_static_params, ctx, nvme_resync_tcp_sn, resync_seq);
+	MLX5_SET(transport_static_params, ctx, pda, queue->pda);
+	MLX5_SET(transport_static_params, ctx, ddgst_en, queue->dgst);
+	MLX5_SET(transport_static_params, ctx, ddgst_offload_en, ddgst_offload_en);
+	MLX5_SET(transport_static_params, ctx, hddgst_en, 0);
+	MLX5_SET(transport_static_params, ctx, hdgst_offload_en, 0);
+	MLX5_SET(transport_static_params, ctx, ti,
+		 MLX5_TRANSPORT_STATIC_PARAMS_TI_INITIATOR);
+	MLX5_SET(transport_static_params, ctx, zero_copy_en, zero_copy_en);
+}
+
+void
+build_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue,
+			     struct mlx5e_set_nvmeotcp_static_params_wqe *wqe,
+			     u32 resync_seq, bool zerocopy, bool crc_rx)
+{
+	u8 opc_mod = MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS;
+	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
+	u32 sqn = queue->sq->icosq.sqn;
+	u16 pc = queue->sq->icosq.pc;
+
+	memset(wqe, 0, MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ);
+
+	cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+					     MLX5_OPCODE_UMR | (opc_mod) << 24);
+	cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				   STATIC_PARAMS_DS_CNT);
+	cseg->imm = cpu_to_be32(queue->tirn << MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT);
+
+	ucseg->flags = MLX5_UMR_INLINE;
+	ucseg->bsf_octowords =
+		cpu_to_be16(MLX5E_NVMEOTCP_STATIC_PARAMS_OCTWORD_SIZE);
+	fill_nvmeotcp_static_params(queue, &wqe->params, resync_seq, zerocopy, crc_rx);
+}
+
+static void
+mlx5e_nvmeotcp_fill_wi(struct mlx5e_nvmeotcp_queue *nvmeotcp_queue,
+		       struct mlx5e_icosq *sq, u32 wqe_bbs,
+		       u16 pi, u16 ccid, enum wqe_type type)
+{
+	struct mlx5e_icosq_wqe_info *wi = &sq->db.wqe_info[pi];
+
+	wi->num_wqebbs = wqe_bbs;
+	switch (type) {
+	case SET_PSV_UMR:
+		wi->wqe_type = MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP;
+		break;
+	case KLM_INV_UMR:
+		wi->wqe_type = MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE;
+		break;
+	default:
+		wi->wqe_type = MLX5E_ICOSQ_WQE_UMR_NVME_TCP;
+		break;
+	}
+
+	if (type == KLM_INV_UMR)
+		wi->nvmeotcp_qe.entry = &nvmeotcp_queue->ccid_table[ccid];
+	else if (type == SET_PSV_UMR)
+		wi->nvmeotcp_q.queue = nvmeotcp_queue;
+}
+
+static void
+mlx5e_nvmeotcp_rx_post_static_params_wqe(struct mlx5e_nvmeotcp_queue *queue,
+					 u32 resync_seq)
+{
+	struct mlx5e_set_nvmeotcp_static_params_wqe *wqe;
+	struct mlx5e_icosq *sq = &queue->sq->icosq;
+	u16 pi, wqe_bbs;
+
+	wqe_bbs = MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS;
+	pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs);
+	wqe = MLX5E_NVMEOTCP_FETCH_STATIC_PARAMS_WQE(sq, pi);
+	mlx5e_nvmeotcp_fill_wi(NULL, sq, wqe_bbs, pi, 0, BSF_UMR);
+	build_nvmeotcp_static_params(queue, wqe, resync_seq, queue->zerocopy, queue->crc_rx);
+	sq->pc += wqe_bbs;
+	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &wqe->ctrl);
+}
+
+static void
+mlx5e_nvmeotcp_rx_post_progress_params_wqe(struct mlx5e_nvmeotcp_queue *queue,
+					   u32 seq)
+{
+	struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe;
+	struct mlx5e_icosq *sq = &queue->sq->icosq;
+	u16 pi, wqe_bbs;
+
+	wqe_bbs = MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQEBBS;
+	pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs);
+	wqe = MLX5E_NVMEOTCP_FETCH_PROGRESS_PARAMS_WQE(sq, pi);
+	mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi, 0, SET_PSV_UMR);
+	build_nvmeotcp_progress_params(queue, wqe, seq);
+	sq->pc += wqe_bbs;
+	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &wqe->ctrl);
+}
+
+static void
+post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue,
+	     enum wqe_type wqe_type,
+	     u16 ccid,
+	     u32 klm_length,
+	     u32 *klm_offset)
+{
+	struct mlx5e_icosq *sq = &queue->sq->icosq;
+	u32 wqe_bbs, cur_klm_entries;
+	struct mlx5e_umr_wqe *wqe;
+	u16 pi, wqe_sz;
+
+	cur_klm_entries = min_t(int, queue->max_klms_per_wqe,
+				klm_length - *klm_offset);
+	wqe_sz = MLX5E_KLM_UMR_WQE_SZ(ALIGN(cur_klm_entries, KLM_ALIGNMENT));
+	wqe_bbs = DIV_ROUND_UP(wqe_sz, MLX5_SEND_WQE_BB);
+	pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs);
+	wqe = MLX5E_NVMEOTCP_FETCH_KLM_WQE(sq, pi);
+	mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi, ccid,
+			       klm_length ? KLM_UMR : KLM_INV_UMR);
+	build_nvmeotcp_klm_umr(queue, wqe, ccid, cur_klm_entries, *klm_offset,
+			       klm_length, wqe_type);
+	*klm_offset += cur_klm_entries;
+	sq->pc += wqe_bbs;
+	sq->doorbell_cseg = &wqe->ctrl;
+}
+
+static int
+mlx5e_nvmeotcp_post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue,
+			    enum wqe_type wqe_type,
+			    u16 ccid,
+			    u32 klm_length)
+{
+	u32 klm_offset = 0, wqes, wqe_sz, max_wqe_bbs, i, room;
+	struct mlx5e_icosq *sq = &queue->sq->icosq;
+
+	/* TODO: set stricter wqe_sz; using max for now */
+	if (klm_length == 0) {
+		wqes = 1;
+		wqe_sz = MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS;
+	} else {
+		wqes = DIV_ROUND_UP(klm_length, queue->max_klms_per_wqe);
+		wqe_sz = MLX5E_KLM_UMR_WQE_SZ(queue->max_klms_per_wqe);
+	}
+
+	max_wqe_bbs = DIV_ROUND_UP(wqe_sz, MLX5_SEND_WQE_BB);
+
+	room = mlx5e_stop_room_for_wqe(max_wqe_bbs) * wqes;
+	if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, room)))
+		return -ENOSPC;
+
+	for (i = 0; i < wqes; i++)
+		post_klm_wqe(queue, wqe_type, ccid, klm_length, &klm_offset);
+
+	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, sq->doorbell_cseg);
+	return 0;
+}
+
+static int mlx5e_create_nvmeotcp_mkey(struct mlx5_core_dev *mdev,
+				      u8 access_mode,
+				      u32 translation_octword_size,
+				      struct mlx5_core_mkey *mkey)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
+	void *mkc;
+	u32 *in;
+	int err;
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, free, 1);
+	MLX5_SET(mkc, mkc, translations_octword_size, translation_octword_size);
+	MLX5_SET(mkc, mkc, umr_en, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, access_mode_1_0, access_mode);
+
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
+	MLX5_SET(mkc, mkc, length64, 1);
+
+	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
+
+	kvfree(in);
+	return err;
+}
+
+static int
+mlx5e_nvmeotcp_offload_limits(struct net_device *netdev,
+			      struct tcp_ddp_limits *limits)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+
+	limits->max_ddp_sgl_len = mlx5e_get_max_sgl(mdev);
+	return 0;
+}
+
+static void
+mlx5e_nvmeotcp_destroy_sq(struct mlx5e_nvmeotcp_sq *nvmeotcpsq)
+{
+	mlx5e_deactivate_icosq(&nvmeotcpsq->icosq);
+	mlx5e_close_icosq(&nvmeotcpsq->icosq);
+	mlx5e_close_cq(&nvmeotcpsq->icosq.cq);
+	list_del(&nvmeotcpsq->list);
+	kfree(nvmeotcpsq);
+}
+
+static int
+mlx5e_nvmeotcp_build_icosq(struct mlx5e_nvmeotcp_queue *queue,
+			   struct mlx5e_priv *priv)
+{
+	u16 max_sgl, max_klm_per_wqe, max_umr_per_ccid, sgl_rest, wqebbs_rest;
+	struct mlx5e_channel *c = priv->channels.c[queue->channel_ix];
+	struct mlx5e_sq_param icosq_param = {0};
+	struct dim_cq_moder icocq_moder = {0};
+	struct mlx5e_nvmeotcp_sq *nvmeotcp_sq;
+	struct mlx5e_create_cq_param ccp;
+	struct mlx5e_icosq *icosq;
+	int err = -ENOMEM;
+	u16 log_icosq_sz; 
+	u32 max_wqebbs;
+
+	nvmeotcp_sq = kzalloc(sizeof(*nvmeotcp_sq), GFP_KERNEL);
+	if (!nvmeotcp_sq)
+		return err;
+
+	icosq = &nvmeotcp_sq->icosq;
+	max_sgl = mlx5e_get_max_sgl(priv->mdev);
+	max_klm_per_wqe = queue->max_klms_per_wqe;
+	max_umr_per_ccid = max_sgl / max_klm_per_wqe;
+	sgl_rest = max_sgl % max_klm_per_wqe;
+	wqebbs_rest = sgl_rest ? MLX5E_KLM_UMR_WQEBBS(sgl_rest) : 0;
+	max_wqebbs = (MLX5E_KLM_UMR_WQEBBS(max_klm_per_wqe) *
+		     max_umr_per_ccid + wqebbs_rest) * queue->size;
+	log_icosq_sz = order_base_2(max_wqebbs);
+
+	mlx5e_build_icosq_param(priv, log_icosq_sz, &icosq_param);
+	mlx5e_build_create_cq_param(&ccp, c);
+	err = mlx5e_open_cq(priv, icocq_moder, &icosq_param.cqp, &ccp, &icosq->cq);
+	if (err)
+		goto err_nvmeotcp_sq;
+
+	err = mlx5e_open_icosq(c, &priv->channels.params, &icosq_param, icosq);
+	if (err)
+		goto close_cq;
+
+	INIT_LIST_HEAD(&nvmeotcp_sq->list);
+	spin_lock(&c->nvmeotcp_icosq_lock);
+	list_add(&nvmeotcp_sq->list, &c->list_nvmeotcpsq);
+	spin_unlock(&c->nvmeotcp_icosq_lock);
+	queue->sq = nvmeotcp_sq;
+	mlx5e_activate_icosq(icosq);
+	return 0;
+
+close_cq:
+	mlx5e_close_cq(&icosq->cq);
+err_nvmeotcp_sq:
+	kfree(nvmeotcp_sq);
+
+	return err;
+}
+
+static void
+mlx5e_nvmeotcp_destroy_rx(struct mlx5e_nvmeotcp_queue *queue,
+			  struct mlx5_core_dev *mdev, bool zerocopy)
+{
+	int i;
+
+	mlx5e_accel_fs_del_sk(queue->fh);
+	for (i = 0; i < queue->size && zerocopy; i++)
+		mlx5_core_destroy_mkey(mdev, &queue->ccid_table[i].klm_mkey);
+
+	mlx5e_nvmeotcp_destroy_tir(queue->priv, queue->tirn);
+	if (zerocopy) {
+		kfree(queue->ccid_table);
+		mlx5_destroy_nvmeotcp_tag_buf_table(mdev, queue->tag_buf_table_id);
+		static_branch_dec(&skip_copy_enabled);
+	}
+
+	mlx5e_nvmeotcp_destroy_sq(queue->sq);
+}
+
+static int
+mlx5e_nvmeotcp_queue_rx_init(struct mlx5e_nvmeotcp_queue *queue,
+			     struct nvme_tcp_ddp_config *config,
+			     struct net_device *netdev,
+			     bool zerocopy, bool crc)
+{
+	u8 log_queue_size = order_base_2(config->queue_size);
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct sock *sk = queue->sk;
+	int err, max_sgls, i;
+
+	if (zerocopy) {
+		if (config->queue_size >
+		    BIT(MLX5_CAP_DEV_NVMEOTCP(mdev, log_max_nvmeotcp_tag_buffer_size))) {
+			return -EINVAL;
+		}
+
+		err = mlx5e_create_nvmeotcp_tag_buf_table(mdev, queue, log_queue_size);
+		if (err)
+			return err;
+	}
+
+	err = mlx5e_nvmeotcp_build_icosq(queue, priv);
+	if (err)
+		goto destroy_tag_buffer_table;
+
+	/* initializes queue->tirn */
+	err = mlx5e_nvmeotcp_create_tir(priv, sk, config, queue, zerocopy, crc);
+	if (err)
+		goto destroy_icosq;
+
+	mlx5e_nvmeotcp_rx_post_static_params_wqe(queue, 0);
+	mlx5e_nvmeotcp_rx_post_progress_params_wqe(queue, tcp_sk(sk)->copied_seq);
+
+	if (zerocopy) {
+		queue->ccid_table = kcalloc(queue->size,
+					    sizeof(struct nvmeotcp_queue_entry),
+					    GFP_KERNEL);
+		if (!queue->ccid_table) {
+			err = -ENOMEM;
+			goto destroy_tir;
+		}
+
+		max_sgls = mlx5e_get_max_sgl(mdev);
+		for (i = 0; i < queue->size; i++) {
+			err = mlx5e_create_nvmeotcp_mkey(mdev,
+							 MLX5_MKC_ACCESS_MODE_KLMS,
+							 max_sgls,
+							 &queue->ccid_table[i].klm_mkey);
+			if (err)
+				goto free_sgl;
+		}
+
+		err = mlx5e_nvmeotcp_post_klm_wqe(queue, BSF_KLM_UMR, 0, queue->size);
+		if (err)
+			goto free_sgl;
+	}
+
+	if (!(WARN_ON(!wait_for_completion_timeout(&queue->done, 0))))
+		queue->fh = mlx5e_accel_fs_add_sk(priv, sk, queue->tirn, queue->id);
+
+	if (IS_ERR_OR_NULL(queue->fh)) {
+		err = -EINVAL;
+		goto free_sgl;
+	}
+
+	if (zerocopy)
+		static_branch_inc(&skip_copy_enabled);
+
+	return 0;
+
+free_sgl:
+	while ((i--) && zerocopy)
+		mlx5_core_destroy_mkey(mdev, &queue->ccid_table[i].klm_mkey);
+
+	if (zerocopy)
+		kfree(queue->ccid_table);
+destroy_tir:
+	mlx5e_nvmeotcp_destroy_tir(priv, queue->tirn);
+destroy_icosq:
+	mlx5e_nvmeotcp_destroy_sq(queue->sq);
+destroy_tag_buffer_table:
+	if (zerocopy)
+		mlx5_destroy_nvmeotcp_tag_buf_table(mdev, queue->tag_buf_table_id);
+
+	return err;
+}
+
+#define OCTWORD_SHIFT 4
+#define MAX_DS_VALUE 63
+static int
+mlx5e_nvmeotcp_queue_init(struct net_device *netdev,
+			  struct sock *sk,
+			  struct tcp_ddp_config *tconfig)
+{
+	struct nvme_tcp_ddp_config *config = (struct nvme_tcp_ddp_config *)tconfig;
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5e_nvmeotcp_queue *queue;
+	int max_wqe_sz_cap, queue_id, err;
+
+	if (tconfig->type != TCP_DDP_NVME) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	queue_id = ida_simple_get(&priv->nvmeotcp->queue_ids,
+				  MIN_NVMEOTCP_QUEUES, MAX_NVMEOTCP_QUEUES,
+				  GFP_KERNEL);
+	if (queue_id < 0) {
+		err = -ENOSPC;
+		goto free_queue;
+	}
+
+	queue->crc_rx = (config->dgst & NVME_TCP_DATA_DIGEST_ENABLE) &&
+			(netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX);
+	queue->zerocopy = (netdev->features & NETIF_F_HW_TCP_DDP);
+	queue->tcp_ddp_ctx.type = TCP_DDP_NVME;
+	queue->sk = sk;
+	queue->id = queue_id;
+	queue->dgst = config->dgst;
+	queue->pda = config->cpda;
+	queue->channel_ix = mlx5e_get_channel_ix_from_io_cpu(priv,
+							     config->io_cpu);
+	queue->size = config->queue_size;
+	max_wqe_sz_cap  = min_t(int, MAX_DS_VALUE * MLX5_SEND_WQE_DS,
+				MLX5_CAP_GEN(mdev, max_wqe_sz_sq) << OCTWORD_SHIFT);
+	queue->max_klms_per_wqe = MLX5E_KLM_ENTRIES_PER_WQE(max_wqe_sz_cap);
+	queue->priv = priv;
+	init_completion(&queue->done);
+
+	if (queue->zerocopy || queue->crc_rx) {
+		err = mlx5e_nvmeotcp_queue_rx_init(queue, config, netdev,
+						   queue->zerocopy, queue->crc_rx);
+		if (err)
+			goto remove_queue_id;
+	}
+
+	err = rhashtable_insert_fast(&priv->nvmeotcp->queue_hash, &queue->hash,
+				     rhash_queues);
+	if (err)
+		goto destroy_rx;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rcu_assign_pointer(inet_csk(sk)->icsk_ulp_ddp_data, queue);
+	write_unlock_bh(&sk->sk_callback_lock);
+	refcount_set(&queue->ref_count, 1);
+	return err;
+
+destroy_rx:
+	if (queue->zerocopy || queue->crc_rx)
+		mlx5e_nvmeotcp_destroy_rx(queue, mdev, queue->zerocopy);
+remove_queue_id:
+	ida_simple_remove(&priv->nvmeotcp->queue_ids, queue_id);
+free_queue:
+	kfree(queue);
+out:
+	return err;
+}
+
+static void
+mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev,
+			      struct sock *sk)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5e_nvmeotcp_queue *queue;
+
+	queue = (struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+
+	napi_synchronize(&priv->channels.c[queue->channel_ix]->napi);
+
+	WARN_ON(refcount_read(&queue->ref_count) != 1);
+	if (queue->zerocopy || queue->crc_rx)
+		mlx5e_nvmeotcp_destroy_rx(queue, mdev, queue->zerocopy);
+
+	rhashtable_remove_fast(&priv->nvmeotcp->queue_hash, &queue->hash,
+			       rhash_queues);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rcu_assign_pointer(inet_csk(sk)->icsk_ulp_ddp_data, NULL);
+	write_unlock_bh(&sk->sk_callback_lock);
+	mlx5e_nvmeotcp_put_queue(queue);
+}
+
+static int
+mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev,
+			 struct sock *sk,
+			 struct tcp_ddp_io *ddp)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct scatterlist *sg = ddp->sg_table.sgl;
+	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5_core_dev *mdev;
+	int count = 0;
+
+	queue = (struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+
+	mdev = queue->priv->mdev;
+	count = dma_map_sg(mdev->device, ddp->sg_table.sgl, ddp->nents,
+			   DMA_FROM_DEVICE);
+
+	if (WARN_ON(count > mlx5e_get_max_sgl(mdev)))
+		return -ENOSPC;
+
+	queue->ccid_table[ddp->command_id].ddp = ddp;
+	queue->ccid_table[ddp->command_id].sgl = sg;
+	queue->ccid_table[ddp->command_id].ccid_gen++;
+	queue->ccid_table[ddp->command_id].sgl_length = count;
+
+	return 0;
+}
+
+void mlx5e_nvmeotcp_ddp_inv_done(struct mlx5e_icosq_wqe_info *wi)
+{
+	struct nvmeotcp_queue_entry *q_entry = wi->nvmeotcp_qe.entry;
+	struct mlx5e_nvmeotcp_queue *queue = q_entry->queue;
+	struct mlx5_core_dev *mdev = queue->priv->mdev;
+	struct tcp_ddp_io *ddp = q_entry->ddp;
+	const struct tcp_ddp_ulp_ops *ulp_ops;
+
+	dma_unmap_sg(mdev->device, ddp->sg_table.sgl,
+		     q_entry->sgl_length, DMA_FROM_DEVICE);
+
+	q_entry->sgl_length = 0;
+
+	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;
+	if (ulp_ops && ulp_ops->ddp_teardown_done)
+		ulp_ops->ddp_teardown_done(q_entry->ddp_ctx);
+}
+
+void mlx5e_nvmeotcp_ctx_comp(struct mlx5e_icosq_wqe_info *wi)
+{
+	struct mlx5e_nvmeotcp_queue *queue = wi->nvmeotcp_q.queue;
+
+	if (unlikely(!queue))
+		return;
+
+	complete(&queue->done);
+}
+
+static int
+mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev,
+			    struct sock *sk,
+			    struct tcp_ddp_io *ddp,
+			    void *ddp_ctx)
+{
+	struct mlx5e_nvmeotcp_queue *queue =
+		(struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct nvmeotcp_queue_entry *q_entry;
+
+	q_entry  = &queue->ccid_table[ddp->command_id];
+	WARN_ON(q_entry->sgl_length == 0);
+
+	q_entry->ddp_ctx = ddp_ctx;
+	q_entry->queue = queue;
+
+	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, 0);
+
+	return 0;
+}
+
+static void
+mlx5e_nvmeotcp_dev_resync(struct net_device *netdev,
+			  struct sock *sk, u32 seq)
+{
+	struct mlx5e_nvmeotcp_queue *queue =
+				(struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+
+	mlx5e_nvmeotcp_rx_post_static_params_wqe(queue, seq);
+}
+
+static const struct tcp_ddp_dev_ops mlx5e_nvmeotcp_ops = {
+	.tcp_ddp_limits = mlx5e_nvmeotcp_offload_limits,
+	.tcp_ddp_sk_add = mlx5e_nvmeotcp_queue_init,
+	.tcp_ddp_sk_del = mlx5e_nvmeotcp_queue_teardown,
+	.tcp_ddp_setup = mlx5e_nvmeotcp_ddp_setup,
+	.tcp_ddp_teardown = mlx5e_nvmeotcp_ddp_teardown,
+	.tcp_ddp_resync = mlx5e_nvmeotcp_dev_resync,
+};
+
+struct mlx5e_nvmeotcp_queue *
+mlx5e_nvmeotcp_get_queue(struct mlx5e_nvmeotcp *nvmeotcp, int id)
+{
+	struct mlx5e_nvmeotcp_queue *queue;
+
+	rcu_read_lock();
+	queue = rhashtable_lookup_fast(&nvmeotcp->queue_hash,
+				       &id, rhash_queues);
+	if (queue && !IS_ERR(queue))
+		if (!refcount_inc_not_zero(&queue->ref_count))
+			queue = NULL;
+	rcu_read_unlock();
+	return queue;
+}
+
+void mlx5e_nvmeotcp_put_queue(struct mlx5e_nvmeotcp_queue *queue)
+{
+	if (refcount_dec_and_test(&queue->ref_count))
+		kfree(queue);
+}
+
+int set_feature_nvme_tcp(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	int err = 0;
+
+	mutex_lock(&priv->state_lock);
+	if (enable)
+		err = mlx5e_accel_fs_tcp_create(priv);
+	else
+		mlx5e_accel_fs_tcp_destroy(priv);
+	mutex_unlock(&priv->state_lock);
+	if (err)
+		return err;
+
+	priv->nvmeotcp->enable = enable;
+	err = mlx5e_safe_reopen_channels(priv);
+	return err;
+}
+
+int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	int err = 0;
+
+	mutex_lock(&priv->state_lock);
+	if (enable)
+		err = mlx5e_accel_fs_tcp_create(priv);
+	else
+		mlx5e_accel_fs_tcp_destroy(priv);
+	mutex_unlock(&priv->state_lock);
+
+	priv->nvmeotcp->crc_rx_enable = enable;
+	err = mlx5e_safe_reopen_channels(priv);
+	if (err)
+		netdev_err(priv->netdev,
+			   "%s failed to reopen channels, err(%d).\n",
+			   __func__, err);
+
+	return err;
+}
+
+void mlx5e_nvmeotcp_build_netdev(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	if (!MLX5_CAP_GEN(priv->mdev, nvmeotcp))
+		return;
+
+	if (MLX5_CAP_DEV_NVMEOTCP(priv->mdev, zerocopy)) {
+		netdev->features |= NETIF_F_HW_TCP_DDP;
+		netdev->hw_features |= NETIF_F_HW_TCP_DDP;
+	}
+
+	if (MLX5_CAP_DEV_NVMEOTCP(priv->mdev, crc_rx)) {
+		netdev->features |= NETIF_F_HW_TCP_DDP_CRC_RX;
+		netdev->hw_features |= NETIF_F_HW_TCP_DDP_CRC_RX;
+	}
+
+	netdev->tcp_ddp_ops = &mlx5e_nvmeotcp_ops;
+	priv->nvmeotcp->enable = true;
+}
+
+int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv)
+{
+	int ret = 0;
+
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP) {
+		ret = mlx5e_accel_fs_tcp_create(priv);
+		if (ret)
+			return ret;
+	}
+
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX)
+		ret = mlx5e_accel_fs_tcp_create(priv);
+
+	return ret;
+}
+
+void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv)
+{
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP)
+		mlx5e_accel_fs_tcp_destroy(priv);
+
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX)
+		mlx5e_accel_fs_tcp_destroy(priv);
+}
+
+int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv)
+{
+	struct mlx5e_nvmeotcp *nvmeotcp = kzalloc(sizeof(*nvmeotcp), GFP_KERNEL);
+	int ret = 0;
+
+	if (!nvmeotcp)
+		return -ENOMEM;
+
+	ida_init(&nvmeotcp->queue_ids);
+	ret = rhashtable_init(&nvmeotcp->queue_hash, &rhash_queues);
+	if (ret)
+		goto err_ida;
+
+	priv->nvmeotcp = nvmeotcp;
+	goto out;
+
+err_ida:
+	ida_destroy(&nvmeotcp->queue_ids);
+	kfree(nvmeotcp);
+out:
+	return ret;
+}
+
+void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv)
+{
+	struct mlx5e_nvmeotcp *nvmeotcp = priv->nvmeotcp;
+
+	if (!nvmeotcp)
+		return;
+
+	rhashtable_destroy(&nvmeotcp->queue_hash);
+	ida_destroy(&nvmeotcp->queue_ids);
+	kfree(nvmeotcp);
+	priv->nvmeotcp = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
new file mode 100644
index 000000000000..5be300d8299e
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies. */
+#ifndef __MLX5E_NVMEOTCP_H__
+#define __MLX5E_NVMEOTCP_H__
+
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+
+#include "net/tcp_ddp.h"
+#include "en.h"
+#include "en/params.h"
+
+struct nvmeotcp_queue_entry {
+	struct mlx5e_nvmeotcp_queue	*queue;
+	u32				sgl_length;
+	struct mlx5_core_mkey		klm_mkey;
+	struct scatterlist		*sgl;
+	u32				ccid_gen;
+
+	/* for the ddp invalidate done callback */
+	void				*ddp_ctx;
+	struct tcp_ddp_io		*ddp;
+};
+
+struct mlx5e_nvmeotcp_sq {
+	struct list_head		list;
+	struct mlx5e_icosq		icosq;
+};
+
+/**
+ *	struct mlx5e_nvmeotcp_queue - MLX5 metadata for NVMEoTCP queue
+ *	@fh: Flow handle representing the 5-tuple steering for this flow
+ *	@tirn: Destination TIR number created for NVMEoTCP offload
+ *	@id: Flow tag ID used to identify this queue
+ *	@size: NVMEoTCP queue depth
+ *	@sq: Send queue used for sending control messages
+ *	@ccid_table: Table holding metadata for each CC
+ *	@tag_buf_table_id: Tag buffer table for CCIDs
+ *	@hash: Hash table of queues mapped by @id
+ *	@ref_count: Reference count for this structure
+ *	@ccoff: Offset within the current CC
+ *	@pda: Padding alignment
+ *	@ccid_gen: Generation ID for the CCID, used to avoid conflicts in DDP
+ *	@max_klms_per_wqe: Number of KLMs per DDP operation
+ *	@channel_ix: Channel IX for this nvmeotcp_queue
+ *	@sk: The socket used by the NVMe-TCP queue
+ *	@zerocopy: if this queue is used for zerocopy offload.
+ *	@crc_rx: if this queue is used for CRC Rx offload.
+ *	@ccid: ID of the current CC
+ *	@ccsglidx: Index within the scatter-gather list (SGL) of the current CC
+ *	@ccoff_inner: Current offset within the @ccsglidx element
+ *	@priv: mlx5e netdev priv
+ *	@done: Completion for HW context operations (flow steering flow)
+ */
+struct mlx5e_nvmeotcp_queue {
+	struct tcp_ddp_ctx		tcp_ddp_ctx;
+	struct mlx5_flow_handle		*fh;
+	int				tirn;
+	int				id;
+	u32				size;
+	struct mlx5e_nvmeotcp_sq	*sq;
+	struct nvmeotcp_queue_entry	*ccid_table;
+	u32				tag_buf_table_id;
+	struct rhash_head		hash;
+	refcount_t			ref_count;
+	bool				dgst;
+	int				pda;
+	u32				ccid_gen;
+	u32				max_klms_per_wqe;
+	u32				channel_ix;
+	struct sock			*sk;
+	bool				zerocopy;
+	bool				crc_rx;
+
+	/* current ccid fields */
+	off_t				ccoff;
+	int				ccid;
+	int				ccsglidx;
+	int				ccoff_inner;
+
+	/* for ddp invalidate flow */
+	struct mlx5e_priv		*priv;
+
+	/* for flow_steering flow */
+	struct completion		done;
+};
+
+struct mlx5e_nvmeotcp {
+	struct ida			queue_ids;
+	struct rhashtable		queue_hash;
+	bool				enable;
+	bool				crc_rx_enable;
+};
+
+void mlx5e_nvmeotcp_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv);
+int set_feature_nvme_tcp(struct net_device *netdev, bool enable);
+int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable);
+void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv);
+struct mlx5e_nvmeotcp_queue *
+mlx5e_nvmeotcp_get_queue(struct mlx5e_nvmeotcp *nvmeotcp, int id);
+void mlx5e_nvmeotcp_put_queue(struct mlx5e_nvmeotcp_queue *queue);
+void mlx5e_nvmeotcp_ddp_inv_done(struct mlx5e_icosq_wqe_info *wi);
+void mlx5e_nvmeotcp_ctx_comp(struct mlx5e_icosq_wqe_info *wi);
+int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv);
+void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv);
+#else
+
+static inline void mlx5e_nvmeotcp_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv) { }
+static inline int set_feature_nvme_tcp(struct net_device *netdev, bool enable) { return 0; }
+static inline int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable) { return 0; }
+static inline int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv) { }
+#endif
+#endif /* __MLX5E_NVMEOTCP_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
new file mode 100644
index 000000000000..3848fcec59c3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies. */
+
+#ifndef __MLX5E_NVMEOTCP_UTILS_H__
+#define __MLX5E_NVMEOTCP_UTILS_H__
+
+#include "en.h"
+#include "en_accel/nvmeotcp.h"
+
+enum {
+	MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_START     = 0,
+	MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_TRACKING  = 1,
+	MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_SEARCHING = 2,
+};
+
+struct mlx5_seg_nvmeotcp_static_params {
+	u8     ctx[MLX5_ST_SZ_BYTES(transport_static_params)];
+};
+
+struct mlx5_seg_nvmeotcp_progress_params {
+	u8     ctx[MLX5_ST_SZ_BYTES(nvmeotcp_progress_params)];
+};
+
+struct mlx5e_set_nvmeotcp_static_params_wqe {
+	struct mlx5_wqe_ctrl_seg          ctrl;
+	struct mlx5_wqe_umr_ctrl_seg      uctrl;
+	struct mlx5_mkey_seg              mkc;
+	struct mlx5_seg_nvmeotcp_static_params params;
+};
+
+struct mlx5e_set_nvmeotcp_progress_params_wqe {
+	struct mlx5_wqe_ctrl_seg            ctrl;
+	struct mlx5_seg_nvmeotcp_progress_params params;
+};
+
+struct mlx5e_get_psv_wqe {
+	struct mlx5_wqe_ctrl_seg ctrl;
+	struct mlx5_seg_get_psv  psv;
+};
+
+/* WQE sizes and fetch helpers for the NVMEoTCP param WQEs */
+#define MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ \
+	(sizeof(struct mlx5e_set_nvmeotcp_static_params_wqe))
+
+#define MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ \
+	(sizeof(struct mlx5e_set_nvmeotcp_progress_params_wqe))
+#define MLX5E_NVMEOTCP_STATIC_PARAMS_OCTWORD_SIZE \
+	(MLX5_ST_SZ_BYTES(transport_static_params) / MLX5_SEND_WQE_DS)
+
+#define MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS \
+	(DIV_ROUND_UP(MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ, MLX5_SEND_WQE_BB))
+#define MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQEBBS \
+	(DIV_ROUND_UP(MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ, MLX5_SEND_WQE_BB))
+
+#define MLX5E_NVMEOTCP_FETCH_STATIC_PARAMS_WQE(sq, pi) \
+	((struct mlx5e_set_nvmeotcp_static_params_wqe *)\
+	 mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_set_nvmeotcp_static_params_wqe)))
+
+#define MLX5E_NVMEOTCP_FETCH_PROGRESS_PARAMS_WQE(sq, pi) \
+	((struct mlx5e_set_nvmeotcp_progress_params_wqe *)\
+	 mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_set_nvmeotcp_progress_params_wqe)))
+
+#define MLX5E_NVMEOTCP_FETCH_KLM_WQE(sq, pi) \
+	((struct mlx5e_umr_wqe *)\
+	 mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_umr_wqe)))
+
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_PROGRESS_PARAMS 0x4
+
+void
+build_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue,
+			       struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe,
+			       u32 seq);
+
+void
+build_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue,
+			     struct mlx5e_set_nvmeotcp_static_params_wqe *wqe,
+			     u32 resync_seq,
+			     bool zerocopy, bool crc_rx);
+
+#endif /* __MLX5E_NVMEOTCP_UTILS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 158fc05f0c4c..d58826d93f3c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -47,6 +47,7 @@
 #include "en_accel/ipsec.h"
 #include "en_accel/en_accel.h"
 #include "en_accel/tls.h"
+#include "en_accel/nvmeotcp.h"
 #include "accel/ipsec.h"
 #include "accel/tls.h"
 #include "lib/vxlan.h"
@@ -2015,6 +2016,10 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->irq_desc = irq_to_desc(irq);
 	c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix);
 
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	INIT_LIST_HEAD(&c->list_nvmeotcpsq);
+	spin_lock_init(&c->nvmeotcp_icosq_lock);
+#endif
 	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
 
 	err = mlx5e_open_queues(c, params, cparam);
@@ -2247,7 +2252,8 @@ static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv,
 	void *cqc = param->cqc;
 
 	MLX5_SET(cqc, cqc, uar_page, priv->mdev->priv.uar->index);
-	if (MLX5_CAP_GEN(priv->mdev, cqe_128_always) && cache_line_size() >= 128)
+	if (MLX5_CAP_GEN(priv->mdev, cqe_128_always) &&
+	    (cache_line_size() >= 128 || param->force_cqe128))
 		MLX5_SET(cqc, cqc, cqe_sz, CQE_STRIDE_128_PAD);
 }
 
@@ -2261,6 +2267,11 @@ void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
 	void *cqc = param->cqc;
 	u8 log_cq_size;
 
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	/* nvme-tcp offload mandates 128 byte cqes */
+	param->force_cqe128 |= (priv->nvmeotcp->enable || priv->nvmeotcp->crc_rx_enable);
+#endif
+
 	switch (params->rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params, xsk) +
@@ -3957,6 +3968,10 @@ int mlx5e_set_features(struct net_device *netdev, netdev_features_t features)
 	err |= MLX5E_HANDLE_FEATURE(NETIF_F_NTUPLE, set_feature_arfs);
 #endif
 	err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TLS_RX, mlx5e_ktls_set_feature_rx);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TCP_DDP, set_feature_nvme_tcp);
+	err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TCP_DDP_CRC_RX, set_feature_nvme_tcp_crc);
+#endif
 
 	if (err) {
 		netdev->features = oper_features;
@@ -3993,6 +4008,23 @@ static netdev_features_t mlx5e_fix_features(struct net_device *netdev,
 		features &= ~NETIF_F_RXHASH;
 		if (netdev->features & NETIF_F_RXHASH)
 			netdev_warn(netdev, "Disabling rxhash, not supported when CQE compress is active\n");
+
+		features &= ~NETIF_F_HW_TCP_DDP;
+		if (netdev->features & NETIF_F_HW_TCP_DDP)
+			netdev_warn(netdev, "Disabling tcp-ddp offload, not supported when CQE compress is active\n");
+
+		features &= ~NETIF_F_HW_TCP_DDP_CRC_RX;
+		if (netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX)
+			netdev_warn(netdev, "Disabling tcp-ddp-crc-rx offload, not supported when CQE compression is active\n");
+	}
+
+	if (netdev->features & NETIF_F_LRO) {
+		features &= ~NETIF_F_HW_TCP_DDP;
+		if (netdev->features & NETIF_F_HW_TCP_DDP)
+			netdev_warn(netdev, "Disabling tcp-ddp offload, not supported when LRO is active\n");
+		features &= ~NETIF_F_HW_TCP_DDP_CRC_RX;
+		if (netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX)
+			netdev_warn(netdev, "Disabling tcp-ddp-crc-rx offload, not supported when LRO is active\n");
 	}
 
 	mutex_unlock(&priv->state_lock);
@@ -5064,6 +5096,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	mlx5e_set_netdev_dev_addr(netdev);
 	mlx5e_ipsec_build_netdev(priv);
 	mlx5e_tls_build_netdev(priv);
+	mlx5e_nvmeotcp_build_netdev(priv);
 }
 
 void mlx5e_create_q_counters(struct mlx5e_priv *priv)
@@ -5128,6 +5161,9 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
 	err = mlx5e_tls_init(priv);
 	if (err)
 		mlx5_core_err(mdev, "TLS initialization failed, %d\n", err);
+	err = mlx5e_nvmeotcp_init(priv);
+	if (err)
+		mlx5_core_err(mdev, "NVMEoTCP initialization failed, %d\n", err);
 	mlx5e_build_nic_netdev(netdev);
 	err = mlx5e_devlink_port_register(priv);
 	if (err)
@@ -5141,6 +5177,7 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
 	mlx5e_health_destroy_reporters(priv);
 	mlx5e_devlink_port_unregister(priv);
+	mlx5e_nvmeotcp_cleanup(priv);
 	mlx5e_tls_cleanup(priv);
 	mlx5e_ipsec_cleanup(priv);
 	mlx5e_netdev_cleanup(priv->netdev, priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 377e547840f3..598d62366af2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -47,6 +47,7 @@
 #include "fpga/ipsec.h"
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
+#include "en_accel/nvmeotcp.h"
 #include "lib/clock.h"
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
@@ -617,16 +618,26 @@ void mlx5e_free_icosq_descs(struct mlx5e_icosq *sq)
 		ci = mlx5_wq_cyc_ctr2ix(&sq->wq, sqcc);
 		wi = &sq->db.wqe_info[ci];
 		sqcc += wi->num_wqebbs;
-#ifdef CONFIG_MLX5_EN_TLS
 		switch (wi->wqe_type) {
+#ifdef CONFIG_MLX5_EN_TLS
 		case MLX5E_ICOSQ_WQE_SET_PSV_TLS:
 			mlx5e_ktls_handle_ctx_completion(wi);
 			break;
 		case MLX5E_ICOSQ_WQE_GET_PSV_TLS:
 			mlx5e_ktls_handle_get_psv_completion(wi, sq);
 			break;
-		}
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+		case MLX5E_ICOSQ_WQE_UMR_NVME_TCP:
+			break;
+		case MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE:
+			mlx5e_nvmeotcp_ddp_inv_done(wi);
+			break;
+		case MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP:
+			mlx5e_nvmeotcp_ctx_comp(wi);
+			break;
+#endif
+		}
 	}
 	sq->cc = sqcc;
 }
@@ -695,6 +706,16 @@ int mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 			case MLX5E_ICOSQ_WQE_GET_PSV_TLS:
 				mlx5e_ktls_handle_get_psv_completion(wi, sq);
 				break;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+			case MLX5E_ICOSQ_WQE_UMR_NVME_TCP:
+				break;
+			case MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE:
+				mlx5e_nvmeotcp_ddp_inv_done(wi);
+				break;
+			case MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP:
+				mlx5e_nvmeotcp_ctx_comp(wi);
+				break;
 #endif
 			default:
 				netdev_WARN_ONCE(cq->netdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 1ec3d62f026d..cd89d4dd2710 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -36,6 +36,7 @@
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
 #include "en/xsk/tx.h"
+#include "en_accel/nvmeotcp.h"
 
 static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c)
 {
@@ -158,6 +159,15 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 		 * queueing more WQEs and overflowing the async ICOSQ.
 		 */
 		clear_bit(MLX5E_SQ_STATE_PENDING_XSK_TX, &c->async_icosq.state);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	struct list_head *cur;
+	struct mlx5e_nvmeotcp_sq *nvmeotcp_sq;
+
+	list_for_each(cur, &c->list_nvmeotcpsq) {
+		nvmeotcp_sq = list_entry(cur, struct mlx5e_nvmeotcp_sq, list);
+		mlx5e_poll_ico_cq(&nvmeotcp_sq->icosq.cq);
+	}
+#endif
 
 	busy |= INDIRECT_CALL_2(rq->post_wqes,
 				mlx5e_post_rx_mpwqes,
@@ -196,6 +206,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	mlx5e_cq_arm(&rq->cq);
 	mlx5e_cq_arm(&c->icosq.cq);
 	mlx5e_cq_arm(&c->async_icosq.cq);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	list_for_each(cur, &c->list_nvmeotcpsq) {
+		nvmeotcp_sq = list_entry(cur, struct mlx5e_nvmeotcp_sq, list);
+		mlx5e_cq_arm(&nvmeotcp_sq->icosq.cq);
+	}
+#endif
 	mlx5e_cq_arm(&c->xdpsq.cq);
 
 	if (xsk_open) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index 02558ac2ace6..5e7544ccae91 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -256,6 +256,12 @@ int mlx5_query_hca_caps(struct mlx5_core_dev *dev)
 			return err;
 	}
 
+	if (MLX5_CAP_GEN(dev, nvmeotcp)) {
+		err = mlx5_core_get_caps(dev, MLX5_CAP_DEV_NVMEOTCP);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (11 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 12/15] net/mlx5e: NVMEoTCP DDP offload control path Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-18  0:57   ` David Ahern
  2020-12-07 21:06 ` [PATCH v1 net-next 14/15] net/mlx5e: NVMEoTCP statistics Boris Pismenny
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

From: Ben Ben-ishay <benishay@nvidia.com>

NVMEoTCP direct data placement constructs the SKB for each CQE such that
its fragments point directly at the NVMe destination buffers.

This enables the offload, as the NVMe-TCP layer will skip the copy when
src == dst.
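
For illustration, a minimal sketch of the copy-skip idea (not code from this
series; the helper name and the kmap calls are only for the example):

	#include <linux/highmem.h>
	#include <linux/string.h>

	/*
	 * Illustrative only: when the NIC has already placed the payload in
	 * the destination page, source and destination alias the same memory
	 * and the memcpy can be skipped. ddp_copy_to_page() is a hypothetical
	 * name, not a helper added by this series.
	 */
	static void ddp_copy_to_page(struct page *page, size_t offset,
				     const char *src, size_t len)
	{
		char *dst = kmap_atomic(page);

		if (dst + offset != src)	/* payload not already placed */
			memcpy(dst + offset, src, len);
		kunmap_atomic(dst);
	}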

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   1 +
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        | 240 ++++++++++++++++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |  26 ++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  51 +++-
 7 files changed, 315 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 053655a96db8..c7735e2d938a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -88,4 +88,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
 
-mlx5_core-$(CONFIG_MLX5_EN_NVMEOTCP) += en_accel/fs_tcp.o en_accel/nvmeotcp.o
+mlx5_core-$(CONFIG_MLX5_EN_NVMEOTCP) += en_accel/fs_tcp.o en_accel/nvmeotcp.o en_accel/nvmeotcp_rxtx.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 8e257749018a..4f617e663361 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -573,6 +573,7 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+			       struct mlx5_cqe64 *cqe,
 			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 8e7b877d8a12..9a6fbd1b1c34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -25,6 +25,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, void *data,
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index 7f88ccf67fdd..112c5b3ec165 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -11,6 +11,7 @@
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
new file mode 100644
index 000000000000..be5111b66cc9
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+
+#include "en_accel/nvmeotcp_rxtx.h"
+#include "en_accel/nvmeotcp.h"
+#include <linux/mlx5/mlx5_ifc.h>
+
+#define	MLX5E_TC_FLOW_ID_MASK  0x00ffffff
+static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue,
+				   struct mlx5e_cqe128 *cqe128)
+{
+	const struct tcp_ddp_ulp_ops *ulp_ops;
+	u32 seq;
+
+	seq = be32_to_cpu(cqe128->resync_tcp_sn);
+	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;
+	if (ulp_ops && ulp_ops->resync_request)
+		ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ);
+}
+
+static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue *queue)
+{
+	struct nvmeotcp_queue_entry *nqe = &queue->ccid_table[queue->ccid];
+
+	queue->ccoff += nqe->sgl[queue->ccsglidx].length;
+	queue->ccoff_inner = 0;
+	queue->ccsglidx++;
+}
+
+static inline void
+mlx5e_nvmeotcp_add_skb_frag(struct net_device *netdev, struct sk_buff *skb,
+			    struct mlx5e_nvmeotcp_queue *queue,
+			    struct nvmeotcp_queue_entry *nqe, u32 fragsz)
+{
+	dma_sync_single_for_cpu(&netdev->dev,
+				nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,
+				fragsz, DMA_FROM_DEVICE);
+	page_ref_inc(compound_head(sg_page(&nqe->sgl[queue->ccsglidx])));
+	/* XXX: consider reducing the truesize, as no new memory is consumed */
+	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+			sg_page(&nqe->sgl[queue->ccsglidx]),
+			nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,
+			fragsz,
+			fragsz);
+}
+
+int mlx5_nvmeotcp_get_headlen(struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+{
+	struct mlx5e_cqe128 *cqe128;
+
+	if (!cqe_is_nvmeotcp_zc(cqe) || cqe_is_nvmeotcp_resync(cqe))
+		return cqe_bcnt;
+
+	cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
+	return be16_to_cpu(cqe128->hlen);
+}
+
+static struct sk_buff*
+mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue,
+				 struct sk_buff *skb, skb_frag_t *org_frags,
+				 int org_nr_frags, int frag_index)
+{
+	struct mlx5e_priv *priv = queue->priv;
+
+	while (org_nr_frags != frag_index) {
+		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
+			dev_kfree_skb_any(skb);
+			return NULL;
+		}
+		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+				skb_frag_page(&org_frags[frag_index]),
+				skb_frag_off(&org_frags[frag_index]),
+				skb_frag_size(&org_frags[frag_index]),
+				skb_frag_size(&org_frags[frag_index]));
+		page_ref_inc(skb_frag_page(&org_frags[frag_index]));
+		frag_index++;
+	}
+	return skb;
+}
+
+static struct sk_buff*
+mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff *skb,
+		       int offset, int len)
+{
+	struct mlx5e_priv *priv = queue->priv;
+
+	if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
+		dev_kfree_skb_any(skb);
+		return NULL;
+	}
+	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+			virt_to_page(skb->data),
+			offset,
+			len,
+			len);
+	page_ref_inc(virt_to_page(skb->data));
+	return skb;
+}
+
+static void mlx5_nvmeotcp_trim_nonlinear(struct sk_buff *skb,
+					 skb_frag_t *org_frags,
+					 int *frag_index,
+					 int remaining)
+{
+	unsigned int frag_size;
+	int nr_frags;
+
+	/* skip @remaining bytes in frags */
+	*frag_index = 0;
+	while (remaining) {
+		frag_size = skb_frag_size(&skb_shinfo(skb)->frags[*frag_index]);
+		if (frag_size > remaining) {
+			skb_frag_off_add(&skb_shinfo(skb)->frags[*frag_index],
+					 remaining);
+			skb_frag_size_sub(&skb_shinfo(skb)->frags[*frag_index],
+					  remaining);
+			remaining = 0;
+		} else {
+			remaining -= frag_size;
+			skb_frag_unref(skb, *frag_index);
+			*frag_index += 1;
+		}
+	}
+
+	/* save original frags for the tail and unref */
+	nr_frags = skb_shinfo(skb)->nr_frags;
+	memcpy(&org_frags[*frag_index], &skb_shinfo(skb)->frags[*frag_index],
+	       (nr_frags - *frag_index) * sizeof(skb_frag_t));
+	while (--nr_frags >= *frag_index)
+		skb_frag_unref(skb, nr_frags);
+
+	/* remove frags from skb */
+	skb_shinfo(skb)->nr_frags = 0;
+	skb->len -= skb->data_len;
+	skb->truesize -= skb->data_len;
+	skb->data_len = 0;
+}
+
+struct sk_buff*
+mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
+			     bool linear)
+{
+	int ccoff, cclen, hlen, ccid, remaining, fragsz, to_copy = 0;
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	skb_frag_t org_frags[MAX_SKB_FRAGS];
+	struct mlx5e_nvmeotcp_queue *queue;
+	struct nvmeotcp_queue_entry *nqe;
+	int org_nr_frags, frag_index;
+	struct mlx5e_cqe128 *cqe128;
+	u32 queue_id;
+
+	queue_id = (be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK);
+	queue = mlx5e_nvmeotcp_get_queue(priv->nvmeotcp, queue_id);
+	if (unlikely(!queue)) {
+		dev_kfree_skb_any(skb);
+		return NULL;
+	}
+
+	cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
+	if (cqe_is_nvmeotcp_resync(cqe)) {
+		nvmeotcp_update_resync(queue, cqe128);
+		mlx5e_nvmeotcp_put_queue(queue);
+		return skb;
+	}
+
+	/* cc ddp from cqe */
+	ccid = be16_to_cpu(cqe128->ccid);
+	ccoff = be32_to_cpu(cqe128->ccoff);
+	cclen = be16_to_cpu(cqe128->cclen);
+	hlen  = be16_to_cpu(cqe128->hlen);
+
+	/* carve a hole in the skb for DDP data */
+	if (linear) {
+		skb_trim(skb, hlen);
+	} else {
+		org_nr_frags = skb_shinfo(skb)->nr_frags;
+		mlx5_nvmeotcp_trim_nonlinear(skb, org_frags, &frag_index,
+					     cclen);
+	}
+
+	nqe = &queue->ccid_table[ccid];
+
+	/* packet starts new ccid? */
+	if (queue->ccid != ccid || queue->ccid_gen != nqe->ccid_gen) {
+		queue->ccid = ccid;
+		queue->ccoff = 0;
+		queue->ccoff_inner = 0;
+		queue->ccsglidx = 0;
+		queue->ccid_gen = nqe->ccid_gen;
+	}
+
+	/* skip inside cc until the ccoff in the cqe */
+	while (queue->ccoff + queue->ccoff_inner < ccoff) {
+		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;
+		fragsz = min_t(off_t, remaining,
+			       ccoff - (queue->ccoff + queue->ccoff_inner));
+
+		if (fragsz == remaining)
+			mlx5e_nvmeotcp_advance_sgl_iter(queue);
+		else
+			queue->ccoff_inner += fragsz;
+	}
+
+	/* adjust the skb according to the cqe cc */
+	while (to_copy < cclen) {
+		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
+			dev_kfree_skb_any(skb);
+			mlx5e_nvmeotcp_put_queue(queue);
+			return NULL;
+		}
+
+		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;
+		fragsz = min_t(int, remaining, cclen - to_copy);
+
+		mlx5e_nvmeotcp_add_skb_frag(netdev, skb, queue, nqe, fragsz);
+		to_copy += fragsz;
+		if (fragsz == remaining)
+			mlx5e_nvmeotcp_advance_sgl_iter(queue);
+		else
+			queue->ccoff_inner += fragsz;
+	}
+
+	if (cqe_bcnt > hlen + cclen) {
+		remaining = cqe_bcnt - hlen - cclen;
+		if (linear)
+			skb = mlx5_nvmeotcp_add_tail(queue, skb,
+						     offset_in_page(skb->data) +
+								hlen + cclen,
+						     remaining);
+		else
+			skb = mlx5_nvmeotcp_add_tail_nonlinear(queue, skb,
+							       org_frags,
+							       org_nr_frags,
+							       frag_index);
+	}
+
+	mlx5e_nvmeotcp_put_queue(queue);
+	return skb;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
new file mode 100644
index 000000000000..bb2b074327ae
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies. */
+
+#ifndef __MLX5E_NVMEOTCP_RXTX_H__
+#define __MLX5E_NVMEOTCP_RXTX_H__
+
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+
+#include <linux/skbuff.h>
+#include "en.h"
+
+struct sk_buff*
+mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt, bool linear);
+
+int mlx5_nvmeotcp_get_headlen(struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
+#else
+static inline int mlx5_nvmeotcp_get_headlen(struct mlx5_cqe64 *cqe, u32 cqe_bcnt) { return cqe_bcnt; }
+static inline struct sk_buff *
+mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt, bool linear)
+{ return skb; }
+
+#endif /* CONFIG_MLX5_EN_NVMEOTCP */
+
+#endif /* __MLX5E_NVMEOTCP_RXTX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 598d62366af2..2688396d21f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -48,6 +48,7 @@
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
 #include "en_accel/nvmeotcp.h"
+#include "en_accel/nvmeotcp_rxtx.h"
 #include "lib/clock.h"
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
@@ -57,9 +58,11 @@
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				struct mlx5_cqe64 *cqe,
 				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				   struct mlx5_cqe64 *cqe,
 				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
 static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -1076,6 +1079,10 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	if (unlikely(mlx5_ipsec_is_rx_flow(cqe)))
 		mlx5e_ipsec_offload_handle_rx_skb(netdev, skb, cqe);
 
+#if defined(CONFIG_TCP_DDP_CRC) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	skb->ddp_crc = cqe_is_nvmeotcp_crcvalid(cqe);
+#endif
+
 	if (lro_num_seg > 1) {
 		mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
 		skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
@@ -1189,16 +1196,28 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	/* queue up for recycling/reuse */
 	page_ref_inc(di->page);
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, true);
+#endif
+
 	return skb;
 }
 
+static u16 mlx5e_get_headlen_hint(struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+{
+	return  min_t(u32, MLX5E_RX_MAX_HEAD,
+		      mlx5_nvmeotcp_get_headlen(cqe, cqe_bcnt));
+}
+
 static struct sk_buff *
 mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			     struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
+	u16 headlen = mlx5e_get_headlen_hint(cqe, cqe_bcnt);
 	struct mlx5e_wqe_frag_info *head_wi = wi;
-	u16 headlen      = min_t(u32, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	u16 frag_headlen = headlen;
 	u16 byte_cnt     = cqe_bcnt - headlen;
 	struct sk_buff *skb;
@@ -1207,7 +1226,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	 * might spread among multiple pages.
 	 */
 	skb = napi_alloc_skb(rq->cq.napi,
-			     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+			     ALIGN(headlen, sizeof(long)));
 	if (unlikely(!skb)) {
 		rq->stats->buff_alloc_err++;
 		return NULL;
@@ -1233,6 +1252,12 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	skb->tail += headlen;
 	skb->len  += headlen;
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, false);
+#endif
+
 	return skb;
 }
 
@@ -1387,7 +1412,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -1418,17 +1443,18 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_rep = {
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				   struct mlx5_cqe64 *cqe,
 				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
 {
-	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
+	u16 headlen = mlx5e_get_headlen_hint(cqe, cqe_bcnt);
 	u32 frag_offset    = head_offset + headlen;
 	u32 byte_cnt       = cqe_bcnt - headlen;
 	struct mlx5e_dma_info *head_di = di;
 	struct sk_buff *skb;
 
 	skb = napi_alloc_skb(rq->cq.napi,
-			     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+			     ALIGN(headlen, sizeof(long)));
 	if (unlikely(!skb)) {
 		rq->stats->buff_alloc_err++;
 		return NULL;
@@ -1459,11 +1485,18 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	skb->tail += headlen;
 	skb->len  += headlen;
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, false);
+#endif
+
 	return skb;
 }
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				struct mlx5_cqe64 *cqe,
 				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1505,6 +1538,12 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	/* queue up for recycling/reuse */
 	page_ref_inc(di->page);
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, true);
+#endif
+
 	return skb;
 }
 
@@ -1543,7 +1582,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v1 net-next 14/15] net/mlx5e: NVMEoTCP statistics
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (12 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2020-12-07 21:06 ` [PATCH v1 net-next 15/15] net/mlx5e: NVMEoTCP workaround CRC after resync Boris Pismenny
  2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

From: Ben Ben-ishay <benishay@nvidia.com>

NVMEoTCP offload statistics include both control- and data-path
counters: NDO operations, offloaded packets/bytes, dropped packets,
and resync operations.
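
For illustration only: the counters follow the existing mlx5e pattern, where
each RQ keeps its own nvmeotcp_* counters and the global rx_nvmeotcp_* values
are their sum across channels (the real fold lives in
mlx5e_stats_grp_sw_update_stats_rq_stats() below); a reduced sketch:

	#include "en_stats.h"	/* struct mlx5e_sw_stats / struct mlx5e_rq_stats */

	/* Illustrative sketch, not additional code in this patch. */
	static void example_fold_nvmeotcp_stats(struct mlx5e_sw_stats *s,
						const struct mlx5e_rq_stats *rq_stats)
	{
		s->rx_nvmeotcp_offload_packets += rq_stats->nvmeotcp_offload_packets;
		s->rx_nvmeotcp_offload_bytes   += rq_stats->nvmeotcp_offload_bytes;
		s->rx_nvmeotcp_drop            += rq_stats->nvmeotcp_drop;
		s->rx_nvmeotcp_resync          += rq_stats->nvmeotcp_resync;
	}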

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 17 +++++++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        | 16 ++++++++
 .../ethernet/mellanox/mlx5/core/en_stats.c    | 37 +++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/en_stats.h    | 24 ++++++++++++
 4 files changed, 94 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
index 843e653699e9..756decf53930 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
@@ -651,6 +651,7 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev,
 	struct mlx5_core_dev *mdev = priv->mdev;
 	struct mlx5e_nvmeotcp_queue *queue;
 	int max_wqe_sz_cap, queue_id, err;
+	struct mlx5e_rq_stats *stats;
 
 	if (tconfig->type != TCP_DDP_NVME) {
 		err = -EOPNOTSUPP;
@@ -700,6 +701,8 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev,
 	if (err)
 		goto destroy_rx;
 
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_queue_init++;
 	write_lock_bh(&sk->sk_callback_lock);
 	rcu_assign_pointer(inet_csk(sk)->icsk_ulp_ddp_data, queue);
 	write_unlock_bh(&sk->sk_callback_lock);
@@ -714,6 +717,7 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev,
 free_queue:
 	kfree(queue);
 out:
+	stats->nvmeotcp_queue_init_fail++;
 	return err;
 }
 
@@ -724,11 +728,15 @@ mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev,
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct mlx5_core_dev *mdev = priv->mdev;
 	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5e_rq_stats *stats;
 
 	queue = (struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
 
 	napi_synchronize(&priv->channels.c[queue->channel_ix]->napi);
 
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_queue_teardown++;
+
 	WARN_ON(refcount_read(&queue->ref_count) != 1);
 	if (queue->zerocopy || queue->crc_rx)
 		mlx5e_nvmeotcp_destroy_rx(queue, mdev, queue->zerocopy);
@@ -750,6 +758,7 @@ mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev,
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct scatterlist *sg = ddp->sg_table.sgl;
 	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5e_rq_stats *stats;
 	struct mlx5_core_dev *mdev;
 	int count = 0;
 
@@ -767,6 +776,11 @@ mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev,
 	queue->ccid_table[ddp->command_id].ccid_gen++;
 	queue->ccid_table[ddp->command_id].sgl_length = count;
 
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_ddp_setup++;
+	if (unlikely(mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, count)))
+		stats->nvmeotcp_ddp_setup_fail++;
+
 	return 0;
 }
 
@@ -808,6 +822,7 @@ mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev,
 		(struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct nvmeotcp_queue_entry *q_entry;
+	struct mlx5e_rq_stats *stats;
 
 	q_entry  = &queue->ccid_table[ddp->command_id];
 	WARN_ON(q_entry->sgl_length == 0);
@@ -816,6 +831,8 @@ mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev,
 	q_entry->queue = queue;
 
 	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, 0);
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_ddp_teardown++;
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
index be5111b66cc9..298558ae2dcd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
@@ -10,12 +10,16 @@ static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue,
 				   struct mlx5e_cqe128 *cqe128)
 {
 	const struct tcp_ddp_ulp_ops *ulp_ops;
+	struct mlx5e_rq_stats *stats;
 	u32 seq;
 
 	seq = be32_to_cpu(cqe128->resync_tcp_sn);
 	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;
 	if (ulp_ops && ulp_ops->resync_request)
 		ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ);
+
+	stats = queue->priv->channels.c[queue->channel_ix]->rq.stats;
+	stats->nvmeotcp_resync++;
 }
 
 static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue *queue)
@@ -61,10 +65,13 @@ mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue,
 				 int org_nr_frags, int frag_index)
 {
 	struct mlx5e_priv *priv = queue->priv;
+	struct mlx5e_rq_stats *stats;
 
 	while (org_nr_frags != frag_index) {
 		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
 			dev_kfree_skb_any(skb);
+			stats = priv->channels.c[queue->channel_ix]->rq.stats;
+			stats->nvmeotcp_drop++;
 			return NULL;
 		}
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
@@ -83,9 +90,12 @@ mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff *skb,
 		       int offset, int len)
 {
 	struct mlx5e_priv *priv = queue->priv;
+	struct mlx5e_rq_stats *stats;
 
 	if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
 		dev_kfree_skb_any(skb);
+		stats = priv->channels.c[queue->channel_ix]->rq.stats;
+		stats->nvmeotcp_drop++;
 		return NULL;
 	}
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
@@ -146,6 +156,7 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
 	skb_frag_t org_frags[MAX_SKB_FRAGS];
 	struct mlx5e_nvmeotcp_queue *queue;
 	struct nvmeotcp_queue_entry *nqe;
+	struct mlx5e_rq_stats *stats;
 	int org_nr_frags, frag_index;
 	struct mlx5e_cqe128 *cqe128;
 	u32 queue_id;
@@ -164,6 +175,8 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
 		return skb;
 	}
 
+	stats = priv->channels.c[queue->channel_ix]->rq.stats;
+
 	/* cc ddp from cqe */
 	ccid = be16_to_cpu(cqe128->ccid);
 	ccoff = be32_to_cpu(cqe128->ccoff);
@@ -206,6 +219,7 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
 	while (to_copy < cclen) {
 		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
 			dev_kfree_skb_any(skb);
+			stats->nvmeotcp_drop++;
 			mlx5e_nvmeotcp_put_queue(queue);
 			return NULL;
 		}
@@ -235,6 +249,8 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
 							       frag_index);
 	}
 
+	stats->nvmeotcp_offload_packets++;
+	stats->nvmeotcp_offload_bytes += cclen;
 	mlx5e_nvmeotcp_put_queue(queue);
 	return skb;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 2cf2042b37c7..ca7d2cb5099f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -34,6 +34,7 @@
 #include "en.h"
 #include "en_accel/tls.h"
 #include "en_accel/en_accel.h"
+#include "en_accel/nvmeotcp.h"
 
 static unsigned int stats_grps_num(struct mlx5e_priv *priv)
 {
@@ -189,6 +190,18 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_resync_res_ok) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_resync_res_skip) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_err) },
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_init) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_init_fail) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_teardown) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup_fail) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_teardown) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_drop) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_resync) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_offload_packets) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_offload_bytes) },
 #endif
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_events) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_poll) },
@@ -352,6 +365,18 @@ static void mlx5e_stats_grp_sw_update_stats_rq_stats(struct mlx5e_sw_stats *s,
 	s->rx_tls_resync_res_skip     += rq_stats->tls_resync_res_skip;
 	s->rx_tls_err                 += rq_stats->tls_err;
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	s->rx_nvmeotcp_queue_init      += rq_stats->nvmeotcp_queue_init;
+	s->rx_nvmeotcp_queue_init_fail += rq_stats->nvmeotcp_queue_init_fail;
+	s->rx_nvmeotcp_queue_teardown  += rq_stats->nvmeotcp_queue_teardown;
+	s->rx_nvmeotcp_ddp_setup       += rq_stats->nvmeotcp_ddp_setup;
+	s->rx_nvmeotcp_ddp_setup_fail  += rq_stats->nvmeotcp_ddp_setup_fail;
+	s->rx_nvmeotcp_ddp_teardown    += rq_stats->nvmeotcp_ddp_teardown;
+	s->rx_nvmeotcp_drop            += rq_stats->nvmeotcp_drop;
+	s->rx_nvmeotcp_resync          += rq_stats->nvmeotcp_resync;
+	s->rx_nvmeotcp_offload_packets += rq_stats->nvmeotcp_offload_packets;
+	s->rx_nvmeotcp_offload_bytes   += rq_stats->nvmeotcp_offload_bytes;
+#endif
 }
 
 static void mlx5e_stats_grp_sw_update_stats_ch_stats(struct mlx5e_sw_stats *s,
@@ -1612,6 +1637,18 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_resync_res_skip) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_err) },
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_init) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_init_fail) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_teardown) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup_fail) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_teardown) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_drop) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_resync) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_offload_packets) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_offload_bytes) },
+#endif
 };
 
 static const struct counter_desc sq_stats_desc[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index e41fc11f2ce7..a5cf8ec4295b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -179,6 +179,18 @@ struct mlx5e_sw_stats {
 	u64 rx_congst_umr;
 	u64 rx_arfs_err;
 	u64 rx_recover;
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	u64 rx_nvmeotcp_queue_init;
+	u64 rx_nvmeotcp_queue_init_fail;
+	u64 rx_nvmeotcp_queue_teardown;
+	u64 rx_nvmeotcp_ddp_setup;
+	u64 rx_nvmeotcp_ddp_setup_fail;
+	u64 rx_nvmeotcp_ddp_teardown;
+	u64 rx_nvmeotcp_drop;
+	u64 rx_nvmeotcp_resync;
+	u64 rx_nvmeotcp_offload_packets;
+	u64 rx_nvmeotcp_offload_bytes;
+#endif
 	u64 ch_events;
 	u64 ch_poll;
 	u64 ch_arm;
@@ -342,6 +354,18 @@ struct mlx5e_rq_stats {
 	u64 tls_resync_res_skip;
 	u64 tls_err;
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	u64 nvmeotcp_queue_init;
+	u64 nvmeotcp_queue_init_fail;
+	u64 nvmeotcp_queue_teardown;
+	u64 nvmeotcp_ddp_setup;
+	u64 nvmeotcp_ddp_setup_fail;
+	u64 nvmeotcp_ddp_teardown;
+	u64 nvmeotcp_drop;
+	u64 nvmeotcp_resync;
+	u64 nvmeotcp_offload_packets;
+	u64 nvmeotcp_offload_bytes;
+#endif
 };
 
 struct mlx5e_sq_stats {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v1 net-next 15/15] net/mlx5e: NVMEoTCP workaround CRC after resync
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (13 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 14/15] net/mlx5e: NVMEoTCP statistics Boris Pismenny
@ 2020-12-07 21:06 ` Boris Pismenny
  2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
  15 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-07 21:06 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz

From: Yoray Zack <yorayz@nvidia.com>

The nvme-tcp CRC computed over the first packet after a resync may provide
the wrong signal when the packet contains multiple PDUs. Work around
this by ignoring the cqe->nvmeotcp_crc signal for the first packet after
a resync.

Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c         |  1 +
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h         |  3 +++
 .../mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c    | 14 ++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 12 ++++--------
 include/linux/mlx5/device.h                        |  4 ++--
 5 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
index 756decf53930..e9f7f8b17c92 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
@@ -844,6 +844,7 @@ mlx5e_nvmeotcp_dev_resync(struct net_device *netdev,
 	struct mlx5e_nvmeotcp_queue *queue =
 				(struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
 
+	queue->after_resync_cqe = 1;
 	mlx5e_nvmeotcp_rx_post_static_params_wqe(queue, seq);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
index 5be300d8299e..a309971e11b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
@@ -50,6 +50,7 @@ struct mlx5e_nvmeotcp_sq {
  *	@ccoff_inner: Current offset within the @ccsglidx element
  *	@priv: mlx5e netdev priv
 *	@done: Completion for HW context operations (flow steering flow)
+ *	@after_resync_cqe: indicate if resync occurred
  */
 struct mlx5e_nvmeotcp_queue {
 	struct tcp_ddp_ctx		tcp_ddp_ctx;
@@ -82,6 +83,8 @@ struct mlx5e_nvmeotcp_queue {
 
 	/* for flow_steering flow */
 	struct completion		done;
+	/* for MASK HW resync cqe */
+	bool				after_resync_cqe;
 };
 
 struct mlx5e_nvmeotcp {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
index 298558ae2dcd..4b813de592be 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
@@ -175,6 +175,20 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
 		return skb;
 	}
 
+#ifdef CONFIG_TCP_DDP_CRC
+	/* If a resync occurred in the previous cqe,
+	 * the current cqe.crcvalid bit may not be valid,
+	 * so we will treat it as 0
+	 */
+	skb->ddp_crc = queue->after_resync_cqe ? 0 :
+		cqe_is_nvmeotcp_crcvalid(cqe);
+	queue->after_resync_cqe = 0;
+#endif
+	if (!cqe_is_nvmeotcp_zc(cqe)) {
+		mlx5e_nvmeotcp_put_queue(queue);
+		return skb;
+	}
+
 	stats = priv->channels.c[queue->channel_ix]->rq.stats;
 
 	/* cc ddp from cqe */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2688396d21f8..960aee0d5f0c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1079,10 +1079,6 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	if (unlikely(mlx5_ipsec_is_rx_flow(cqe)))
 		mlx5e_ipsec_offload_handle_rx_skb(netdev, skb, cqe);
 
-#if defined(CONFIG_TCP_DDP_CRC) && defined(CONFIG_MLX5_EN_NVMEOTCP)
-	skb->ddp_crc = cqe_is_nvmeotcp_crcvalid(cqe);
-#endif
-
 	if (lro_num_seg > 1) {
 		mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
 		skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
@@ -1197,7 +1193,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	page_ref_inc(di->page);
 
 #if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
-	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+	if (cqe_is_nvmeotcp(cqe))
 		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
 						   cqe_bcnt, true);
 #endif
@@ -1253,7 +1249,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	skb->len  += headlen;
 
 #if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
-	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+	if (cqe_is_nvmeotcp(cqe))
 		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
 						   cqe_bcnt, false);
 #endif
@@ -1486,7 +1482,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	skb->len  += headlen;
 
 #if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
-	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+	if (cqe_is_nvmeotcp(cqe))
 		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
 						   cqe_bcnt, false);
 #endif
@@ -1539,7 +1535,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	page_ref_inc(di->page);
 
 #if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
-	if (cqe_is_nvmeotcp_zc_or_resync(cqe))
+	if (cqe_is_nvmeotcp(cqe))
 		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
 						   cqe_bcnt, true);
 #endif
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index ea4d158e8329..ae879576e371 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -882,9 +882,9 @@ static inline bool cqe_is_nvmeotcp_zc(struct mlx5_cqe64 *cqe)
 	return ((cqe->nvmetcp >> 4) & 0x1);
 }
 
-static inline bool cqe_is_nvmeotcp_zc_or_resync(struct mlx5_cqe64 *cqe)
+static inline bool cqe_is_nvmeotcp(struct mlx5_cqe64 *cqe)
 {
-	return ((cqe->nvmetcp >> 4) & 0x5);
+	return ((cqe->nvmetcp >> 4) & 0x7);
 }
 
 static inline u8 mlx5_get_cqe_format(struct mlx5_cqe64 *cqe)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst
  2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
@ 2020-12-08  0:39   ` David Ahern
  2020-12-08 14:30     ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-08  0:39 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

On 12/7/20 2:06 PM, Boris Pismenny wrote:
> When using direct data placement the NIC writes some of the payload
> directly to the destination buffer, and constructs the SKB such that it
> points to this data. As a result, the skb_copy_datagram_iter call will
> attempt to copy data when it is not necessary.
> 
> This patch adds a check to avoid this copy, and a static_key to enable
> it when TCP direct data placement is possible.
> 

Why not mark the skb as ZEROCOPY -- an Rx version of the existing
SKBTX_DEV_ZEROCOPY and skb_shared_info->tx_flags? Use that as a generic
way of indicating to the stack what is happening.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
@ 2020-12-08  0:42   ` David Ahern
  2020-12-08 14:36     ` Boris Pismenny
  2020-12-09  0:57   ` David Ahern
  1 sibling, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-08  0:42 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

On 12/7/20 2:06 PM, Boris Pismenny wrote:
> This commit introduces direct data placement offload for TCP.
> This capability is accompanied by new net_device operations that
> configure
> hardware contexts. There is a context per socket, and a context per DDP
> operation. Additionally, a resynchronization routine is used to assist
> the hardware in handling TCP OOO and continuing the offload.
> Furthermore, we let the offloading driver advertise what is the max hw
> sectors/segments.
> 
> Using this interface, the NIC hardware will scatter TCP payload directly
> to the BIO pages according to the command_id.
> To maintain the correctness of the network stack, the driver is expected
> to construct SKBs that point to the BIO pages.
> 
> Thus, the SKB represents the data on the wire, while it is pointing
> to data that is already placed in the destination buffer.
> As a result, data from page frags should not be copied out to
> the linear part.
> 
> As SKBs that use DDP are already very memory efficient, we modify
> skb_condense to avoid copying data from fragments to the linear
> part of SKBs that belong to a socket that uses DDP offload.
> 
> A follow-up patch will use this interface for DDP in NVMe-TCP.
> 

You call this Direct Data Placement - which sounds like a marketing name.

Fundamentally, this starts with offloading TCP socket buffers for a
specific flow, so generically a TCP Rx zerocopy for kernel stack managed
sockets (as opposed to AF_XDP's zerocopy). Why is this not building in
that level of infrastructure first and adding ULPs like NVME on top?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst
  2020-12-08  0:39   ` David Ahern
@ 2020-12-08 14:30     ` Boris Pismenny
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-08 14:30 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, borisp


On 08/12/2020 2:39, David Ahern wrote:
> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>> When using direct data placement the NIC writes some of the payload
>> directly to the destination buffer, and constructs the SKB such that it
>> points to this data. As a result, the skb_copy_datagram_iter call will
>> attempt to copy data when it is not necessary.
>>
>> This patch adds a check to avoid this copy, and a static_key to enable
>> it when TCP direct data placement is possible.
>>
> Why not mark the skb as ZEROCOPY -- an Rx version of the existing
> SKBTX_DEV_ZEROCOPY and skb_shared_info->tx_flags? Use that as a generic
> way of indicating to the stack what is happening.
>
>

[Re-sending as the previous one didn't hit the mailing list]

Interesting idea. But, unlike SKBTX_DEV_ZEROCOPY this SKB can be inspected/modified by the stack without the need to copy things out. Additionally, the SKB may contain both data that is already placed in its final destination buffer (PDU data) and data that isn't (PDU header); it doesn't matter. Therefore, labeling the entire SKB as zerocopy doesn't convey the desired information. Moreover, skipping copies in the stack to receive zerocopy SKBs will require more invasive changes.

Our goal in this approach was to provide the smallest change that enables the desired functionality while preserving the performance of existing flows that do not care for it. An alternative approach we considered, which doesn't affect existing flows at all, was to make a special version of memcpy_to_page to be used by DDP providers (nvme-tcp). This alternative would require creating corresponding special versions for users of this function, such as skb_copy_datagram_iter. That is more invasive, so in this patchset we decided to avoid it.
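For reference, a minimal sketch of the copy-skip check under discussion; the static_key name is illustrative and the code is not the exact patch:

DEFINE_STATIC_KEY_FALSE(skip_copy_enabled);

/* lib/iov_iter.c (sketch): skip the memcpy when the NIC has already
 * placed the payload at its final destination, i.e. src == dst.
 * The static_key keeps the extra compare out of the fast path unless
 * a DDP provider is active.
 */
static void memcpy_to_page(struct page *page, size_t offset,
			   const char *from, size_t len)
{
	char *to = kmap_atomic(page);

	if (!static_branch_unlikely(&skip_copy_enabled) || to + offset != from)
		memcpy(to + offset, from, len);

	kunmap_atomic(to);
}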

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-08  0:42   ` David Ahern
@ 2020-12-08 14:36     ` Boris Pismenny
  2020-12-09  0:38       ` David Ahern
  0 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-08 14:36 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 08/12/2020 2:42, David Ahern wrote:
> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>> This commit introduces direct data placement offload for TCP.
>> This capability is accompanied by new net_device operations that
>> configure
>> hardware contexts. There is a context per socket, and a context per DDP
>> operation. Additionally, a resynchronization routine is used to assist
>> the hardware in handling TCP OOO and continuing the offload.
>> Furthermore, we let the offloading driver advertise what is the max hw
>> sectors/segments.
>>
>> Using this interface, the NIC hardware will scatter TCP payload directly
>> to the BIO pages according to the command_id.
>> To maintain the correctness of the network stack, the driver is expected
>> to construct SKBs that point to the BIO pages.
>>
>> Thus, the SKB represents the data on the wire, while it is pointing
>> to data that is already placed in the destination buffer.
>> As a result, data from page frags should not be copied out to
>> the linear part.
>>
>> As SKBs that use DDP are already very memory efficient, we modify
>> skb_condense to avoid copying data from fragments to the linear
>> part of SKBs that belong to a socket that uses DDP offload.
>>
>> A follow-up patch will use this interface for DDP in NVMe-TCP.
>>
> 
> You call this Direct Data Placement - which sounds like a marketing name.
> 

[Re-sending as the previous one didn't hit the mailing list. Sorry for the spam]

Interesting idea. But, unlike SKBTX_DEV_ZEROCOPY this SKB can be inspected/modified by the stack without the need to copy things out. Additionally, the SKB may contain both data that is already placed in its final destination buffer (PDU data) and data that isn't (PDU header); it doesn't matter. Therefore, labeling the entire SKB as zerocopy doesn't convey the desired information. Moreover, skipping copies in the stack to receive zerocopy SKBs will require more invasive changes.

Our goal in this approach was to provide the smallest change that enables the desired functionality while preserving the performance of existing flows that do not care for it. An alternative approach we considered, which doesn't affect existing flows at all, was to make a special version of memcpy_to_page to be used by DDP providers (nvme-tcp). This alternative would require creating corresponding special versions for users of this function, such as skb_copy_datagram_iter. That is more invasive, so in this patchset we decided to avoid it.

> Fundamentally, this starts with offloading TCP socket buffers for a
> specific flow, so generically a TCP Rx zerocopy for kernel stack managed
> sockets (as opposed to AF_XDP's zerocopy). Why is this not building in
> that level of infrastructure first and adding ULPs like NVME on top?
> 

We aren't using AF_XDP or any of the Rx zerocopy infrastructure, because it is unsuitable for data placement for nvme-tcp, which reorders responses relative to requests for efficiency and requires that data reside in specific destination buffers.
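To make the "specific destination buffers" point concrete, the intended call from nvme-tcp looks roughly like the sketch below, written against the tcp_ddp_dev_ops proposed in patch 02/15; the wrapper function itself is illustrative, not existing code:

/* Before sending a read request, describe the block layer buffers for
 * this command_id to the NIC, so the response payload can be written
 * there directly even if responses arrive out of order w.r.t. requests.
 */
static int nvme_tcp_setup_read_ddp(struct net_device *netdev, struct sock *sk,
				   u32 command_id, struct request *rq,
				   struct tcp_ddp_io *ddp)
{
	ddp->command_id = command_id;
	ddp->sg_table.sgl = ddp->first_sgl;
	if (sg_alloc_table_chained(&ddp->sg_table, blk_rq_nr_phys_segments(rq),
				   ddp->sg_table.sgl, SG_CHUNK_SIZE))
		return -ENOMEM;
	ddp->nents = blk_rq_map_sg(rq->q, rq, ddp->sg_table.sgl);

	return netdev->tcp_ddp_ops->tcp_ddp_setup(netdev, sk, ddp);
}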



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-08 14:36     ` Boris Pismenny
@ 2020-12-09  0:38       ` David Ahern
  2020-12-09  8:15         ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-09  0:38 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 12/8/20 7:36 AM, Boris Pismenny wrote:
> On 08/12/2020 2:42, David Ahern wrote:
>> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>>> This commit introduces direct data placement offload for TCP.
>>> This capability is accompanied by new net_device operations that
>>> configure
>>> hardware contexts. There is a context per socket, and a context per DDP
>>> operation. Additionally, a resynchronization routine is used to assist
>>> the hardware in handling TCP OOO and continuing the offload.
>>> Furthermore, we let the offloading driver advertise what is the max hw
>>> sectors/segments.
>>>
>>> Using this interface, the NIC hardware will scatter TCP payload directly
>>> to the BIO pages according to the command_id.
>>> To maintain the correctness of the network stack, the driver is expected
>>> to construct SKBs that point to the BIO pages.
>>>
>>> Thus, the SKB represents the data on the wire, while it is pointing
>>> to data that is already placed in the destination buffer.
>>> As a result, data from page frags should not be copied out to
>>> the linear part.
>>>
>>> As SKBs that use DDP are already very memory efficient, we modify
>>> skb_condense to avoid copying data from fragments to the linear
>>> part of SKBs that belong to a socket that uses DDP offload.
>>>
>>> A follow-up patch will use this interface for DDP in NVMe-TCP.
>>>
>>
>> You call this Direct Data Placement - which sounds like a marketing name.
>>
> 
> [Re-sending as the previous one didn't hit the mailing list. Sorry for the spam]
> 
> Interesting idea. But, unlike SKBTX_DEV_ZEROCOPY this SKB can be inspected/modified by the stack without the need to copy things out. Additionally, the SKB may contain both data that is already placed in its final destination buffer (PDU data) and data that isn't (PDU header); it doesn't matter. Therefore, labeling the entire SKB as zerocopy doesn't convey the desired information. Moreover, skipping copies in the stack to receive zerocopy SKBs will require more invasive changes.
> 
> Our goal in this approach was to provide the smallest change that enables the desired functionality while preserving the performance of existing flows that do not care for it. An alternative approach we considered, which doesn't affect existing flows at all, was to make a special version of memcpy_to_page to be used by DDP providers (nvme-tcp). This alternative would require creating corresponding special versions for users of this function, such as skb_copy_datagram_iter. That is more invasive, so in this patchset we decided to avoid it.
> 
>> Fundamentally, this starts with offloading TCP socket buffers for a
>> specific flow, so generically a TCP Rx zerocopy for kernel stack managed
>> sockets (as opposed to AF_XDP's zerocopy). Why is this not building in
>> that level of infrastructure first and adding ULPs like NVME on top?
>>
> 
> We aren't using AF_XDP or any of the Rx zerocopy infrastructure, because it is unsuitable for data placement for nvme-tcp, which reorders responses relative to requests for efficiency and requires that data reside in specific destination buffers.
> 
> 

The AF_XDP reference was to differentiate one zerocopy use case (all
packets go to userspace) from another (kernel managed TCP socket with
zerocopy payload). You are focusing on a very narrow use case - kernel
based NVMe over TCP - of a more general problem.

You have a TCP socket and a design that only works for kernel owned
sockets. You have specialized queues in the NIC, a flow rule directing
packets to those queues. Presumably some ULP parser in the NIC
associated with the queues to process NVMe packets. Rather than copying
headers (ethernet/ip/tcp) to one buffer and payload to another (which is
similar to what Jonathan Lemon is working on), this design has a ULP
processor that just splits out the TCP payload even more making it
highly selective about which part of the packet is put into which
buffer. Take out the NVMe part, and it is header split with zerocopy for
the payload - a generic feature that can have a wider impact with NVMe
as a special case.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
  2020-12-08  0:42   ` David Ahern
@ 2020-12-09  0:57   ` David Ahern
  2020-12-09  1:11     ` David Ahern
  2020-12-09  8:25     ` Boris Pismenny
  1 sibling, 2 replies; 53+ messages in thread
From: David Ahern @ 2020-12-09  0:57 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

On 12/7/20 2:06 PM, Boris Pismenny wrote:
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 934de56644e7..fb35dcac03d2 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -84,6 +84,7 @@ enum {
>  	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
>  
>  	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
> +	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
>  
>  	/*
>  	 * Add your fresh new feature above and remember to update
> @@ -157,6 +158,7 @@ enum {
>  #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
>  #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
>  #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
> +#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)

All of the DDP naming seems wrong to me. I realize the specific use case
is targeted payloads of a ULP, but it is still S/W handing H/W specific
buffers for a payload of a flow.


>  
>  /* Finds the next feature with the highest number of the range of start till 0.
>   */
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a07c8e431f45..755766976408 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -934,6 +934,7 @@ struct dev_ifalias {
>  
>  struct devlink;
>  struct tlsdev_ops;
> +struct tcp_ddp_dev_ops;
>  
>  struct netdev_name_node {
>  	struct hlist_node hlist;
> @@ -1930,6 +1931,10 @@ struct net_device {
>  	const struct tlsdev_ops *tlsdev_ops;
>  #endif
>  
> +#ifdef CONFIG_TCP_DDP
> +	const struct tcp_ddp_dev_ops *tcp_ddp_ops;
> +#endif
> +
>  	const struct header_ops *header_ops;
>  
>  	unsigned int		flags;
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7338b3865a2a..a08b85b53aa8 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>   * @icsk_ulp_ops	   Pluggable ULP control hook
>   * @icsk_ulp_data	   ULP private data
>   * @icsk_clean_acked	   Clean acked data hook
> + * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
> + * @icsk_ulp_ddp_data	   ULP direct data placement private data

Neither of these socket layer intrusions are needed. All references but
1 -- the skbuff check -- are in the mlx5 driver. Any skb check that is
needed can be handled with a different setting.

>   * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
>   * @icsk_ca_state:	   Congestion control state
>   * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
> @@ -94,6 +96,8 @@ struct inet_connection_sock {
>  	const struct tcp_ulp_ops  *icsk_ulp_ops;
>  	void __rcu		  *icsk_ulp_data;
>  	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
> +	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
> +	void __rcu		  *icsk_ulp_ddp_data;
>  	struct hlist_node         icsk_listen_portaddr_node;
>  	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>  	__u8			  icsk_ca_state:5,
> diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
> new file mode 100644
> index 000000000000..df3264be4600
> --- /dev/null
> +++ b/include/net/tcp_ddp.h
> @@ -0,0 +1,129 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * tcp_ddp.h
> + *	Author:	Boris Pismenny <borisp@mellanox.com>
> + *	Copyright (C) 2020 Mellanox Technologies.
> + */
> +#ifndef _TCP_DDP_H
> +#define _TCP_DDP_H
> +
> +#include <linux/netdevice.h>
> +#include <net/inet_connection_sock.h>
> +#include <net/sock.h>
> +
> +/* limits returned by the offload driver, zero means don't care */
> +struct tcp_ddp_limits {
> +	int	 max_ddp_sgl_len;
> +};
> +
> +enum tcp_ddp_type {
> +	TCP_DDP_NVME = 1,
> +};
> +
> +/**
> + * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
> + * config implementations must use this as the first member.
> + * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
> + */
> +struct tcp_ddp_config {
> +	enum tcp_ddp_type    type;
> +	unsigned char        buf[];

you have this variable length buf, but it is not used (as far as I can
tell). But then ...


> +};
> +
> +/**
> + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
> + *
> + * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
> + * @cpda:       controller pdu data alignmend (dwords, 0's based)
> + * @dgst:       digest types enabled.
> + *              The netdev will offload crc if ddp_crc is supported.
> + * @queue_size: number of nvme-tcp IO queue elements
> + * @queue_id:   queue identifier
> + * @cpu_io:     cpu core running the IO thread for this queue
> + */
> +struct nvme_tcp_ddp_config {
> +	struct tcp_ddp_config   cfg;

... how would you use it within another struct like this?

> +
> +	u16			pfv;
> +	u8			cpda;
> +	u8			dgst;
> +	int			queue_size;
> +	int			queue_id;
> +	int			io_cpu;
> +};
> +
> +/**
> + * struct tcp_ddp_io - tcp ddp configuration for an IO request.
> + *
> + * @command_id:  identifier on the wire associated with these buffers
> + * @nents:       number of entries in the sg_table
> + * @sg_table:    describing the buffers for this IO request
> + * @first_sgl:   first SGL in sg_table
> + */
> +struct tcp_ddp_io {
> +	u32			command_id;
> +	int			nents;
> +	struct sg_table		sg_table;
> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
> +};
> +
> +/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
> + *
> + * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
> + *                     the device driver can use this to limit the resources allocated per queue.
> + * @tcp_ddp_sk_add:    add offload for the queue represennted by the socket+config pair.
> + *                     this function is used to configure either copy, crc or both offloads.
> + * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
> + * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
> + * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
> + *                     tcp_ddp_io.
> + * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
> + */
> +struct tcp_ddp_dev_ops {
> +	int (*tcp_ddp_limits)(struct net_device *netdev,
> +			      struct tcp_ddp_limits *limits);
> +	int (*tcp_ddp_sk_add)(struct net_device *netdev,
> +			      struct sock *sk,
> +			      struct tcp_ddp_config *config);
> +	void (*tcp_ddp_sk_del)(struct net_device *netdev,
> +			       struct sock *sk);
> +	int (*tcp_ddp_setup)(struct net_device *netdev,
> +			     struct sock *sk,
> +			     struct tcp_ddp_io *io);
> +	int (*tcp_ddp_teardown)(struct net_device *netdev,
> +				struct sock *sk,
> +				struct tcp_ddp_io *io,
> +				void *ddp_ctx);
> +	void (*tcp_ddp_resync)(struct net_device *netdev,
> +			       struct sock *sk, u32 seq);
> +};
> +
> +#define TCP_DDP_RESYNC_REQ (1 << 0)
> +
> +/**
> + * struct tcp_ddp_ulp_ops - Interface to register uppper layer Direct Data Placement (DDP) TCP offload
> + */
> +struct tcp_ddp_ulp_ops {
> +	/* NIC requests ulp to indicate if @seq is the start of a message */
> +	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
> +	/* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
> +	void (*ddp_teardown_done)(void *ddp_ctx);
> +};
> +
> +/**
> + * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
> + * use this as the first member.
> + */
> +struct tcp_ddp_ctx {
> +	enum tcp_ddp_type    type;
> +	unsigned char        buf[];

similar to my comment above, I did not see any uses of the buf element.





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock
  2020-12-07 21:06 ` [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock Boris Pismenny
@ 2020-12-09  1:06   ` David Ahern
  2020-12-09  7:41     ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-09  1:06 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz

On 12/7/20 2:06 PM, Boris Pismenny wrote:
> get_netdev_for_sock is a utility that is used to obtain
> the net_device structure from a connected socket.
> 
> Later patches will use this for nvme-tcp DDP and DDP CRC offloads.
> 
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  include/net/sock.h   | 17 +++++++++++++++++
>  net/tls/tls_device.c | 20 ++------------------
>  2 files changed, 19 insertions(+), 18 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 093b51719c69..a8f7393ea433 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2711,4 +2711,21 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs);
>  
>  int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len);
>  
> +/* Assume that the socket is already connected */
> +static inline struct net_device *get_netdev_for_sock(struct sock *sk, bool hold)
> +{
> +	struct dst_entry *dst = sk_dst_get(sk);
> +	struct net_device *netdev = NULL;
> +
> +	if (likely(dst)) {
> +		netdev = dst->dev;

I noticed you grab this once when the offload is configured. The dst
device could change - e.g., ECMP, routing changes. I'm guessing that
does not matter much for the use case - you are really wanting to
configure queues and zc buffers for a flow with the device; the netdev
is an easy gateway to get to it.

But, data center deployments tend to have redundant access points --
either multipath for L3 or bond for L2. For the latter, this offload
setup won't work - dst->dev will be the bond, the bond does not support
the offload, so user is out of luck.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-09  0:57   ` David Ahern
@ 2020-12-09  1:11     ` David Ahern
  2020-12-09  8:28       ` Boris Pismenny
  2020-12-09  8:25     ` Boris Pismenny
  1 sibling, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-09  1:11 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

On 12/8/20 5:57 PM, David Ahern wrote:
>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>> index 7338b3865a2a..a08b85b53aa8 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>>   * @icsk_ulp_ops	   Pluggable ULP control hook
>>   * @icsk_ulp_data	   ULP private data
>>   * @icsk_clean_acked	   Clean acked data hook
>> + * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
>> + * @icsk_ulp_ddp_data	   ULP direct data placement private data
> 
> Neither of these socket layer intrusions are needed. All references but
> 1 -- the skbuff check -- are in the mlx5 driver. Any skb check that is
> needed can be handled with a different setting.

missed the nvme ops for the driver to call back to the socket owner.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock
  2020-12-09  1:06   ` David Ahern
@ 2020-12-09  7:41     ` Boris Pismenny
  2020-12-10  3:39       ` David Ahern
  0 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-09  7:41 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz

On 09/12/2020 3:06, David Ahern wrote:
> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>> get_netdev_for_sock is a utility that is used to obtain
>> the net_device structure from a connected socket.
>>
>> Later patches will use this for nvme-tcp DDP and DDP CRC offloads.
>>
>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>>  include/net/sock.h   | 17 +++++++++++++++++
>>  net/tls/tls_device.c | 20 ++------------------
>>  2 files changed, 19 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 093b51719c69..a8f7393ea433 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -2711,4 +2711,21 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs);
>>  
>>  int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len);
>>  
>> +/* Assume that the socket is already connected */
>> +static inline struct net_device *get_netdev_for_sock(struct sock *sk, bool hold)
>> +{
>> +	struct dst_entry *dst = sk_dst_get(sk);
>> +	struct net_device *netdev = NULL;
>> +
>> +	if (likely(dst)) {
>> +		netdev = dst->dev;
> 
> I noticed you grab this once when the offload is configured. The dst
> device could change - e.g., ECMP, routing changes. I'm guessing that
> does not matter much for the use case - you are really wanting to
> configure queues and zc buffers for a flow with the device; the netdev
> is an easy gateway to get to it.
> 
> But, data center deployments tend to have redundant access points --
> either multipath for L3 or bond for L2. For the latter, this offload
> setup won't work - dst->dev will be the bond, the bond does not support
> the offload, so user is out of luck.
> 

You are correct, and bond support is currently under review for TLS,
i.e., search for "TLS TX HW offload for Bond". The same approach that
is applied there is relevant here. More generally, this offload is
very similar in concept to TLS offload (tls_device).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-09  0:38       ` David Ahern
@ 2020-12-09  8:15         ` Boris Pismenny
  2020-12-10  4:26           ` David Ahern
  0 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-09  8:15 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 09/12/2020 2:38, David Ahern wrote:
> 
> The AF_XDP reference was to differentiate one zerocopy use case (all
> packets go to userspace) from another (kernel managed TCP socket with
> zerocopy payload). You are focusing on a very narrow use case - kernel
> based NVMe over TCP - of a more general problem.
> 

Please note that although our framework implements support for nvme-tcp,
we designed it to fit iscsi as well, and hopefully future protocols too,
as general as we could. For why this could not be generalized further
see below.

> You have a TCP socket and a design that only works for kernel owned
> sockets. You have specialized queues in the NIC, a flow rule directing
> packets to those queues. Presumably some ULP parser in the NIC
> associated with the queues to process NVMe packets. Rather than copying
> headers (ethernet/ip/tcp) to one buffer and payload to another (which is
> similar to what Jonathan Lemon is working on), this design has a ULP
> processor that just splits out the TCP payload even more making it
> highly selective about which part of the packet is put into which
> buffer. Take out the NVMe part, and it is header split with zerocopy for
> the payload - a generic feature that can have a wider impact with NVMe
> as a special case.
> 

There is more to this than TCP zerocopy that exists in userspace or
inside the kernel. First, please note that the patches include support for
CRC offload as well as data placement. Second, data-placement is not the same
as zerocopy for the following reasons:
(1) The former places buffers *exactly* where the user requests
regardless of the order of response arrivals, while the latter places packets
in anonymous buffers according to packet arrival order. Therefore, zerocopy
can be implemented using data placement, but not vice versa.
(2) Data-placement supports sub-page zerocopy, unlike page-flipping
techniques (i.e., TCP_ZEROCOPY).
(3) Page-flipping can't work for any storage initiator because the
destination buffer is owned by some user pagecache or process using O_DIRECT.
(4) Storage over TCP PDUs are not necessarily aligned to TCP packets,
i.e., the PDU header can be in the middle of a packet, so header-data split
alone isn't enough.

I wish we could do the same using some simpler zerocopy mechanism,
it would indeed simplify things. But, unfortunately this would severely
restrict generality, no sub-page support and alignment between PDUs
and packets, and performance (ordering of PDUs).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-09  0:57   ` David Ahern
  2020-12-09  1:11     ` David Ahern
@ 2020-12-09  8:25     ` Boris Pismenny
  1 sibling, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-09  8:25 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack


On 09/12/2020 2:57, David Ahern wrote:
> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> index 934de56644e7..fb35dcac03d2 100644
>> --- a/include/linux/netdev_features.h
>> +++ b/include/linux/netdev_features.h
>> @@ -84,6 +84,7 @@ enum {
>>  	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
>>  
>>  	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
>> +	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
>>  
>>  	/*
>>  	 * Add your fresh new feature above and remember to update
>> @@ -157,6 +158,7 @@ enum {
>>  #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
>>  #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
>>  #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
>> +#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
> 
> All of the DDP naming seems wrong to me. I realize the specific use case
> is targeted payloads of a ULP, but it is still S/W handing H/W specific
> buffers for a payload of a flow.
> 
> 

This is intended to be used strictly by ULPs. DDP is the term such mechanisms
have gone by in the past. It is more than zerocopy, as explained before, so naming
it zerocopy would be misleading at best. I can propose another name. How about:
"Autonomous copy offload (ACO)" and "Autonomous crc offload (ACRC)"?
This will indicate that it is independent of other offloads, i.e. autonomous,
while also informing users of the functionality (copy/crc). Other names are
welcome too.


>>   * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
>>   * @icsk_ca_state:	   Congestion control state
>>   * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
>> @@ -94,6 +96,8 @@ struct inet_connection_sock {
>>  	const struct tcp_ulp_ops  *icsk_ulp_ops;
>>  	void __rcu		  *icsk_ulp_data;
>>  	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
>> +	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
>> +	void __rcu		  *icsk_ulp_ddp_data;
>>  	struct hlist_node         icsk_listen_portaddr_node;
>>  	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>>  	__u8			  icsk_ca_state:5,
>> diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
>> new file mode 100644
>> index 000000000000..df3264be4600
>> --- /dev/null
>> +++ b/include/net/tcp_ddp.h
>> @@ -0,0 +1,129 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + *
>> + * tcp_ddp.h
>> + *	Author:	Boris Pismenny <borisp@mellanox.com>
>> + *	Copyright (C) 2020 Mellanox Technologies.
>> + */
>> +#ifndef _TCP_DDP_H
>> +#define _TCP_DDP_H
>> +
>> +#include <linux/netdevice.h>
>> +#include <net/inet_connection_sock.h>
>> +#include <net/sock.h>
>> +
>> +/* limits returned by the offload driver, zero means don't care */
>> +struct tcp_ddp_limits {
>> +	int	 max_ddp_sgl_len;
>> +};
>> +
>> +enum tcp_ddp_type {
>> +	TCP_DDP_NVME = 1,
>> +};
>> +
>> +/**
>> + * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
>> + * config implementations must use this as the first member.
>> + * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
>> + */
>> +struct tcp_ddp_config {
>> +	enum tcp_ddp_type    type;
>> +	unsigned char        buf[];
> 
> you have this variable length buf, but it is not used (as far as I can
> tell). But then ...
> 
> 

True. This buf[] is here to indicate that users are expected to extend it with
the ULP-specific data, as in nvme_tcp_ddp_config. We can remove it and leave a comment
if you prefer that.
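For example, the intended pattern is that the ULP embeds the generic struct as the first member of its own config and the driver recovers the specific type, preferably with container_of rather than a cast. A sketch using the structs from this patch; the function name is made up:

/* Driver side of tcp_ddp_sk_add() (sketch): recover the NVMe-TCP
 * specific configuration from the generic tcp_ddp_config that is
 * embedded as its first member.
 */
static int example_ddp_sk_add(struct net_device *netdev, struct sock *sk,
			      struct tcp_ddp_config *tconfig)
{
	struct nvme_tcp_ddp_config *config;

	if (tconfig->type != TCP_DDP_NVME)
		return -EOPNOTSUPP;

	config = container_of(tconfig, struct nvme_tcp_ddp_config, cfg);
	/* config->pfv, config->cpda, config->dgst, ... are now usable */

	return 0;
}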

>> +};
>> +
>> +/**
>> + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
>> + *
>> + * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
>> + * @cpda:       controller pdu data alignmend (dwords, 0's based)
>> + * @dgst:       digest types enabled.
>> + *              The netdev will offload crc if ddp_crc is supported.
>> + * @queue_size: number of nvme-tcp IO queue elements
>> + * @queue_id:   queue identifier
>> + * @cpu_io:     cpu core running the IO thread for this queue
>> + */
>> +struct nvme_tcp_ddp_config {
>> +	struct tcp_ddp_config   cfg;
> 
> ... how would you use it within another struct like this?
> 

You don't.

>> +
>> +	u16			pfv;
>> +	u8			cpda;
>> +	u8			dgst;
>> +	int			queue_size;
>> +	int			queue_id;
>> +	int			io_cpu;
>> +};
>> +
>> +/**
>> + * struct tcp_ddp_io - tcp ddp configuration for an IO request.
>> + *
>> + * @command_id:  identifier on the wire associated with these buffers
>> + * @nents:       number of entries in the sg_table
>> + * @sg_table:    describing the buffers for this IO request
>> + * @first_sgl:   first SGL in sg_table
>> + */
>> +struct tcp_ddp_io {
>> +	u32			command_id;
>> +	int			nents;
>> +	struct sg_table		sg_table;
>> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
>> +};
>> +
>> +/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
>> + *
>> + * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
>> + *                     the device driver can use this to limit the resources allocated per queue.
>> + * @tcp_ddp_sk_add:    add offload for the queue represennted by the socket+config pair.
>> + *                     this function is used to configure either copy, crc or both offloads.
>> + * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
>> + * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
>> + * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
>> + *                     tcp_ddp_io.
>> + * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
>> + */
>> +struct tcp_ddp_dev_ops {
>> +	int (*tcp_ddp_limits)(struct net_device *netdev,
>> +			      struct tcp_ddp_limits *limits);
>> +	int (*tcp_ddp_sk_add)(struct net_device *netdev,
>> +			      struct sock *sk,
>> +			      struct tcp_ddp_config *config);
>> +	void (*tcp_ddp_sk_del)(struct net_device *netdev,
>> +			       struct sock *sk);
>> +	int (*tcp_ddp_setup)(struct net_device *netdev,
>> +			     struct sock *sk,
>> +			     struct tcp_ddp_io *io);
>> +	int (*tcp_ddp_teardown)(struct net_device *netdev,
>> +				struct sock *sk,
>> +				struct tcp_ddp_io *io,
>> +				void *ddp_ctx);
>> +	void (*tcp_ddp_resync)(struct net_device *netdev,
>> +			       struct sock *sk, u32 seq);
>> +};
>> +
>> +#define TCP_DDP_RESYNC_REQ (1 << 0)
>> +
>> +/**
>> + * struct tcp_ddp_ulp_ops - Interface to register uppper layer Direct Data Placement (DDP) TCP offload
>> + */
>> +struct tcp_ddp_ulp_ops {
>> +	/* NIC requests ulp to indicate if @seq is the start of a message */
>> +	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
>> +	/* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
>> +	void (*ddp_teardown_done)(void *ddp_ctx);
>> +};
>> +
>> +/**
>> + * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
>> + * use this as the first member.
>> + */
>> +struct tcp_ddp_ctx {
>> +	enum tcp_ddp_type    type;
>> +	unsigned char        buf[];
> 
> similar to my comment above, I did not see any uses of the buf element.
> 

Same idea; we will remove it and leave a comment.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-09  1:11     ` David Ahern
@ 2020-12-09  8:28       ` Boris Pismenny
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-09  8:28 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack




On 09/12/2020 3:11, David Ahern wrote:
> On 12/8/20 5:57 PM, David Ahern wrote:
>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>>> index 7338b3865a2a..a08b85b53aa8 100644
>>> --- a/include/net/inet_connection_sock.h
>>> +++ b/include/net/inet_connection_sock.h
>>> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>>>   * @icsk_ulp_ops	   Pluggable ULP control hook
>>>   * @icsk_ulp_data	   ULP private data
>>>   * @icsk_clean_acked	   Clean acked data hook
>>> + * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
>>> + * @icsk_ulp_ddp_data	   ULP direct data placement private data
>>
>> Neither of these socket layer intrusions are needed. All references but
>> 1 -- the skbuff check -- are in the mlx5 driver. Any skb check that is
>> needed can be handled with a different setting.
> 
> missed the nvme ops for the driver to callback to the socket owner.
> 

Hopefully it is clear that these are needed, and indeed we use them in
both the driver and the nvme-tcp layers.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock
  2020-12-09  7:41     ` Boris Pismenny
@ 2020-12-10  3:39       ` David Ahern
  2020-12-11 18:43         ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-10  3:39 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz

On 12/9/20 12:41 AM, Boris Pismenny wrote:
> 
> You are correct, and bond support is currently under review for TLS,
> i.e., search for "TLS TX HW offload for Bond". The same approach that

ok, missed that.

> is applied there is relevant here. More generally, this offload is
> very similar in concept to TLS offload (tls_device).
> 

I disagree with the TLS comparison. As an example, AIUI the TLS offload
works with userspace sockets, and it is a protocol offload.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-09  8:15         ` Boris Pismenny
@ 2020-12-10  4:26           ` David Ahern
  2020-12-11  2:01             ` Jakub Kicinski
  2020-12-13 18:21             ` Boris Pismenny
  0 siblings, 2 replies; 53+ messages in thread
From: David Ahern @ 2020-12-10  4:26 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 12/9/20 1:15 AM, Boris Pismenny wrote:
> On 09/12/2020 2:38, David Ahern wrote:
>>
>> The AF_XDP reference was to differentiate one zerocopy use case (all
>> packets go to userspace) from another (kernel managed TCP socket with
>> zerocopy payload). You are focusing on a very narrow use case - kernel
>> based NVMe over TCP - of a more general problem.
>>
> 
> Please note that although our framework implements support for nvme-tcp,
> we designed it to fit iscsi as well, and hopefully future protocols too,
> as general as we could. For why this could not be generalized further
> see below.
> 
>> You have a TCP socket and a design that only works for kernel owned
>> sockets. You have specialized queues in the NIC, a flow rule directing
>> packets to those queues. Presumably some ULP parser in the NIC
>> associated with the queues to process NVMe packets. Rather than copying
>> headers (ethernet/ip/tcp) to one buffer and payload to another (which is
>> similar to what Jonathan Lemon is working on), this design has a ULP
>> processor that just splits out the TCP payload even more making it
>> highly selective about which part of the packet is put into which
>> buffer. Take out the NVMe part, and it is header split with zerocopy for
>> the payload - a generic feature that can have a wider impact with NVMe
>> as a special case.
>>
> 
> There is more to this than TCP zerocopy that exists in userspace or
> inside the kernel. First, please note that the patches include support for
> CRC offload as well as data placement. Second, data-placement is not the same

Yes, the CRC offload is different, but I think it is orthogonal to the
'where does h/w put the data' problem.

> as zerocopy for the following reasons:
> (1) The former places buffers *exactly* where the user requests
> regardless of the order of response arrivals, while the latter places packets
> in anonymous buffers according to packet arrival order. Therefore, zerocopy
> can be implemented using data placement, but not vice versa.

Fundamentally, it is an SGL and a TCP sequence number. There is a
starting point where seq N == sgl element 0, position 0. Presumably
there is a hardware cursor to track where you are in filling the SGL as
packets are processed. You abort on OOO, so it seems like a fairly
straightforward problem.
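A rough model of that cursor, just to pin down the reasoning; types and names here are illustrative, not taken from the patch set:

/* Map an in-order TCP byte stream onto a posted SGL. seq_next is the
 * next expected sequence number; idx/off is the write position in the
 * SGL. Out-of-order data aborts the offload (resync), as noted above.
 */
struct ddp_cursor {
	u32		seq_next;
	unsigned int	idx;
	unsigned int	off;
};

static bool ddp_cursor_advance(struct ddp_cursor *c, struct scatterlist *sgl,
			       unsigned int nents, u32 seq, unsigned int len)
{
	if (seq != c->seq_next)		/* OOO: abort / resync */
		return false;

	while (len && c->idx < nents) {
		unsigned int room = sgl[c->idx].length - c->off;
		unsigned int step = min(room, len);

		c->off += step;
		c->seq_next += step;
		len -= step;
		if (c->off == sgl[c->idx].length) {
			c->idx++;
			c->off = 0;
		}
	}

	return len == 0;
}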

> (2) Data-placement supports sub-page zerocopy, unlike page-flipping
> techniques (i.e., TCP_ZEROCOPY).

I am not pushing for or suggesting any page-flipping. I understand the
limitations of that approach.

> (3) Page-flipping can't work for any storage initiator because the
> destination buffer is owned by some user pagecache or process using O_DIRECT.
> (4) Storage over TCP PDUs are not necessarily aligned to TCP packets,
> i.e., the PDU header can be in the middle of a packet, so header-data split
> alone isn't enough.

yes, TCP is a byte stream and you have to have a cursor marking the last
written spot in the SGL. More below.

> 
> I wish we could do the same using some simpler zerocopy mechanism,
> it would indeed simplify things. But, unfortunately this would severely
> restrict generality, no sub-page support and alignment between PDUs
> and packets, and performance (ordering of PDUs).
> 

My biggest concern is that you are adding checks in the fast path for a
very specific use case. If / when Rx zerocopy happens (and I suspect it
has to happen soon to handle the ever-increasing speeds), nothing about
this patch set is reusable and, worse, more checks are needed in the fast
path. I think it is best if you make this more generic — at least
anything touching core code.

For example, you have an iov static key hook managed by a driver for
generic code. There are a few ways around that. One is by adding skb
details to the nvme code — ie., walking the skb fragments, seeing that a
given frag is in your allocated memory and skipping the copy. This would
offer best performance since it skips all unnecessary checks. Another
option is to export __skb_datagram_iter, use it and define your own copy
handler that does the address compare and skips the copy. Key point -
only your code path is affected.
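A sketch of the first option; nvme_tcp_frag_is_ddp() and nvme_tcp_copy_frag() are hypothetical helpers used only for illustration:

/* Walk the skb fragments from inside nvme-tcp and copy only the frags
 * that the NIC did not already place into the request's buffers.
 * The linear part (PDU headers etc.) would be handled separately.
 */
static void nvme_tcp_consume_skb(struct nvme_tcp_queue *queue,
				 struct request *rq, struct sk_buff *skb)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

	for (i = 0; i < shinfo->nr_frags; i++) {
		const skb_frag_t *frag = &shinfo->frags[i];

		if (nvme_tcp_frag_is_ddp(queue, rq, skb_frag_page(frag)))
			continue;	/* already at its destination */

		nvme_tcp_copy_frag(rq, frag);
	}
}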

Similarly for the NVMe SGLs and DDP offload - a more generic solution
allows other use cases to build on this as opposed to the checks you
want for a special case. For example, a split at the protocol headers /
payload boundaries would be a generic solution where kernel managed
protocols get data in one buffer and socket data is put into a given
SGL. I am guessing that you have to be already doing this to put PDU
payloads into an SGL and other headers into other memory to make a
complete packet, so this is not too far off from what you are already doing.

Let me walk through an example with assumptions about your hardware's
capabilities, and you correct me where I am wrong. Assume you have a
'full' command response of this form:

 +------------- ... ----------------+---------+---------+--------+-----+
 |          big data segment        | PDU hdr | TCP hdr | IP hdr | eth |
 +------------- ... ----------------+---------+---------+--------+-----+

but it shows up to the host in 3 packets like this (ideal case):

 +-------------------------+---------+---------+--------+-----+
 |       data - seg 1      | PDU hdr | TCP hdr | IP hdr | eth |
 +-------------------------+---------+---------+--------+-----+
 +-----------------------------------+---------+--------+-----+
 |       data - seg 2                | TCP hdr | IP hdr | eth |
 +-----------------------------------+---------+--------+-----+
                   +-----------------+---------+--------+-----+
                   | payload - seg 3 | TCP hdr | IP hdr | eth |
                   +-----------------+---------+--------+-----+


The hardware splits the eth/IP/tcp headers from payload like this
(again, your hardware has to know these boundaries to accomplish what
you want):

 +-------------------------+---------+     +---------+--------+-----+
 |       data - seg 1      | PDU hdr |     | TCP hdr | IP hdr | eth |
 +-------------------------+---------+     +---------+--------+-----+

 +-----------------------------------+     +---------+--------+-----+
 |       data - seg 2                |     | TCP hdr | IP hdr | eth |
 +-----------------------------------+     +---------+--------+-----+

                   +-----------------+     +---------+--------+-----+
                   | payload - seg 3 |     | TCP hdr | IP hdr | eth |
                   +-----------------+     +---------+--------+-----+

Left side goes into the SGLs posted for this socket / flow; the right
side goes into some other memory resource made available for headers.
This is very close to what you are doing now - with the exception of the
PDU header being put to the right side. NVMe code then just needs to set
the iov offset (or adjust the base_addr) to skip over the PDU header -
standard options for an iov.

Yes, TCP is a byte stream, so the packets could very well show up like this:

 +--------------+---------+-----------+---------+--------+-----+
 | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
 +--------------+---------+-----------+---------+--------+-----+
 +-----------------------------------+---------+--------+-----+
 |     payload - seg 2               | TCP hdr | IP hdr | eth |
 +-----------------------------------+---------+--------+-----+
 +---------+-------------------------+---------+--------+-----+
 | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
 +---------+-------------------------+---------+--------+-----+

If your hardware can extract the NVMe payload into a targeted SGL like
you want in this set, then it has some logic for parsing headers and
"snapping" an SGL to a new element. ie., it already knows 'prev data'
goes with the in-progress PDU, sees more data, recognizes a new PDU
header and a new payload. That means it already has to handle a
'snap-to-PDU' style argument where the end of the payload closes out an
SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
this case, you want 'snap-to-PDU' but that could just as easily be 'no
snap at all', just a byte stream and filling an SGL after the protocol
headers.

Key point here is that this is the start of a generic header / data
split that could work for other applications - not just NVMe. eth/IP/TCP
headers are consumed by the Linux networking stack; data is in
application owned, socket based SGLs to avoid copies.

###

A dump of other comments about this patch set:
- there are a LOT of unnecessary typecasts around tcp_ddp_ctx that can
be avoided by using container_of.

- you have an accessor tcp_ddp_get_ctx but no setter; all uses of
tcp_ddp_get_ctx are within mlx5. why open code the set but use the
accessor for the get? Worse, mlx5e_nvmeotcp_queue_teardown actually has
both — uses the accessor and open codes setting icsk_ulp_ddp_data.

- the driver is storing private data on the socket. Nothing about the
socket layer cares and the mlx5 driver is already tracking that data in
priv->nvmeotcp->queue_hash. As I mentioned in a previous response, I
understand the socket ops are needed for the driver level to call into
the socket layer, but the data part does not seem to be needed.

- nvme_tcp_offload_socket and nvme_tcp_offload_limits both return int
yet the value is ignored

- the build robot found a number of problems (it pulls my github tree
and I pushed this set to it to move across computers).

I think the patch set would be easier to follow if you restructured the
patches to 1 thing only per patch -- e.g., split patch 2 into netdev
bits and socket bits. Add the netdev feature bit and operations in 1
patch and add the socket ops in a second patch with better commit logs
about why each is needed and what is done.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path
  2020-12-07 21:06 ` [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path Boris Pismenny
@ 2020-12-10 17:15   ` Shai Malin
  2020-12-14  6:38     ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: Shai Malin @ 2020-12-10 17:15 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: Yoray Zack, yorayz, boris.pismenny, Ben Ben-Ishay, benishay,
	linux-nvme, netdev, Or Gerlitz, ogerlitz, Ariel Elior,
	Michal Kalderon, malin1024

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index c0c33320fe65..ef96e4a02bbd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -14,6 +14,7 @@
 #include <linux/blk-mq.h>
 #include <crypto/hash.h>
 #include <net/busy_poll.h>
+#include <net/tcp_ddp.h>
 
 #include "nvme.h"
 #include "fabrics.h"
@@ -62,6 +63,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_OFFLOADS     = 3,
 };

The same comment from the previous version - we are concerned that perhaps 
the generic term "offload" for both the transport type (for the Marvell work) 
and for the DDP and CRC offload queue (for the Mellanox work) may be 
misleading and confusing to developers and to users.

As suggested by Sagi, we can call this NVME_TCP_Q_DDP. 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-10  4:26           ` David Ahern
@ 2020-12-11  2:01             ` Jakub Kicinski
  2020-12-11  2:43               ` David Ahern
  2020-12-13 18:21             ` Boris Pismenny
  1 sibling, 1 reply; 53+ messages in thread
From: Jakub Kicinski @ 2020-12-11  2:01 UTC (permalink / raw)
  To: David Ahern
  Cc: Boris Pismenny, Boris Pismenny, davem, saeedm, hch, sagi, axboe,
	kbusch, viro, edumazet, boris.pismenny, linux-nvme, netdev,
	benishay, ogerlitz, yorayz, Ben Ben-Ishay, Or Gerlitz,
	Yoray Zack, Boris Pismenny

On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
> Yes, TCP is a byte stream, so the packets could very well show up like this:
> 
>  +--------------+---------+-----------+---------+--------+-----+
>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>  +--------------+---------+-----------+---------+--------+-----+
>  +-----------------------------------+---------+--------+-----+
>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>  +-----------------------------------+---------+--------+-----+
>  +-------- +-------------------------+---------+--------+-----+
>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>  +---------+-------------------------+---------+--------+-----+
> 
> If your hardware can extract the NVMe payload into a targeted SGL like
> you want in this set, then it has some logic for parsing headers and
> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> goes with the in-progress PDU, sees more data, recognizes a new PDU
> header and a new payload. That means it already has to handle a
> 'snap-to-PDU' style argument where the end of the payload closes out an
> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> snap at all', just a byte stream and filling an SGL after the protocol
> headers.

This 'snap-to-PDU' requirement is something that I don't understand
with the current TCP zero copy. In case of, say, a storage application
which wants to send some headers (whatever RPC info, block number,
etc.) and then a 4k block of data - how does the RX side get just the
4k block into a page so it can zero copy it out to its storage device?

Per-connection state in the NIC, and FW parsing headers is one way,
but I wonder how this record split problem is best resolved generically.
Perhaps by passing hints in the headers somehow?

Sorry for the slight off-topic :)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-11  2:01             ` Jakub Kicinski
@ 2020-12-11  2:43               ` David Ahern
  2020-12-11 18:45                 ` Jakub Kicinski
  0 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-11  2:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Boris Pismenny, Boris Pismenny, davem, saeedm, hch, sagi, axboe,
	kbusch, viro, edumazet, boris.pismenny, linux-nvme, netdev,
	benishay, ogerlitz, yorayz, Ben Ben-Ishay, Or Gerlitz,
	Yoray Zack, Boris Pismenny

On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
>> Yes, TCP is a byte stream, so the packets could very well show up like this:
>>
>>  +--------------+---------+-----------+---------+--------+-----+
>>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>>  +--------------+---------+-----------+---------+--------+-----+
>>  +-----------------------------------+---------+--------+-----+
>>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>>  +-----------------------------------+---------+--------+-----+
>>  +-------- +-------------------------+---------+--------+-----+
>>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>>  +---------+-------------------------+---------+--------+-----+
>>
>> If your hardware can extract the NVMe payload into a targeted SGL like
>> you want in this set, then it has some logic for parsing headers and
>> "snapping" an SGL to a new element. ie., it already knows 'prev data'
>> goes with the in-progress PDU, sees more data, recognizes a new PDU
>> header and a new payload. That means it already has to handle a
>> 'snap-to-PDU' style argument where the end of the payload closes out an
>> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
>> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
>> this case, you want 'snap-to-PDU' but that could just as easily be 'no
>> snap at all', just a byte stream and filling an SGL after the protocol
>> headers.
> 
> This 'snap-to-PDU' requirement is something that I don't understand
> with the current TCP zero copy. In case of, say, a storage application

current TCP zero-copy does not handle this and it can't AFAIK. I believe
it requires hardware level support where an Rx queue is dedicated to a
flow / socket and some degree of header and payload splitting (header is
consumed by the kernel stack and payload goes to socket owner's memory).

> which wants to send some headers (whatever RPC info, block number,
> etc.) and then a 4k block of data - how does the RX side get just the
> 4k block a into a page so it can zero copy it out to its storage device?
> 
> Per-connection state in the NIC, and FW parsing headers is one way,
> but I wonder how this record split problem is best resolved generically.
> Perhaps by passing hints in the headers somehow?
> 
> Sorry for the slight off-topic :)
> 
Hardware has to be parsing the incoming packets to find the usual
ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
have some kind of ULP processor to know how to parse the TCP byte stream
at least well enough to find the PDU header and interpret it to get pdu
header length and payload length.

At that point you push the protocol headers (eth/ip/tcp) into one buffer
for the kernel stack protocols and put the payload into another. The
former would be some page owned by the OS and the latter owned by the
process / socket (generically, in this case it is a kernel level
socket). In addition, since the payload is spread across multiple
packets the hardware has to keep track of TCP sequence number and its
current place in the SGL where it is writing the payload to keep the
bytes contiguous and detect out-of-order.

If the ULP processor knows about PDU headers it knows when enough
payload has been found to satisfy that PDU in which case it can tell the
cursor to move on to the next SGL element (or separate SGL). That's what
I meant by 'snap-to-PDU'.

Alternatively, if it is any random application with a byte stream not
understood by hardware, the cursor just keeps moving along the SGL
elements assigned it for this particular flow.

If you have a socket whose payload is getting offloaded to its own queue
(which this set is effectively doing), you can create the queue with
some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
that controls the parsing when you stop writing to one SGL element and
move on to the next. Again, assuming hardware support for such attributes.

I don't work for Nvidia, so this is all supposition based on what the
patches are doing.
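
To make the cursor idea above concrete, a rough, purely illustrative sketch
of the per-flow state such a ULP processor would keep (all names are
hypothetical, not from the patches; PDU header parsing, which would reload
pdu_remaining, is out of scope, and a payload segment is assumed not to
span a PDU boundary):

#include <linux/types.h>

struct ddp_cursor {
	u32 next_seq;			/* next expected TCP sequence number */
	unsigned int sgl_idx;		/* SGL element currently being filled */
	unsigned int sgl_off;		/* write offset within that element */
	unsigned int pdu_remaining;	/* payload bytes left in the current PDU */
};

/* Place one in-order payload segment of 'len' bytes starting at 'seq'. */
static void ddp_place_payload(struct ddp_cursor *c, u32 seq, unsigned int len)
{
	if (seq != c->next_seq)
		return;			/* OOO: leave this data to the software path */

	/* (the DMA engine writes len payload bytes at sgl[sgl_idx] + sgl_off) */
	c->next_seq += len;
	c->sgl_off += len;
	c->pdu_remaining -= len;

	if (!c->pdu_remaining) {	/* PDU complete: snap to the next SGL element */
		c->sgl_idx++;
		c->sgl_off = 0;
	}
}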

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock
  2020-12-10  3:39       ` David Ahern
@ 2020-12-11 18:43         ` Boris Pismenny
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-11 18:43 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz


On 10/12/2020 5:39, David Ahern wrote:
> On 12/9/20 12:41 AM, Boris Pismenny wrote:
> 
>> is applied there is relevant here. More generally, this offload is
>> very similar in concept to TLS offload (tls_device).
>>
> 
> I disagree with the TLS comparison. As an example, AIUI the TLS offload
> works with userspace sockets, and it is a protocol offload.
> 

First, tls offload can be used by kernel sockets just as well as by
userspace sockets. Second, it is not a protocol offload per se; if you are
looking for a catch-phrase to define it, partial/autonomous offload is what
I think fits better.

IMO, the concepts are very similar, and those who implemented offload
using the tls_device APIs will find that this offload fits hardware drivers
just as naturally.

To compare between them, please look at the NDOs, in particular the add/del/resync.
Additionally, see the similarity between skb->ddp_crc and skb->decrypted.
Also, consider the software-fallback flows used in both.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-11  2:43               ` David Ahern
@ 2020-12-11 18:45                 ` Jakub Kicinski
  2020-12-11 18:58                   ` Eric Dumazet
                                     ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Jakub Kicinski @ 2020-12-11 18:45 UTC (permalink / raw)
  To: David Ahern
  Cc: Boris Pismenny, Boris Pismenny, davem, saeedm, hch, sagi, axboe,
	kbusch, viro, edumazet, boris.pismenny, linux-nvme, netdev,
	benishay, ogerlitz, yorayz, Ben Ben-Ishay, Or Gerlitz,
	Yoray Zack, Boris Pismenny

On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
> On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> > On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:  
> >> Yes, TCP is a byte stream, so the packets could very well show up like this:
> >>
> >>  +--------------+---------+-----------+---------+--------+-----+
> >>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
> >>  +--------------+---------+-----------+---------+--------+-----+
> >>  +-----------------------------------+---------+--------+-----+
> >>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
> >>  +-----------------------------------+---------+--------+-----+
> >>  +-------- +-------------------------+---------+--------+-----+
> >>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
> >>  +---------+-------------------------+---------+--------+-----+
> >>
> >> If your hardware can extract the NVMe payload into a targeted SGL like
> >> you want in this set, then it has some logic for parsing headers and
> >> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> >> goes with the in-progress PDU, sees more data, recognizes a new PDU
> >> header and a new payload. That means it already has to handle a
> >> 'snap-to-PDU' style argument where the end of the payload closes out an
> >> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> >> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> >> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> >> snap at all', just a byte stream and filling an SGL after the protocol
> >> headers.  
> > 
> > This 'snap-to-PDU' requirement is something that I don't understand
> > with the current TCP zero copy. In case of, say, a storage application  
> 
> current TCP zero-copy does not handle this and it can't AFAIK. I believe
> it requires hardware level support where an Rx queue is dedicated to a
> flow / socket and some degree of header and payload splitting (header is
> consumed by the kernel stack and payload goes to socket owner's memory).

Yet, Google claims to use the RX ZC in production, and with CX3 Pro /
mlx4 NICs.

A simple workaround that comes to mind is to have the headers and payloads
on separate TCP streams. That doesn't seem too slick... but neither is
the 4k MSS, so maybe that's what Google does?

> > which wants to send some headers (whatever RPC info, block number,
> > etc.) and then a 4k block of data - how does the RX side get just the
> > 4k block a into a page so it can zero copy it out to its storage device?
> > 
> > Per-connection state in the NIC, and FW parsing headers is one way,
> > but I wonder how this record split problem is best resolved generically.
> > Perhaps by passing hints in the headers somehow?
> > 
> > Sorry for the slight off-topic :)
> >   
> Hardware has to be parsing the incoming packets to find the usual
> ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
> have some kind of ULP processor to know how to parse the TCP byte stream
> at least well enough to find the PDU header and interpret it to get pdu
> header length and payload length.

The big difference between normal headers and L7 headers is that one is
at ~constant offset, self-contained, and always complete (PDU header
can be split across segments).

Edwin Peer did an implementation of TLS ULP for the NFP, it was
complex. Not to mention it's L7 protocol ossification.

To put it bluntly maybe it's fine for smaller shops but I'm guessing
it's going to be a hard sell to hyperscalers and people who don't like
to be locked in to HW.

> At that point you push the protocol headers (eth/ip/tcp) into one buffer
> for the kernel stack protocols and put the payload into another. The
> former would be some page owned by the OS and the latter owned by the
> process / socket (generically, in this case it is a kernel level
> socket). In addition, since the payload is spread across multiple
> packets the hardware has to keep track of TCP sequence number and its
> current place in the SGL where it is writing the payload to keep the
> bytes contiguous and detect out-of-order.
> 
> If the ULP processor knows about PDU headers it knows when enough
> payload has been found to satisfy that PDU in which case it can tell the
> cursor to move on to the next SGL element (or separate SGL). That's what
> I meant by 'snap-to-PDU'.
> 
> Alternatively, if it is any random application with a byte stream not
> understood by hardware, the cursor just keeps moving along the SGL
> elements assigned it for this particular flow.
> 
> If you have a socket whose payload is getting offloaded to its own queue
> (which this set is effectively doing), you can create the queue with
> some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
> that controls the parsing when you stop writing to one SGL element and
> move on to the next. Again, assuming hardware support for such attributes.
> 
> I don't work for Nvidia, so this is all supposition based on what the
> patches are doing.

Ack, these patches are not exciting (to me), so I'm wondering if there
is a better way. The only reason NIC would have to understand a ULP for
ZC is to parse out header/message lengths. There's gotta be a way to
pass those in header options or such...

And, you know, if we figure something out - maybe we stand a chance
against having 4 different zero copy implementations (this, TCP,
AF_XDP, netgpu) :(

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-11 18:45                 ` Jakub Kicinski
@ 2020-12-11 18:58                   ` Eric Dumazet
  2020-12-11 19:59                   ` David Ahern
  2020-12-13 18:34                   ` Boris Pismenny
  2 siblings, 0 replies; 53+ messages in thread
From: Eric Dumazet @ 2020-12-11 18:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Boris Pismenny, Boris Pismenny, David Miller,
	Saeed Mahameed, Christoph Hellwig, sagi, axboe, kbusch, Al Viro,
	boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On Fri, Dec 11, 2020 at 7:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
> > On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> > > On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
> > >> Yes, TCP is a byte stream, so the packets could very well show up like this:
> > >>
> > >>  +--------------+---------+-----------+---------+--------+-----+
> > >>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
> > >>  +--------------+---------+-----------+---------+--------+-----+
> > >>  +-----------------------------------+---------+--------+-----+
> > >>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
> > >>  +-----------------------------------+---------+--------+-----+
> > >>  +-------- +-------------------------+---------+--------+-----+
> > >>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
> > >>  +---------+-------------------------+---------+--------+-----+
> > >>
> > >> If your hardware can extract the NVMe payload into a targeted SGL like
> > >> you want in this set, then it has some logic for parsing headers and
> > >> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> > >> goes with the in-progress PDU, sees more data, recognizes a new PDU
> > >> header and a new payload. That means it already has to handle a
> > >> 'snap-to-PDU' style argument where the end of the payload closes out an
> > >> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> > >> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> > >> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> > >> snap at all', just a byte stream and filling an SGL after the protocol
> > >> headers.
> > >
> > > This 'snap-to-PDU' requirement is something that I don't understand
> > > with the current TCP zero copy. In case of, say, a storage application
> >
> > current TCP zero-copy does not handle this and it can't AFAIK. I believe
> > it requires hardware level support where an Rx queue is dedicated to a
> > flow / socket and some degree of header and payload splitting (header is
> > consumed by the kernel stack and payload goes to socket owner's memory).
>
> Yet, Google claims to use the RX ZC in production, and with a CX3 Pro /
> mlx4 NICs.

And other proprietary NIC

We do not dedicate RX queues, TCP RX zerocopy is not direct placement
into application space.

These slides (from netdev 0x14) might be helpful

https://netdevconf.info/0x14/pub/slides/62/Implementing%20TCP%20RX%20zero%20copy.pdf
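
For reference, the existing TCP RX zerocopy mentioned above is driven from
userspace roughly like this; a minimal sketch using the uapi
TCP_ZEROCOPY_RECEIVE getsockopt, assuming the region at 'addr' has already
been mmap()ed over the socket, with error handling and the recvmsg()
fallback for unmapped bytes omitted:

#include <linux/tcp.h>		/* TCP_ZEROCOPY_RECEIVE, struct tcp_zerocopy_receive */
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <string.h>
#include <sys/socket.h>

/* Map up to 'chunk' bytes of received payload at 'addr' (mmap()ed over fd). */
static long rx_zerocopy(int fd, void *addr, unsigned int chunk)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)addr;
	zc.length = chunk;
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;
	/* zc.length bytes are now mapped at addr; zc.recv_skip_hint bytes,
	 * if any, must still be read with recvmsg(). */
	return zc.length;
}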


>
> Simple workaround that comes to mind is have the headers and payloads
> on separate TCP streams. That doesn't seem too slick.. but neither is
> the 4k MSS, so maybe that's what Google does?
>
> > > which wants to send some headers (whatever RPC info, block number,
> > > etc.) and then a 4k block of data - how does the RX side get just the
> > > 4k block a into a page so it can zero copy it out to its storage device?
> > >
> > > Per-connection state in the NIC, and FW parsing headers is one way,
> > > but I wonder how this record split problem is best resolved generically.
> > > Perhaps by passing hints in the headers somehow?
> > >
> > > Sorry for the slight off-topic :)
> > >
> > Hardware has to be parsing the incoming packets to find the usual
> > ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
> > have some kind of ULP processor to know how to parse the TCP byte stream
> > at least well enough to find the PDU header and interpret it to get pdu
> > header length and payload length.
>
> The big difference between normal headers and L7 headers is that one is
> at ~constant offset, self-contained, and always complete (PDU header
> can be split across segments).
>
> Edwin Peer did an implementation of TLS ULP for the NFP, it was
> complex. Not to mention it's L7 protocol ossification.
>
> To put it bluntly maybe it's fine for smaller shops but I'm guessing
> it's going to be a hard sell to hyperscalers and people who don't like
> to be locked in to HW.
>
> > At that point you push the protocol headers (eth/ip/tcp) into one buffer
> > for the kernel stack protocols and put the payload into another. The
> > former would be some page owned by the OS and the latter owned by the
> > process / socket (generically, in this case it is a kernel level
> > socket). In addition, since the payload is spread across multiple
> > packets the hardware has to keep track of TCP sequence number and its
> > current place in the SGL where it is writing the payload to keep the
> > bytes contiguous and detect out-of-order.
> >
> > If the ULP processor knows about PDU headers it knows when enough
> > payload has been found to satisfy that PDU in which case it can tell the
> > cursor to move on to the next SGL element (or separate SGL). That's what
> > I meant by 'snap-to-PDU'.
> >
> > Alternatively, if it is any random application with a byte stream not
> > understood by hardware, the cursor just keeps moving along the SGL
> > elements assigned it for this particular flow.
> >
> > If you have a socket whose payload is getting offloaded to its own queue
> > (which this set is effectively doing), you can create the queue with
> > some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
> > that controls the parsing when you stop writing to one SGL element and
> > move on to the next. Again, assuming hardware support for such attributes.
> >
> > I don't work for Nvidia, so this is all supposition based on what the
> > patches are doing.
>
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
>
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-11 18:45                 ` Jakub Kicinski
  2020-12-11 18:58                   ` Eric Dumazet
@ 2020-12-11 19:59                   ` David Ahern
  2020-12-11 23:05                     ` Jonathan Lemon
  2020-12-13 18:34                   ` Boris Pismenny
  2 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-11 19:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Boris Pismenny, Boris Pismenny, davem, saeedm, hch, sagi, axboe,
	kbusch, viro, edumazet, boris.pismenny, linux-nvme, netdev,
	benishay, ogerlitz, yorayz, Ben Ben-Ishay, Or Gerlitz,
	Yoray Zack, Boris Pismenny

On 12/11/20 11:45 AM, Jakub Kicinski wrote:
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
> 
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(

AF_XDP is for userspace IP/TCP/packet handling.

netgpu is fundamentally a kernel stack socket with zerocopy direct to
userspace buffers. Yes, it has more complications, like the process is
running on a GPU, but the essential characteristics are header / payload
split, a kernel stack managed socket, and dedicated H/W queues for the
socket, and those are all common to what the nvme patches are doing. So,
can we create a common framework for those characteristics which enables
other use cases (like a more generic Rx zerocopy)?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-11 19:59                   ` David Ahern
@ 2020-12-11 23:05                     ` Jonathan Lemon
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Lemon @ 2020-12-11 23:05 UTC (permalink / raw)
  To: David Ahern
  Cc: Jakub Kicinski, Boris Pismenny, Boris Pismenny, davem, saeedm,
	hch, sagi, axboe, kbusch, viro, edumazet, boris.pismenny,
	linux-nvme, netdev, benishay, ogerlitz, yorayz, Ben Ben-Ishay,
	Or Gerlitz, Yoray Zack, Boris Pismenny

On Fri, Dec 11, 2020 at 12:59:52PM -0700, David Ahern wrote:
> On 12/11/20 11:45 AM, Jakub Kicinski wrote:
> > Ack, these patches are not exciting (to me), so I'm wondering if there
> > is a better way. The only reason NIC would have to understand a ULP for
> > ZC is to parse out header/message lengths. There's gotta be a way to
> > pass those in header options or such...
> > 
> > And, you know, if we figure something out - maybe we stand a chance
> > against having 4 different zero copy implementations (this, TCP,
> > AF_XDP, netgpu) :(
> 
> AF_XDP is for userspace IP/TCP/packet handling.
> 
> netgpu is fundamentally kernel stack socket with zerocopy direct to
> userspace buffers. Yes, it has more complications like the process is
> running on a GPU, but essential characteristics are header / payload
> split, a kernel stack managed socket, and dedicated H/W queues for the
> socket and those are all common to what the nvme patches are doing. So,
> can we create a common framework for those characteristics which enables
> other use cases (like a more generic Rx zerocopy)?

AF_XDP could also be viewed as a special case of netgpu.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-10  4:26           ` David Ahern
  2020-12-11  2:01             ` Jakub Kicinski
@ 2020-12-13 18:21             ` Boris Pismenny
  2020-12-15  5:19               ` David Ahern
  1 sibling, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-13 18:21 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 10/12/2020 6:26, David Ahern wrote:
> On 12/9/20 1:15 AM, Boris Pismenny wrote:
>> On 09/12/2020 2:38, David Ahern wrote:
[...]
>>
>> There is more to this than TCP zerocopy that exists in userspace or
>> inside the kernel. First, please note that the patches include support for
>> CRC offload as well as data placement. Second, data-placement is not the same
> 
> Yes, the CRC offload is different, but I think it is orthogonal to the
> 'where does h/w put the data' problem.
> 

I agree

>> as zerocopy for the following reasons:
>> (1) The former places buffers *exactly* where the user requests
>> regardless of the order of response arrivals, while the latter places packets
>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>> can be implemented using data placement, but not vice versa.
> 
> Fundamentally, it is an SGL and a TCP sequence number. There is a
> starting point where seq N == sgl element 0, position 0. Presumably
> there is a hardware cursor to track where you are in filling the SGL as
> packets are processed. You abort on OOO, so it seems like a fairly
> straightfoward problem.
> 

We do not abort on OOO. Moreover, we can keep going as long as
PDU headers are not reordered.

>> (2) Data-placement supports sub-page zerocopy, unlike page-flipping
>> techniques (i.e., TCP_ZEROCOPY).
> 
> I am not pushing for or suggesting any page-flipping. I understand the
> limitations of that approach.
> 
>> (3) Page-flipping can't work for any storage initiator because the
>> destination buffer is owned by some user pagecache or process using O_DIRECT.
>> (4) Storage over TCP PDUs are not necessarily aligned to TCP packets,
>> i.e., the PDU header can be in the middle of a packet, so header-data split
>> alone isn't enough.
> 
> yes, TCP is a byte stream and you have to have a cursor marking last
> written spot in the SGL. More below.
> 
>>
>> I wish we could do the same using some simpler zerocopy mechanism,
>> it would indeed simplify things. But, unfortunately this would severely
>> restrict generality, no sub-page support and alignment between PDUs
>> and packets, and performance (ordering of PDUs).
>>
> 
> My biggest concern is that you are adding checks in the fast path for a
> very specific use case. If / when Rx zerocopy happens (and I suspect it
> has to happen soon to handle the ever increasing speeds), nothing about
> this patch set is reusable and worse more checks are needed in the fast
> path. I think it is best if you make this more generic — at least
> anything touching core code.
> 
> For example, you have an iov static key hook managed by a driver for
> generic code. There are a few ways around that. One is by adding skb
> details to the nvme code — ie., walking the skb fragments, seeing that a
> given frag is in your allocated memory and skipping the copy. This would
> offer best performance since it skips all unnecessary checks. Another
> option is to export __skb_datagram_iter, use it and define your own copy
> handler that does the address compare and skips the copy. Key point -
> only your code path is affected.

I'll submit a V2 that is less invasive to core code:
IMO exporting __skb_datagram_iter and all its child functions
is more generic, so we'll use that.
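
To illustrate the "address compare and skip the copy" idea from the review
comment above, a trivially small sketch (hypothetical helper name, not the
actual code used by the series): the copy step becomes a no-op when the NIC
has already placed the fragment in the destination buffer, so source and
destination addresses coincide.

#include <linux/string.h>
#include <linux/types.h>

/* Illustrative only: skip the memcpy when DDP already placed the data. */
static inline void ddp_copy_frag(void *dst, const void *src, size_t len)
{
	if (dst == src)		/* payload was placed directly by the NIC */
		return;
	memcpy(dst, src, len);
}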

> 
> Similarly for the NVMe SGLs and DDP offload - a more generic solution
> allows other use cases to build on this as opposed to the checks you
> want for a special case. For example, a split at the protocol headers /
> payload boundaries would be a generic solution where kernel managed
> protocols get data in one buffer and socket data is put into a given
> SGL. I am guessing that you have to be already doing this to put PDU
> payloads into an SGL and other headers into other memory to make a
> complete packet, so this is not too far off from what you are already doing.
> 

Splitting at protocol header boundaries and placing data into socket-defined
SGLs is not enough for nvme-tcp, because the nvme-tcp protocol can reorder
responses. Here is an example:

the host submits the following requests:
+--------+--------+--------+
| Read 1 | Read 2 | Read 3 |
+--------+--------+--------+

the target responds with the following responses:
+--------+--------+--------+
| Resp 2 | Resp 3 | Resp 1 |
+--------+--------+--------+

Therefore, hardware must hold a mapping between PDU identifiers (command_id)
and the corresponding buffers. This interface is missing in the proposal
above, which is why it won't work for nvme-tcp.
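
Conceptually, the state the device keeps per offloaded queue is a small
table keyed by command_id. A sketch only, with hypothetical names and a
made-up MAX_CMDS bound, not the driver's actual data structures:

#include <linux/scatterlist.h>

#define MAX_CMDS 128	/* illustrative bound on outstanding commands */

struct ddp_io_map {
	struct scatterlist *sgl;	/* destination buffers for this command */
	unsigned int nents;
	bool in_use;
};

struct ddp_queue_map {
	struct ddp_io_map io[MAX_CMDS];	/* indexed by the PDU's command_id */
};

/* Look up where the payload of a response with 'command_id' should land. */
static struct scatterlist *ddp_lookup(struct ddp_queue_map *q, u16 command_id)
{
	if (command_id >= MAX_CMDS || !q->io[command_id].in_use)
		return NULL;		/* not offloaded: take the software path */
	return q->io[command_id].sgl;
}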

> Let me walk through an example with assumptions about your hardware's
> capabilities, and you correct me where I am wrong. Assume you have a
> 'full' command response of this form:
> 
>  +------------- ... ----------------+---------+---------+--------+-----+
>  |          big data segment        | PDU hdr | TCP hdr | IP hdr | eth |
>  +------------- ... ----------------+---------+---------+--------+-----+
> 
> but it shows up to the host in 3 packets like this (ideal case):
> 
>  +-------------------------+---------+---------+--------+-----+
>  |       data - seg 1      | PDU hdr | TCP hdr | IP hdr | eth |
>  +-------------------------+---------+---------+--------+-----+
>  +-----------------------------------+---------+--------+-----+
>  |       data - seg 2                | TCP hdr | IP hdr | eth |
>  +-----------------------------------+---------+--------+-----+
>                    +-----------------+---------+--------+-----+
>                    | payload - seg 3 | TCP hdr | IP hdr | eth |
>                    +-----------------+---------+--------+-----+
> 
> 
> The hardware splits the eth/IP/tcp headers from payload like this
> (again, your hardware has to know these boundaries to accomplish what
> you want):
> 
>  +-------------------------+---------+     +---------+--------+-----+
>  |       data - seg 1      | PDU hdr |     | TCP hdr | IP hdr | eth |
>  +-------------------------+---------+     +---------+--------+-----+
> 
>  +-----------------------------------+     +---------+--------+-----+
>  |       data - seg 2                |     | TCP hdr | IP hdr | eth |
>  +-----------------------------------+     +---------+--------+-----+
> 
>                    +-----------------+     +---------+--------+-----+
>                    | payload - seg 3 |     | TCP hdr | IP hdr | eth |
>                    +-----------------+     +---------+--------+-----+
> 
> Left side goes into the SGLs posted for this socket / flow; the right
> side goes into some other memory resource made available for headers.
> This is very close to what you are doing now - with the exception of the
> PDU header being put to the right side. NVMe code then just needs to set
> the iov offset (or adjust the base_addr) to skip over the PDU header -
> standard options for an iov.
> 
> Yes, TCP is a byte stream, so the packets could very well show up like this:
> 
>  +--------------+---------+-----------+---------+--------+-----+
>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>  +--------------+---------+-----------+---------+--------+-----+
>  +-----------------------------------+---------+--------+-----+
>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>  +-----------------------------------+---------+--------+-----+
>  +-------- +-------------------------+---------+--------+-----+
>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>  +---------+-------------------------+---------+--------+-----+
> 
> If your hardware can extract the NVMe payload into a targeted SGL like
> you want in this set, then it has some logic for parsing headers and
> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> goes with the in-progress PDU, sees more data, recognizes a new PDU
> header and a new payload. That means it already has to handle a
> 'snap-to-PDU' style argument where the end of the payload closes out an
> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> snap at all', just a byte stream and filling an SGL after the protocol
> headers.
> 
> Key point here is that this is the start of a generic header / data
> split that could work for other applications - not just NVMe. eth/IP/TCP
> headers are consumed by the Linux networking stack; data is in
> application owned, socket based SGLs to avoid copies.
> 

I think that the interface we created (tcp_ddp) is sufficiently generic
for the task at hand, which is offloading protocols that can re-order
their responses, a non-trivial task that we claim is important.

We designed it to support other protocols and not just nvme-tcp,
which is merely an example. For instance, I think that supporting iSCSI
would be natural, and that other protocols will fit nicely.

> ###
> 
> A dump of other comments about this patch set:

Thanks for reviewing! We will fix and resubmit.

> - there are a LOT of unnecessary typecasts around tcp_ddp_ctx that can
> be avoided by using container_of.
> 
> - you have an accessor tcp_ddp_get_ctx but no setter; all uses of
> tcp_ddp_get_ctx are within mlx5. why open code the set but use the
> accessor for the get? Worse, mlx5e_nvmeotcp_queue_teardown actually has
> both — uses the accessor and open codes setting icsk_ulp_ddp_data.
> 
> - the driver is storing private data on the socket. Nothing about the
> socket layer cares and the mlx5 driver is already tracking that data in
> priv->nvmeotcp->queue_hash. As I mentioned in a previous response, I
> understand the socket ops are needed for the driver level to call into
> the socket layer, but the data part does not seem to be needed.

The socket layer does care: the socket won't disappear under the
driver, because the socket layer cleans things up and calls the driver
back before the socket goes away. This is similar to what we have in tls,
where drivers store some private per-socket data to assist offload.

> 
> - nvme_tcp_offload_socket and nvme_tcp_offload_limits both return int
> yet the value is ignored
> 

This is on purpose. Users can tell whether offload succeeded from the
NIC's ethtool counters. Offload is opportunistic and its failure is
non-fatal, and as nvme-tcp has no stats of its own we intentionally ignore
the returned values for now.

> - the build robot found a number of problems (it pulls my github tree
> and I pushed this set to it to move across computers).
> 
> I think the patch set would be easier to follow if you restructured the
> patches to 1 thing only per patch -- e.g., split patch 2 into netdev
> bits and socket bits. Add the netdev feature bit and operations in 1
> patch and add the socket ops in a second patch with better commit logs
> about why each is needed and what is done.
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-11 18:45                 ` Jakub Kicinski
  2020-12-11 18:58                   ` Eric Dumazet
  2020-12-11 19:59                   ` David Ahern
@ 2020-12-13 18:34                   ` Boris Pismenny
  2 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-13 18:34 UTC (permalink / raw)
  To: Jakub Kicinski, David Ahern
  Cc: Boris Pismenny, davem, saeedm, hch, sagi, axboe, kbusch, viro,
	edumazet, boris.pismenny, linux-nvme, netdev, benishay, ogerlitz,
	yorayz, Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny



On 11/12/2020 20:45, Jakub Kicinski wrote:
> On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
>> On 12/10/20 7:01 PM, Jakub Kicinski wrote:
>>> On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:  
>>>> Yes, TCP is a byte stream, so the packets could very well show up like this:
>>>>
>>>>  +--------------+---------+-----------+---------+--------+-----+
>>>>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>>>>  +--------------+---------+-----------+---------+--------+-----+
>>>>  +-----------------------------------+---------+--------+-----+
>>>>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>>>>  +-----------------------------------+---------+--------+-----+
>>>>  +-------- +-------------------------+---------+--------+-----+
>>>>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>>>>  +---------+-------------------------+---------+--------+-----+
>>>>
>>>> If your hardware can extract the NVMe payload into a targeted SGL like
>>>> you want in this set, then it has some logic for parsing headers and
>>>> "snapping" an SGL to a new element. ie., it already knows 'prev data'
>>>> goes with the in-progress PDU, sees more data, recognizes a new PDU
>>>> header and a new payload. That means it already has to handle a
>>>> 'snap-to-PDU' style argument where the end of the payload closes out an
>>>> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
>>>> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
>>>> this case, you want 'snap-to-PDU' but that could just as easily be 'no
>>>> snap at all', just a byte stream and filling an SGL after the protocol
>>>> headers.  
>>>
>>> This 'snap-to-PDU' requirement is something that I don't understand
>>> with the current TCP zero copy. In case of, say, a storage application  
>>
>> current TCP zero-copy does not handle this and it can't AFAIK. I believe
>> it requires hardware level support where an Rx queue is dedicated to a
>> flow / socket and some degree of header and payload splitting (header is
>> consumed by the kernel stack and payload goes to socket owner's memory).
> 
> Yet, Google claims to use the RX ZC in production, and with a CX3 Pro /
> mlx4 NICs.
> 
> Simple workaround that comes to mind is have the headers and payloads
> on separate TCP streams. That doesn't seem too slick.. but neither is
> the 4k MSS, so maybe that's what Google does?
> 
>>> which wants to send some headers (whatever RPC info, block number,
>>> etc.) and then a 4k block of data - how does the RX side get just the
>>> 4k block a into a page so it can zero copy it out to its storage device?
>>>
>>> Per-connection state in the NIC, and FW parsing headers is one way,
>>> but I wonder how this record split problem is best resolved generically.
>>> Perhaps by passing hints in the headers somehow?
>>>
>>> Sorry for the slight off-topic :)
>>>   
>> Hardware has to be parsing the incoming packets to find the usual
>> ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
>> have some kind of ULP processor to know how to parse the TCP byte stream
>> at least well enough to find the PDU header and interpret it to get pdu
>> header length and payload length.
> 
> The big difference between normal headers and L7 headers is that one is
> at ~constant offset, self-contained, and always complete (PDU header
> can be split across segments).
> 
> Edwin Peer did an implementation of TLS ULP for the NFP, it was
> complex. Not to mention it's L7 protocol ossification.
> 

Some programmability in the PDU header parsing part would resolve the
ossification, and AFAICT the interfaces in the kernel do not ossify the
protocols.

> To put it bluntly maybe it's fine for smaller shops but I'm guessing
> it's going to be a hard sell to hyperscalers and people who don't like
> to be locked in to HW.
> 
>> At that point you push the protocol headers (eth/ip/tcp) into one buffer
>> for the kernel stack protocols and put the payload into another. The
>> former would be some page owned by the OS and the latter owned by the
>> process / socket (generically, in this case it is a kernel level
>> socket). In addition, since the payload is spread across multiple
>> packets the hardware has to keep track of TCP sequence number and its
>> current place in the SGL where it is writing the payload to keep the
>> bytes contiguous and detect out-of-order.
>>
>> If the ULP processor knows about PDU headers it knows when enough
>> payload has been found to satisfy that PDU in which case it can tell the
>> cursor to move on to the next SGL element (or separate SGL). That's what
>> I meant by 'snap-to-PDU'.
>>
>> Alternatively, if it is any random application with a byte stream not
>> understood by hardware, the cursor just keeps moving along the SGL
>> elements assigned it for this particular flow.
>>
>> If you have a socket whose payload is getting offloaded to its own queue
>> (which this set is effectively doing), you can create the queue with
>> some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
>> that controls the parsing when you stop writing to one SGL element and
>> move on to the next. Again, assuming hardware support for such attributes.
>>
>> I don't work for Nvidia, so this is all supposition based on what the
>> patches are doing.
> 
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
> 
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(
> 

As stated on another thread here, simply splitting header and data
while also placing the payload at some socket buffer address is zerocopy,
but not data placement. The latter handles PDU reordering. I think it is
unjust to place them all in the same category.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path
  2020-12-10 17:15   ` Shai Malin
@ 2020-12-14  6:38     ` Boris Pismenny
  2020-12-15 13:33       ` Shai Malin
  0 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-14  6:38 UTC (permalink / raw)
  To: Shai Malin, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, yorayz, boris.pismenny, Ben Ben-Ishay, benishay,
	linux-nvme, netdev, Or Gerlitz, ogerlitz, Ariel Elior,
	Michal Kalderon, malin1024


On 10/12/2020 19:15, Shai Malin wrote:
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index c0c33320fe65..ef96e4a02bbd 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -14,6 +14,7 @@
>  #include <linux/blk-mq.h>
>  #include <crypto/hash.h>
>  #include <net/busy_poll.h>
> +#include <net/tcp_ddp.h>
>  
>  #include "nvme.h"
>  #include "fabrics.h"
> @@ -62,6 +63,7 @@ enum nvme_tcp_queue_flags {
>  	NVME_TCP_Q_ALLOCATED	= 0,
>  	NVME_TCP_Q_LIVE		= 1,
>  	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_OFFLOADS     = 3,
>  };
> 
> The same comment from the previous version - we are concerned that perhaps 
> the generic term "offload" for both the transport type (for the Marvell work) 
> and for the DDP and CRC offload queue (for the Mellanox work) may be 
> misleading and confusing to developers and to users.
> 
> As suggested by Sagi, we can call this NVME_TCP_Q_DDP. 
> 

While I don't mind changing the naming here, I wonder why you don't call
the TOE you use TOE rather than TCP_OFFLOAD, which would leave "offload"
free for this?

Moreover, the most common use of offload in the kernel is for partial offloads
like this one, and not for full offloads (such as toe).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-13 18:21             ` Boris Pismenny
@ 2020-12-15  5:19               ` David Ahern
  2020-12-17 19:06                 ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: David Ahern @ 2020-12-15  5:19 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>> as zerocopy for the following reasons:
>>> (1) The former places buffers *exactly* where the user requests
>>> regardless of the order of response arrivals, while the latter places packets
>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>> can be implemented using data placement, but not vice versa.
>>
>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>> starting point where seq N == sgl element 0, position 0. Presumably
>> there is a hardware cursor to track where you are in filling the SGL as
>> packets are processed. You abort on OOO, so it seems like a fairly
>> straightfoward problem.
>>
> 
> We do not abort on OOO. Moreover, we can keep going as long as
> PDU headers are not reordered.

Meaning packets received OOO which contain all or part of a PDU header
are aborted, but pure data packets can arrive out-of-order?

Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
benchmarks?

>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>> allows other use cases to build on this as opposed to the checks you
>> want for a special case. For example, a split at the protocol headers /
>> payload boundaries would be a generic solution where kernel managed
>> protocols get data in one buffer and socket data is put into a given
>> SGL. I am guessing that you have to be already doing this to put PDU
>> payloads into an SGL and other headers into other memory to make a
>> complete packet, so this is not too far off from what you are already doing.
>>
> 
> Splitting at protocol header boundaries and placing data at socket defined
> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
> responses. Here is an example:
> 
> the host submits the following requests:
> +--------+--------+--------+
> | Read 1 | Read 2 | Read 3 |
> +--------+--------+--------+
> 
> the target responds with the following responses:
> +--------+--------+--------+
> | Resp 2 | Resp 3 | Resp 1 |
> +--------+--------+--------+

Does 'Resp N' == 'PDU + data' like this:

 +---------+--------+---------+--------+---------+--------+
 | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
 +---------+--------+---------+--------+---------+--------+

or is it 1 PDU hdr + all of the responses?

> 
> I think that the interface we created (tcp_ddp) is sufficiently generic
> for the task at hand, which is offloading protocols that can re-order
> their responses, a non-trivial task that we claim is important.
> 
> We designed it to support other protocols and not just nvme-tcp,
> which is merely an example. For instance, I think that supporting iSCSI
> would be natural, and that other protocols will fit nicely.

It would be good to add documentation that describes the design, its
assumptions and its limitations. tls has several under
Documentation/networking. e.g., one important limitation to note is that
this design only works for TCP sockets owned by kernel modules.

> 
>> ###
>>
>> A dump of other comments about this patch set:
> 
> Thanks for reviewing! We will fix and resubmit.

Another one I noticed today. You have several typecasts like this:

cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);

since cqe is a member of cqe128, container_of should be used instead of
the magic '- 64'.
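
Something along these lines, assuming the 64-byte CQE is embedded at
offset 64 of the 128-byte one; the member name and layout here are guesses
for illustration, not the actual mlx5 definitions:

#include <linux/kernel.h>	/* container_of() */
#include <linux/mlx5/device.h>	/* struct mlx5_cqe64 */

struct mlx5e_cqe128 {
	u8 rsvd[64];			/* first half of the big CQE */
	struct mlx5_cqe64 cqe64;	/* the regular CQE sits at offset 64 */
};

static inline struct mlx5e_cqe128 *mlx5e_cqe128_from_cqe(struct mlx5_cqe64 *cqe)
{
	return container_of(cqe, struct mlx5e_cqe128, cqe64);
}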

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path
  2020-12-14  6:38     ` Boris Pismenny
@ 2020-12-15 13:33       ` Shai Malin
  2020-12-17 18:51         ` Boris Pismenny
  0 siblings, 1 reply; 53+ messages in thread
From: Shai Malin @ 2020-12-15 13:33 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, yorayz, boris.pismenny, Ben Ben-Ishay, benishay,
	linux-nvme, netdev, Or Gerlitz, ogerlitz, Shai Malin,
	Ariel Elior, Michal Kalderon, Shai Malin

On 12/14/2020 08:38, Boris Pismenny wrote:
> On 10/12/2020 19:15, Shai Malin wrote:
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index c0c33320fe65..ef96e4a02bbd 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/blk-mq.h>
> >  #include <crypto/hash.h>
> >  #include <net/busy_poll.h>
> > +#include <net/tcp_ddp.h>
> >
> >  #include "nvme.h"
> >  #include "fabrics.h"
> > @@ -62,6 +63,7 @@ enum nvme_tcp_queue_flags {
> >       NVME_TCP_Q_ALLOCATED    = 0,
> >       NVME_TCP_Q_LIVE         = 1,
> >       NVME_TCP_Q_POLLING      = 2,
> > +     NVME_TCP_Q_OFFLOADS     = 3,
> >  };
> >
> > The same comment from the previous version - we are concerned that perhaps
> > the generic term "offload" for both the transport type (for the Marvell work)
> > and for the DDP and CRC offload queue (for the Mellanox work) may be
> > misleading and confusing to developers and to users.
> >
> > As suggested by Sagi, we can call this NVME_TCP_Q_DDP.
> >
>
> While I don't mind changing the naming here. I wonder  why not call the
> toe you use TOE and not TCP_OFFLOAD, and then offload is free for this?

Thanks - please do change the name to NVME_TCP_Q_DDP.
The Marvell nvme-tcp-offload patch series introduces the offloading of both
the TCP layer and the NVMe/TCP layer, therefore it's not TOE.

>
> Moreover, the most common use of offload in the kernel is for partial offloads
> like this one, and not for full offloads (such as toe).

Because each vendor might implement a different partial offload, I suggest
naming it after the specific technique that is used, as was suggested -
NVME_TCP_Q_DDP.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule
  2020-12-07 21:06 ` [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
@ 2020-12-15 14:07   ` Shai Malin
  0 siblings, 0 replies; 53+ messages in thread
From: Shai Malin @ 2020-12-15 14:07 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: Yoray Zack, yorayz, boris.pismenny, Ben Ben-Ishay, benishay,
	linux-nvme, netdev, Or Gerlitz, ogerlitz


> crc offload of the nvme capsule. Check if all the skb bits are on, and if not
> recalculate the crc in SW and check it.
> 
> This patch reworks the receive-side crc calculation to always run at the end,
> so as to keep a single flow for both offload and non-offload. This change
> simplifies the code, but it may degrade performance for non-offload crc
> calculation.
> 
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> ---
>  drivers/nvme/host/tcp.c | 111 ++++++++++++++++++++++++++++++++----
> ----
>  1 file changed, 91 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
> 534fd5c00f33..3c10c8876036 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -69,6 +69,7 @@ enum nvme_tcp_queue_flags {
>  	NVME_TCP_Q_LIVE		= 1,
>  	NVME_TCP_Q_POLLING	= 2,
>  	NVME_TCP_Q_OFFLOADS     = 3,
> +	NVME_TCP_Q_OFF_CRC_RX   = 4,

Because only the data digest is offloaded, and not the header digest,
I suggest replacing the term NVME_TCP_Q_OFF_CRC_RX with
NVME_TCP_Q_OFF_DDGST_RX to avoid confusion.

>  };
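
With the renames suggested in this thread applied, the queue flags would
read roughly like this (a sketch of the suggestion only, not a patch):

enum nvme_tcp_queue_flags {
	NVME_TCP_Q_ALLOCATED	= 0,
	NVME_TCP_Q_LIVE		= 1,
	NVME_TCP_Q_POLLING	= 2,
	NVME_TCP_Q_DDP		= 3,	/* was NVME_TCP_Q_OFFLOADS */
	NVME_TCP_Q_OFF_DDGST_RX	= 4,	/* was NVME_TCP_Q_OFF_CRC_RX */
};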


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path
  2020-12-15 13:33       ` Shai Malin
@ 2020-12-17 18:51         ` Boris Pismenny
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2020-12-17 18:51 UTC (permalink / raw)
  To: Shai Malin, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, yorayz, boris.pismenny, Ben Ben-Ishay, benishay,
	linux-nvme, netdev, Or Gerlitz, ogerlitz, Shai Malin,
	Ariel Elior, Michal Kalderon



On 15/12/2020 15:33, Shai Malin wrote:
> On 12/14/2020 08:38, Boris Pismenny wrote:
>> On 10/12/2020 19:15, Shai Malin wrote:
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index c0c33320fe65..ef96e4a02bbd 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -14,6 +14,7 @@
>>>  #include <linux/blk-mq.h>
>>>  #include <crypto/hash.h>
>>>  #include <net/busy_poll.h>
>>> +#include <net/tcp_ddp.h>
>>>
>>>  #include "nvme.h"
>>>  #include "fabrics.h"
>>> @@ -62,6 +63,7 @@ enum nvme_tcp_queue_flags {
>>>       NVME_TCP_Q_ALLOCATED    = 0,
>>>       NVME_TCP_Q_LIVE         = 1,
>>>       NVME_TCP_Q_POLLING      = 2,
>>> +     NVME_TCP_Q_OFFLOADS     = 3,
>>>  };
>>>
>>> The same comment from the previous version - we are concerned that perhaps
>>> the generic term "offload" for both the transport type (for the Marvell work)
>>> and for the DDP and CRC offload queue (for the Mellanox work) may be
>>> misleading and confusing to developers and to users.
>>>
>>> As suggested by Sagi, we can call this NVME_TCP_Q_DDP.
>>>
>>
>> While I don't mind changing the naming here. I wonder  why not call the
>> toe you use TOE and not TCP_OFFLOAD, and then offload is free for this?
> 
> Thanks - please do change the name to NVME_TCP_Q_DDP.
> The Marvell nvme-tcp-offload patch series introducing the offloading of both the
> TCP as well as the NVMe/TCP layer, therefore it's not TOE.
> 

Will do.

>>
>> Moreover, the most common use of offload in the kernel is for partial offloads
>> like this one, and not for full offloads (such as toe).
> 
> Because each vendor might implement a different partial offload I
> suggest naming it
> with the specific technique which is used, as was suggested - NVME_TCP_Q_DDP.
> 


IIUC, if TCP/IP is offloaded entirely then it is called TOE. It doesn't
matter that you offload additional stuff (nvme-tcp) on top of it.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-15  5:19               ` David Ahern
@ 2020-12-17 19:06                 ` Boris Pismenny
  2020-12-18  0:44                   ` David Ahern
  0 siblings, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2020-12-17 19:06 UTC (permalink / raw)
  To: David Ahern, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny



On 15/12/2020 7:19, David Ahern wrote:
> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>> as zerocopy for the following reasons:
>>>> (1) The former places buffers *exactly* where the user requests
>>>> regardless of the order of response arrivals, while the latter places packets
>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>> can be implemented using data placement, but not vice versa.
>>>
>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>> starting point where seq N == sgl element 0, position 0. Presumably
>>> there is a hardware cursor to track where you are in filling the SGL as
>>> packets are processed. You abort on OOO, so it seems like a fairly
>>> straightfoward problem.
>>>
>>
>> We do not abort on OOO. Moreover, we can keep going as long as
>> PDU headers are not reordered.
> 
> Meaning packets received OOO which contain all or part of a PDU header
> are aborted, but pure data packets can arrive out-of-order?
> 
> Did you measure the affect of OOO packets? e.g., randomly drop 1 in 1000
> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
> benchmarks?
> 

Yes for TLS, where similar ideas are used, but not yet for NVMe-TCP.
In the worst case we measured (5% OOO), we got the same performance
as pure software TLS under these conditions. We will strive to have the
same for nvme-tcp. We will be able to test this on nvme-tcp only when we
have hardware; for now, we are using a mix of emulation and simulation to
test and benchmark.

>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>> allows other use cases to build on this as opposed to the checks you
>>> want for a special case. For example, a split at the protocol headers /
>>> payload boundaries would be a generic solution where kernel managed
>>> protocols get data in one buffer and socket data is put into a given
>>> SGL. I am guessing that you have to be already doing this to put PDU
>>> payloads into an SGL and other headers into other memory to make a
>>> complete packet, so this is not too far off from what you are already doing.
>>>
>>
>> Splitting at protocol header boundaries and placing data at socket defined
>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>> responses. Here is an example:
>>
>> the host submits the following requests:
>> +--------+--------+--------+
>> | Read 1 | Read 2 | Read 3 |
>> +--------+--------+--------+
>>
>> the target responds with the following responses:
>> +--------+--------+--------+
>> | Resp 2 | Resp 3 | Resp 1 |
>> +--------+--------+--------+
> 
> Does 'Resp N' == 'PDU + data' like this:
> 
>  +---------+--------+---------+--------+---------+--------+
>  | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
>  +---------+--------+---------+--------+---------+--------+
> 
> or is it 1 PDU hdr + all of the responses?
> 

Yes, 'RespN = PDU header + PDU data' segmented by TCP whichever way it
chooses to do so. The PDU header's command_id field correlates between
the request and the response. We use that correlation in hardware to
identify the buffers where data needs to be scattered. In other words,
hardware holds a map between each command_id and the corresponding block layer buffer SGL.
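
To make that correlation concrete, here is a minimal sketch of the kind of
per-queue table involved; the structure and function names are hypothetical
and do not reflect the actual mlx5e data structures in this series.

  #include <linux/errno.h>
  #include <linux/scatterlist.h>
  #include <linux/types.h>

  /* Hypothetical per-queue map from NVMe command_id (ccid) to the block
   * layer SGL that the HW should scatter the PDU payload into.
   */
  struct ddp_queue_entry {
  	struct scatterlist	*sgl;	/* destination block layer buffers */
  	int			nents;
  };

  struct ddp_queue {
  	struct ddp_queue_entry	*ccid_table;	/* indexed by command_id */
  	u16			max_ccid;
  };

  /* Registered on the setup path, before the read request is sent, so the
   * HW can later place 'Resp N' by looking up its command_id, regardless
   * of the order in which responses arrive.
   */
  static int ddp_map_command(struct ddp_queue *q, u16 ccid,
  			   struct scatterlist *sgl, int nents)
  {
  	if (ccid >= q->max_ccid)
  		return -EINVAL;

  	q->ccid_table[ccid].sgl = sgl;
  	q->ccid_table[ccid].nents = nents;
  	return 0;
  }

On the receive side the CQE carries the ccid (see the mlx5e_cqe128 fields
quoted later in this thread), so the lookup is a simple array index.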

>>
>> I think that the interface we created (tcp_ddp) is sufficiently generic
>> for the task at hand, which is offloading protocols that can re-order
>> their responses, a non-trivial task that we claim is important.
>>
>> We designed it to support other protocols and not just nvme-tcp,
>> which is merely an example. For instance, I think that supporting iSCSI
>> would be natural, and that other protocols will fit nicely.
> 
> It would be good to add documentation that describes the design, its
> assumptions and its limitations. tls has several under
> Documentation/networking. e.g., one important limitation to note is that
> this design only works for TCP sockets owned by kernel modules.
> 

You are right, I'll add such documentation for tcp_ddp. It indeed works only
for kernel TCP sockets, but maybe future work will extend it.

>>
>>> ###
>>>
>>> A dump of other comments about this patch set:
>>
>> Thanks for reviewing! We will fix and resubmit.
> 
> Another one I noticed today. You have several typecasts like this:
> 
> cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
> 
> since cqe is a member of cqe128, container_of should be used instead of
> the magic '- 64'.
> 

Will fix




* Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload
  2020-12-17 19:06                 ` Boris Pismenny
@ 2020-12-18  0:44                   ` David Ahern
  0 siblings, 0 replies; 53+ messages in thread
From: David Ahern @ 2020-12-18  0:44 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, sagi,
	axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack, Boris Pismenny

On 12/17/20 12:06 PM, Boris Pismenny wrote:
> 
> 
> On 15/12/2020 7:19, David Ahern wrote:
>> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>>> as zerocopy for the following reasons:
>>>>> (1) The former places buffers *exactly* where the user requests
>>>>> regardless of the order of response arrivals, while the latter places packets
>>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>>> can be implemented using data placement, but not vice versa.
>>>>
>>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>>> starting point where seq N == sgl element 0, position 0. Presumably
>>>> there is a hardware cursor to track where you are in filling the SGL as
>>>> packets are processed. You abort on OOO, so it seems like a fairly
>>>> straightforward problem.
>>>>
>>>
>>> We do not abort on OOO. Moreover, we can keep going as long as
>>> PDU headers are not reordered.
>>
>> Meaning packets received OOO which contain all or part of a PDU header
>> are aborted, but pure data packets can arrive out-of-order?
>>
>> Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
>> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
>> benchmarks?
>>
> 
> Yes for TLS, where similar ideas are used, but not for NVMe-TCP yet.
> In the worst case we measured (5% OOO), we got the same performance
> as pure software TLS under these conditions. We will strive for the
> same with nvme-tcp. We will be able to test this on nvme-tcp only once we
> have hardware. For now, we are using a mix of emulation and simulation to
> test and benchmark.

Interesting. So you don't have hardware today for this feature.

> 
>>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>>> allows other use cases to build on this as opposed to the checks you
>>>> want for a special case. For example, a split at the protocol headers /
>>>> payload boundaries would be a generic solution where kernel managed
>>>> protocols get data in one buffer and socket data is put into a given
>>>> SGL. I am guessing that you have to be already doing this to put PDU
>>>> payloads into an SGL and other headers into other memory to make a
>>>> complete packet, so this is not too far off from what you are already doing.
>>>>
>>>
>>> Splitting at protocol header boundaries and placing data at socket defined
>>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>>> responses. Here is an example:
>>>
>>> the host submits the following requests:
>>> +--------+--------+--------+
>>> | Read 1 | Read 2 | Read 3 |
>>> +--------+--------+--------+
>>>
>>> the target responds with the following responses:
>>> +--------+--------+--------+
>>> | Resp 2 | Resp 3 | Resp 1 |
>>> +--------+--------+--------+
>>
>> Does 'Resp N' == 'PDU + data' like this:
>>
>>  +---------+--------+---------+--------+---------+--------+
>>  | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
>>  +---------+--------+---------+--------+---------+--------+
>>
>> or is it 1 PDU hdr + all of the responses?
>>
> 
> Yes, 'RespN = PDU header + PDU data' segmented by TCP whichever way it
> chooses to do so. The PDU header's command_id field correlates between

I thought so and verified in the NVMe over Fabrics spec. Not sure why
you brought up order of responses; that should be irrelevant to the design.

> the request and the response. We use that correlation in hardware to
> identify the buffers where data needs to be scattered. In other words,
> hardware holds a map between each command_id and the corresponding block layer buffer SGL.

Understood. Your design is forcing a command id to sgl correlation vs
handing h/w generic or shaped SGLs for writing the socket data.

> 
>>>
>>> I think that the interface we created (tcp_ddp) is sufficiently generic
>>> for the task at hand, which is offloading protocols that can re-order
>>> their responses, a non-trivial task that we claim is important.
>>>
>>> We designed it to support other protocols and not just nvme-tcp,
>>> which is merely an example. For instance, I think that supporting iSCSI
>>> would be natural, and that other protocols will fit nicely.
>>
>> It would be good to add documentation that describes the design, its
>> assumptions and its limitations. tls has several under
>> Documentation/networking. e.g., one important limitation to note is that
>> this design only works for TCP sockets owned by kernel modules.
>>
> 
> You are right, I'll add such documentation for tcp_ddp. It indeed works only
> for kernel TCP sockets, but maybe future work will extend it.

Little to nothing about this design can work for userspace sockets or
setups like Jonathan's netgpu. e.g., the memory address compares used to
decide whether to copy the data will not work with userspace buffers,
which is why I suggested something more generic like marking the skb as
"h/w offload"; but that does not work for you since your skbs have a
mixed bag of fragments.
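
For reference, a minimal sketch of the address-compare idea mentioned above;
the helper name is hypothetical and this is not the exact iov_iter change
from the series.

  #include <linux/highmem.h>
  #include <linux/string.h>

  /* Hypothetical helper: if the skb fragment already points into the
   * destination page (because the NIC placed the payload there), source
   * and destination coincide and the copy becomes a no-op. The compare
   * is only meaningful for kernel-mapped buffers, which is why the scheme
   * does not carry over to userspace sockets.
   */
  static void ddp_copy_to_page(struct page *page, size_t offset,
  			     const char *from, size_t len)
  {
  	char *to = kmap_atomic(page);

  	if (to + offset != from)
  		memcpy(to + offset, from, len);
  	kunmap_atomic(to);
  }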



* Re: [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload
  2020-12-07 21:06 ` [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
@ 2020-12-18  0:57   ` David Ahern
  0 siblings, 0 replies; 53+ messages in thread
From: David Ahern @ 2020-12-18  0:57 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, sagi, axboe, kbusch,
	viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev, benishay, ogerlitz, yorayz,
	Ben Ben-Ishay, Or Gerlitz, Yoray Zack

On 12/7/20 2:06 PM, Boris Pismenny wrote:
> +struct sk_buff*
> +mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
> +			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
> +			     bool linear)
> +{
> +	int ccoff, cclen, hlen, ccid, remaining, fragsz, to_copy = 0;
> +	struct mlx5e_priv *priv = netdev_priv(netdev);
> +	skb_frag_t org_frags[MAX_SKB_FRAGS];
> +	struct mlx5e_nvmeotcp_queue *queue;
> +	struct nvmeotcp_queue_entry *nqe;
> +	int org_nr_frags, frag_index;
> +	struct mlx5e_cqe128 *cqe128;
> +	u32 queue_id;
> +
> +	queue_id = (be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK);
> +	queue = mlx5e_nvmeotcp_get_queue(priv->nvmeotcp, queue_id);
> +	if (unlikely(!queue)) {
> +		dev_kfree_skb_any(skb);
> +		return NULL;
> +	}
> +
> +	cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
> +	if (cqe_is_nvmeotcp_resync(cqe)) {
> +		nvmeotcp_update_resync(queue, cqe128);
> +		mlx5e_nvmeotcp_put_queue(queue);
> +		return skb;
> +	}
> +
> +	/* cc ddp from cqe */
> +	ccid = be16_to_cpu(cqe128->ccid);
> +	ccoff = be32_to_cpu(cqe128->ccoff);
> +	cclen = be16_to_cpu(cqe128->cclen);
> +	hlen  = be16_to_cpu(cqe128->hlen);
> +
> +	/* carve a hole in the skb for DDP data */
> +	if (linear) {
> +		skb_trim(skb, hlen);
> +	} else {
> +		org_nr_frags = skb_shinfo(skb)->nr_frags;
> +		mlx5_nvmeotcp_trim_nonlinear(skb, org_frags, &frag_index,
> +					     cclen);
> +	}

mlx5e_skb_from_cqe_mpwrq_linear and mlx5e_skb_from_cqe_mpwrq_nolinear
create an skb and then this function comes behind it, strips any frags
originally added to the skb, ...


> +
> +	nqe = &queue->ccid_table[ccid];
> +
> +	/* packet starts new ccid? */
> +	if (queue->ccid != ccid || queue->ccid_gen != nqe->ccid_gen) {
> +		queue->ccid = ccid;
> +		queue->ccoff = 0;
> +		queue->ccoff_inner = 0;
> +		queue->ccsglidx = 0;
> +		queue->ccid_gen = nqe->ccid_gen;
> +	}
> +
> +	/* skip inside cc until the ccoff in the cqe */
> +	while (queue->ccoff + queue->ccoff_inner < ccoff) {
> +		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;
> +		fragsz = min_t(off_t, remaining,
> +			       ccoff - (queue->ccoff + queue->ccoff_inner));
> +
> +		if (fragsz == remaining)
> +			mlx5e_nvmeotcp_advance_sgl_iter(queue);
> +		else
> +			queue->ccoff_inner += fragsz;
> +	}
> +
> +	/* adjust the skb according to the cqe cc */
> +	while (to_copy < cclen) {
> +		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
> +			dev_kfree_skb_any(skb);
> +			mlx5e_nvmeotcp_put_queue(queue);
> +			return NULL;
> +		}
> +
> +		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;
> +		fragsz = min_t(int, remaining, cclen - to_copy);
> +
> +		mlx5e_nvmeotcp_add_skb_frag(netdev, skb, queue, nqe, fragsz);
> +		to_copy += fragsz;
> +		if (fragsz == remaining)
> +			mlx5e_nvmeotcp_advance_sgl_iter(queue);
> +		else
> +			queue->ccoff_inner += fragsz;
> +	}

... adds the frags for the sgls, ...

> +
> +	if (cqe_bcnt > hlen + cclen) {
> +		remaining = cqe_bcnt - hlen - cclen;
> +		if (linear)
> +			skb = mlx5_nvmeotcp_add_tail(queue, skb,
> +						     offset_in_page(skb->data) +
> +								hlen + cclen,
> +						     remaining);
> +		else
> +			skb = mlx5_nvmeotcp_add_tail_nonlinear(queue, skb,
> +							       org_frags,
> +							       org_nr_frags,
> +							       frag_index);

... and then re-adds the original frags.

Why is this needed? Why can't the skb be created with all of the frags
in proper order?

It seems like this dance is not needed if you had generic header/payload
splits with the payload written to less restrictive SGLs.


* Re: [PATCH v1 net-next 00/15] nvme-tcp receive offloads
  2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
                   ` (14 preceding siblings ...)
  2020-12-07 21:06 ` [PATCH v1 net-next 15/15] net/mlx5e: NVMEoTCP workaround CRC after resync Boris Pismenny
@ 2021-01-14  1:27 ` Sagi Grimberg
  2021-01-14  4:47   ` David Ahern
  2021-01-14 19:17   ` Boris Pismenny
  15 siblings, 2 replies; 53+ messages in thread
From: Sagi Grimberg @ 2021-01-14  1:27 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: yorayz, boris.pismenny, benishay, linux-nvme, netdev, ogerlitz

Hey Boris, sorry for some delays on my end...

I saw some long discussions on this set with David, what is
the status here?

I'll look some more into the patches, but if you
addressed the feedback from the last iteration I don't
expect major issues with this patch set (at least from
the nvme-tcp side).

> Changes since RFC v1:
> =========================================
> * Split mlx5 driver patches to several commits
> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload
>    init/teardown to the start/stop functions.

I'm assuming that you tested controller resets and network hiccups
during traffic right?


* Re: [PATCH v1 net-next 00/15] nvme-tcp receive offloads
  2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
@ 2021-01-14  4:47   ` David Ahern
  2021-01-14 19:21     ` Boris Pismenny
  2021-01-14 19:17   ` Boris Pismenny
  1 sibling, 1 reply; 53+ messages in thread
From: David Ahern @ 2021-01-14  4:47 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: yorayz, boris.pismenny, benishay, linux-nvme, netdev, ogerlitz

On 1/13/21 6:27 PM, Sagi Grimberg wrote:
>> Changes since RFC v1:
>> =========================================
>> * Split mlx5 driver patches to several commits
>> * Fix nvme-tcp handling of recovery flows. In particular, move queue
>> offload
>>    init/teardown to the start/stop functions.
> 
> I'm assuming that you tested controller resets and network hiccups
> during traffic right?

I had questions on this part as well -- e.g., what happens on a TCP
retry? Packets arrive and the SGL is filled for the command id, but a packet is
dropped in the stack (e.g., the enqueue backlog is full, so the packet gets
dropped).


* Re: [PATCH v1 net-next 00/15] nvme-tcp receive offloads
  2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
  2021-01-14  4:47   ` David Ahern
@ 2021-01-14 19:17   ` Boris Pismenny
  2021-01-14 21:07     ` Sagi Grimberg
  1 sibling, 1 reply; 53+ messages in thread
From: Boris Pismenny @ 2021-01-14 19:17 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: yorayz, boris.pismenny, benishay, linux-nvme, netdev, ogerlitz

On 14/01/2021 3:27, Sagi Grimberg wrote:
> Hey Boris, sorry for some delays on my end...
> 
> I saw some long discussions on this set with David, what is
> the status here?
> 

The main purpose of this series is to address these.

> I'll look some more into the patches, but if you
> addressed the feedback from the last iteration I don't
> expect major issues with this patch set (at least from
> the nvme-tcp side).
> 
>> Changes since RFC v1:
>> =========================================
>> * Split mlx5 driver patches to several commits
>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload
>>    init/teardown to the start/stop functions.
> 
> I'm assuming that you tested controller resets and network hiccups
> during traffic right?
> 

Network hiccups were tested through netem packet drops and reordering.
We tested error recovery by taking the controller down and bringing it
back up, both while the system is quiescent and during traffic.

If you have another test in mind, please let me know.


* Re: [PATCH v1 net-next 00/15] nvme-tcp receive offloads
  2021-01-14  4:47   ` David Ahern
@ 2021-01-14 19:21     ` Boris Pismenny
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Pismenny @ 2021-01-14 19:21 UTC (permalink / raw)
  To: David Ahern, Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm,
	hch, axboe, kbusch, viro, edumazet
  Cc: yorayz, boris.pismenny, benishay, linux-nvme, netdev, ogerlitz



On 14/01/2021 6:47, David Ahern wrote:
> On 1/13/21 6:27 PM, Sagi Grimberg wrote:
>>> Changes since RFC v1:
>>> =========================================
>>> * Split mlx5 driver patches to several commits
>>> * Fix nvme-tcp handling of recovery flows. In particular, move queue
>>> offload
>>>    init/teardown to the start/stop functions.
>>
>> I'm assuming that you tested controller resets and network hiccups
>> during traffic right?
> 
> I had questions on this part as well -- e.g., what happens on a TCP
> retry? Packets arrive and the SGL is filled for the command id, but a packet is
> dropped in the stack (e.g., the enqueue backlog is full, so the packet gets
> dropped).
> 

On retransmission, the HW context's expected TCP sequence number doesn't
match. As a result, the received packet is un-offloaded and software
will do the copy/CRC for its data.

As a general rule, if the HW context's expected sequence number doesn't
match, then there's no offload.
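
A minimal sketch of that rule; the names are hypothetical and the real check
lives in the NIC HW context and the driver's resync handling.

  #include <linux/types.h>

  /* Hypothetical sketch: the HW context tracks the next expected TCP
   * sequence number. A segment is eligible for DDP/CRC offload only while
   * it matches; otherwise (e.g. a retransmission) the segment is handled
   * as a plain skb and software performs the copy and CRC for its data.
   */
  struct ddp_hw_ctx {
  	u32	expected_seq;	/* next TCP sequence the context expects */
  };

  static bool ddp_segment_offloaded(struct ddp_hw_ctx *ctx,
  				  u32 seg_seq, u32 seg_len)
  {
  	if (seg_seq != ctx->expected_seq)
  		return false;	/* fall back to the software copy/CRC path */

  	ctx->expected_seq += seg_len;	/* stay in sync for the next segment */
  	return true;
  }

Once a mismatch occurs, the driver's resync handling (see
cqe_is_nvmeotcp_resync in the data-path patch quoted earlier in this thread)
is what brings the context back in sync so offload can resume.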


* Re: [PATCH v1 net-next 00/15] nvme-tcp receive offloads
  2021-01-14 19:17   ` Boris Pismenny
@ 2021-01-14 21:07     ` Sagi Grimberg
  0 siblings, 0 replies; 53+ messages in thread
From: Sagi Grimberg @ 2021-01-14 21:07 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: yorayz, boris.pismenny, benishay, linux-nvme, netdev, ogerlitz


>> Hey Boris, sorry for some delays on my end...
>>
>> I saw some long discussions on this set with David, what is
>> the status here?
>>
> 
> The main purpose of this series is to address these.
> 
>> I'll look some more into the patches, but if you
>> addressed the feedback from the last iteration I don't
>> expect major issues with this patch set (at least from
>> the nvme-tcp side).
>>
>>> Changes since RFC v1:
>>> =========================================
>>> * Split mlx5 driver patches to several commits
>>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload
>>>     init/teardown to the start/stop functions.
>>
>> I'm assuming that you tested controller resets and network hiccups
>> during traffic right?
>>
> 
> Network hiccups were tested through netem packet drops and reordering.
> We tested error recovery by taking the controller down and bringing it
> back up, both while the system is quiescent and during traffic.
> 
> If you have another test in mind, please let me know.

I suggest also performing interface down/up during traffic, both
on the host and on the targets.

Other than that we should be in decent shape...


end of thread

Thread overview: 53+ messages
2020-12-07 21:06 [PATCH v1 net-next 00/15] nvme-tcp receive offloads Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 01/15] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
2020-12-08  0:39   ` David Ahern
2020-12-08 14:30     ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload Boris Pismenny
2020-12-08  0:42   ` David Ahern
2020-12-08 14:36     ` Boris Pismenny
2020-12-09  0:38       ` David Ahern
2020-12-09  8:15         ` Boris Pismenny
2020-12-10  4:26           ` David Ahern
2020-12-11  2:01             ` Jakub Kicinski
2020-12-11  2:43               ` David Ahern
2020-12-11 18:45                 ` Jakub Kicinski
2020-12-11 18:58                   ` Eric Dumazet
2020-12-11 19:59                   ` David Ahern
2020-12-11 23:05                     ` Jonathan Lemon
2020-12-13 18:34                   ` Boris Pismenny
2020-12-13 18:21             ` Boris Pismenny
2020-12-15  5:19               ` David Ahern
2020-12-17 19:06                 ` Boris Pismenny
2020-12-18  0:44                   ` David Ahern
2020-12-09  0:57   ` David Ahern
2020-12-09  1:11     ` David Ahern
2020-12-09  8:28       ` Boris Pismenny
2020-12-09  8:25     ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 03/15] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 04/15] net/tls: expose get_netdev_for_sock Boris Pismenny
2020-12-09  1:06   ` David Ahern
2020-12-09  7:41     ` Boris Pismenny
2020-12-10  3:39       ` David Ahern
2020-12-11 18:43         ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 05/15] nvme-tcp: Add DDP offload control path Boris Pismenny
2020-12-10 17:15   ` Shai Malin
2020-12-14  6:38     ` Boris Pismenny
2020-12-15 13:33       ` Shai Malin
2020-12-17 18:51         ` Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 06/15] nvme-tcp: Add DDP data-path Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 07/15] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
2020-12-15 14:07   ` Shai Malin
2020-12-07 21:06 ` [PATCH v1 net-next 08/15] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 09/15] net/mlx5: Header file changes for nvme-tcp offload Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 10/15] net/mlx5: Add 128B CQE for NVMEoTCP offload Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 11/15] net/mlx5e: TCP flow steering for nvme-tcp Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 12/15] net/mlx5e: NVMEoTCP DDP offload control path Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 13/15] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
2020-12-18  0:57   ` David Ahern
2020-12-07 21:06 ` [PATCH v1 net-next 14/15] net/mlx5e: NVMEoTCP statistics Boris Pismenny
2020-12-07 21:06 ` [PATCH v1 net-next 15/15] net/mlx5e: NVMEoTCP workaround CRC after resync Boris Pismenny
2021-01-14  1:27 ` [PATCH v1 net-next 00/15] nvme-tcp receive offloads Sagi Grimberg
2021-01-14  4:47   ` David Ahern
2021-01-14 19:21     ` Boris Pismenny
2021-01-14 19:17   ` Boris Pismenny
2021-01-14 21:07     ` Sagi Grimberg
