Linux-NVME Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads
@ 2020-09-30 16:20 Boris Pismenny
  2020-09-30 16:20 ` [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
                   ` (10 more replies)
  0 siblings, 11 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev

This series adds support for nvme-tcp receive offloads
which do not mandate the offload of the network stack to the device.
Instead, these work together with TCP to offload:
1. copy from SKB to the block layer buffers
2. CRC verification for received PDU

The series implements these as a generic offload infrastructure for storage
protocols, which we call TCP Direct Data Placement (TCP_DDP) and TCP DDP CRC,
respectively. We use this infrastructure to implement NVMe-TCP offload for copy
and CRC. Future implementations can reuse the same infrastructure for other
protcols such as iSCSI.

Note:
These offloads are similar in nature to the packet-based NIC TLS offloads,
which are already upstream (see net/tls/tls_device.c).
You can read more about TLS offload here:
https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Initialization and teardown:
=========================================
The offload for IO queues is initialized after the handshake of the
NVMe-TCP protocol is finished by calling `nvme_tcp_offload_socket`
with the tcp socket of the nvme_tcp_queue:
This operation sets all relevant hardware contexts in
hardware. If it fails, then the IO queue proceeds as usually with no offload.
If it succeeds then `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may be
called to perform copy offload, and crc offload will be used.
This initialization does not change the normal operation of nvme-tcp in any
way besides adding the option to call the above mentioned NDO operations.

For the admin queue, nvme-tcp does not initialize the offload.
Instead, nvme-tcp calls the driver to configure limits for the controller,
such as max_hw_sectors and max_segments; these must be limited to accomodate
potential HW resource limits, and to improve performance.

If some error occured, and the IO queue must be closed or reconnected, then
offload is teardown and initialized again. Additionally, we handle netdev
down events via the existing error recovery flow.

Copy offload works as follows:
=========================================
The nvme-tcp layer calls the NIC drive to map block layer buffers to ccid using
`nvme_tcp_setup_ddp` before sending the read request. When the repsonse is
received, then the NIC HW will write the PDU payload directly into the
designated buffer, and build an SKB such that it points into the destination
buffer; this SKB represents the entire packet received on the wire, but it
points to the block layer buffers. Once nvme-tcp attempts to copy data from
this SKB to the block layer buffer it can skip the copy by checking in the
copying function (memcpy_to_page):
if (src == dst) -> skip copy
Finally, when the PDU has been processed to completion, the nvme-tcp layer
releases the NIC HW context be calling `nvme_tcp_teardown_ddp` which
asynchronously unmaps the buffers from NIC HW.

As the last change is to a sensative function, we are careful to place it under
static_key which is only enabled when this functionality is actually used for
nvme-tcp copy offload.

Asynchronous completion:
=========================================
The NIC must release its mapping between command IDs and the target buffers.
This mapping is released when NVMe-TCP calls the NIC
driver (`nvme_tcp_offload_socket`).
As completing IOs is performance criticial, we introduce asynchronous
completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC, which will later
call NVMe-TCP to complete the IO (`nvme_tcp_ddp_teardown_done`).

An alternative approach is to move all the functions related to coping from
SKBs to the block layer buffers inside the nvme-tcp code - about 200 LOC.

CRC offload works as follows:
=========================================
After offload is initialized, we use the SKB's ddp_crc bit to indicate that:
"there was no problem with the verification of all CRC fields in this packet's
payload". The bit is set to zero if there was an error, or if HW skipped
offload for some reason. If *any* SKB in a PDU has (ddp_crc != 1), then software
must compute the CRC, and check it. We perform this check, and
accompanying software fallback at the end of the processing of a received PDU.

SKB changes:
=========================================
The CRC offload requires an additional bit in the SKB, which is useful for
preventing the coalescing of SKB with different crc offload values. This bit
is similar in concept to the "decrypted" bit. 

Performance:
=========================================
The expected performance gain from this offload varies with the block size.
We perform a CPU cycles breakdown of the copy/CRC operations in nvme-tcp
fio random read workloads:
For 4K blocks we see up to 11% improvement for a 100% read fio workload,
while for 128K blocks we see upto 52%. If we run nvme-tcp, and skip these
operations, then we observe a gain of about 1.1x and 2x respectively.

Resynchronization:
=========================================
The resynchronization flow is performed to reset the hardware tracking of
NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
the driver, regarding a possible location of a PDU header. Followed by
a response from the nvme-tcp driver.

This flow is rare, and it should happen only after packet loss or
reordering events that involve nvme-tcp PDU headers.

The patches are organized as follows:
=========================================
Patch 1         the iov_iter change to skip copy if (src == dst)
Patches 2-3     the infrastructure for all TCP DDP
                and TCP DDP CRC offloads, respectively.
Patch 4         exposes the get_netdev_for_sock function from TLS
Patch 5         NVMe-TCP changes to call NIC driver on queue init/teardown
Patches 6       NVMe-TCP changes to call NIC driver on IO operation
                setup/teardown, and support async completions.
Patches 7       NVMe-TCP changes to support CRC offload on receive.
                Also, this patch moves CRC calculation to the end of PDU
                in case offload requires software fallback.
Patches 8       NVMe-TCP handling of netdev events: stop the offload if
                netdev is going down
Patches 9-10    implement support for NVMe-TCP copy and CRC offload in
                the mlx5 NIC driver

Testing:
=========================================
This series was tested using fio with various configurations of IO sizes,
depths, MTUs, and with both the SPDK and kernel NVMe-TCP targets.

Future work:
=========================================
A follow-up series will introduce support for transmit side CRC. Then,
we will work on adding support for TLS in NVMe-TCP and combining the
two offloads.


Boris Pismenny (8):
  iov_iter: Skip copy in memcpy_to_page if src==dst
  net: Introduce direct data placement tcp offload
  net: Introduce crc offload for tcp ddp ulp
  net/tls: expose get_netdev_for_sock
  nvme-tcp: Add DDP offload control path
  nvme-tcp: Add DDP data-path
  net/mlx5e: Add NVMEoTCP offload
  net/mlx5e: NVMEoTCP, data-path for DDP offload

Or Gerlitz (1):
  nvme-tcp: Deal with netdevice DOWN events

Yoray Zack (1):
  nvme-tcp : Recalculate crc in the end of the capsule

 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  12 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |   4 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  13 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   1 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |   9 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |  10 +
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 894 ++++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    | 116 +++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        | 256 +++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |  25 +
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |  79 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  73 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  76 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  38 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  24 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +
 drivers/nvme/host/tcp.c                       | 410 +++++++-
 include/linux/mlx5/device.h                   |  38 +-
 include/linux/mlx5/mlx5_ifc.h                 | 121 ++-
 include/linux/mlx5/qp.h                       |   1 +
 include/linux/netdev_features.h               |   4 +
 include/linux/netdevice.h                     |   5 +
 include/linux/nvme-tcp.h                      |   2 +
 include/linux/skbuff.h                        |   4 +
 include/linux/uio.h                           |   2 +
 include/net/inet_connection_sock.h            |   4 +
 include/net/sock.h                            |  17 +
 include/net/tcp_ddp.h                         |  90 ++
 lib/iov_iter.c                                |  11 +-
 net/Kconfig                                   |  17 +
 net/core/skbuff.c                             |   9 +-
 net/ethtool/common.c                          |   2 +
 net/ipv4/tcp_input.c                          |   7 +
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/tcp_offload.c                        |   3 +
 net/tls/tls_device.c                          |  20 +-
 40 files changed, 2375 insertions(+), 51 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
 create mode 100644 include/net/tcp_ddp.h


base-commit: 300fd579b2e8608586b002207e906ac95c74b911
prerequisite-patch-id: b3079fa1f4c0e3d6d4a92f59e70981e8a28f3b83
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 23:05   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload Boris Pismenny
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

When using direct data placement the NIC writes some of the payload
directly to the destination buffer, and constructs the SKB such that it
points to this data. As a result, the skb_copy datagram_iter call will
attempt to copy data when it is not necessary.

This patch adds a check to avoid this copy, and a static_key to enabled
it when TCP direct data placement is possible.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 11 ++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 3835a8a8e9ea..036df4073737 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -284,4 +284,6 @@ int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
 			    int (*f)(struct kvec *vec, void *context),
 			    void *context);
 
+extern struct static_key_false skip_copy_enabled;
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5e40786c8f12..27856c3db28b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -13,6 +13,9 @@
 
 #define PIPE_PARANOIA /* for now */
 
+DEFINE_STATIC_KEY_FALSE(skip_copy_enabled);
+EXPORT_SYMBOL_GPL(skip_copy_enabled);
+
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
 	size_t wanted = n;				\
@@ -470,7 +473,13 @@ static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t
 static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
 {
 	char *to = kmap_atomic(page);
-	memcpy(to + offset, from, len);
+
+	if (static_branch_unlikely(&skip_copy_enabled)) {
+		if (to + offset != from)
+			memcpy(to + offset, from, len);
+	} else {
+		memcpy(to + offset, from, len);
+	}
 	kunmap_atomic(to);
 }
 
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
  2020-09-30 16:20 ` [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 21:47   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

This commit introduces direct data placement offload for TCP.
This capability is accompanied by new net_device operations that
configure
hardware contexts. There is a context per socket, and a context per DDP
opreation. Additionally, a resynchronization routine is used to assist
hardware handle TCP OOO, and continue the offload.
Furthermore, we let the offloading driver advertise what is the max hw
sectors/segments.

Using this interface, the NIC hardware will scatter TCP payload directly
to the BIO pages according to the command_id.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.

This, the SKB represents the data on the wire, while it is pointing
to data that is already placed in the destination buffer.
As a result, data from page frags should not be copied out to
the linear part.

As SKBs that use DDP are already very memory efficient, we modify
skb_condence to avoid copying data from fragments to the linear
part of SKBs that belong to a socket that uses DDP offload.

A follow-up patch will use this interface for DDP in NVMe-TCP.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/netdev_features.h    |  2 +
 include/linux/netdevice.h          |  5 ++
 include/net/inet_connection_sock.h |  4 ++
 include/net/tcp_ddp.h              | 90 ++++++++++++++++++++++++++++++
 net/Kconfig                        |  9 +++
 net/core/skbuff.c                  |  9 ++-
 net/ethtool/common.c               |  1 +
 7 files changed, 119 insertions(+), 1 deletion(-)
 create mode 100644 include/net/tcp_ddp.h

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 0b17c4322b09..a0074b244372 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -84,6 +84,7 @@ enum {
 	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
 
 	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
+	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -157,6 +158,7 @@ enum {
 #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
 #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
 #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
+#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a431c3229cbf..b00e1663724b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -935,6 +935,7 @@ struct dev_ifalias {
 
 struct devlink;
 struct tlsdev_ops;
+struct tcp_ddp_dev_ops;
 
 struct netdev_name_node {
 	struct hlist_node hlist;
@@ -1922,6 +1923,10 @@ struct net_device {
 	const struct tlsdev_ops *tlsdev_ops;
 #endif
 
+#ifdef CONFIG_TCP_DDP
+	const struct tcp_ddp_dev_ops *tcp_ddp_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index dc763ca9413c..503092d8860d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
  * @icsk_clean_acked	   Clean acked data hook
+ * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
+ * @icsk_ulp_ddp_data	   ULP direct data placement private data
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -94,6 +96,8 @@ struct inet_connection_sock {
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void __rcu		  *icsk_ulp_data;
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
+	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
+	void __rcu		  *icsk_ulp_ddp_data;
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:5,
diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
new file mode 100644
index 000000000000..4c0d9660b039
--- /dev/null
+++ b/include/net/tcp_ddp.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * tcp_ddp.h
+ *	Author:	Boris Pismenny <borisp@mellanox.com>
+ *	Copyright (C) 2020 Mellanox Technologies.
+ */
+#ifndef _TCP_DDP_H
+#define _TCP_DDP_H
+
+#include <linux/blkdev.h>
+#include <linux/netdevice.h>
+#include <net/inet_connection_sock.h>
+#include <net/sock.h>
+
+/* limits returned by the offload driver, zero means don't care */
+struct tcp_ddp_limits {
+	int	 max_ddp_sgl_len;
+};
+
+enum tcp_ddp_type {
+	TCP_DDP_NVME = 1,
+};
+
+struct tcp_ddp_config {
+	enum tcp_ddp_type    type;
+	unsigned char        buf[];
+};
+
+struct nvme_tcp_config {
+	struct tcp_ddp_config   cfg;
+
+	u16			pfv;
+	u8			cpda;
+	u8			dgst;
+	int			queue_size;
+	int			queue_id;
+	int			io_cpu;
+};
+
+struct tcp_ddp_io {
+	u32			command_id;
+	int			nents;
+	struct sg_table		sg_table;
+	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
+};
+
+struct tcp_ddp_dev_ops {
+	int (*tcp_ddp_limits)(struct net_device *netdev,
+			      struct tcp_ddp_limits *limits);
+	int (*tcp_ddp_sk_add)(struct net_device *netdev,
+			      struct sock *sk,
+			      struct tcp_ddp_config *config);
+	void (*tcp_ddp_sk_del)(struct net_device *netdev,
+			       struct sock *sk);
+	int (*tcp_ddp_setup)(struct net_device *netdev,
+			     struct sock *sk,
+			     struct tcp_ddp_io *io);
+	int (*tcp_ddp_teardown)(struct net_device *netdev,
+				struct sock *sk,
+				struct tcp_ddp_io *io,
+				void *ddp_ctx);
+	void (*tcp_ddp_resync)(struct net_device *netdev,
+			       struct sock *sk, u32 seq);
+};
+
+#define TCP_DDP_RESYNC_REQ (1 << 0)
+
+/*
+ * Interface to register uppper layer Direct Data Placement (DDP) TCP offload
+ */
+struct tcp_ddp_ulp_ops {
+	/* NIC requests ulp to indicate if @seq is the start of a message */
+	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
+	/* NIC driver informs the ulp that ddp teardown is done */
+	void (*ddp_teardown_done)(void *ddp_ctx);
+};
+
+struct tcp_ddp_ctx {
+	enum tcp_ddp_type    type;
+	unsigned char        buf[];
+};
+
+static inline struct tcp_ddp_ctx *tcp_ddp_get_ctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return (__force struct tcp_ddp_ctx *)icsk->icsk_ulp_ddp_data;
+}
+
+#endif //_TCP_DDP_H
diff --git a/net/Kconfig b/net/Kconfig
index 3831206977a1..346e9fa7a6ec 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -460,6 +460,15 @@ config ETHTOOL_NETLINK
 	  netlink. It provides better extensibility and some new features,
 	  e.g. notification messages.
 
+config TCP_DDP
+	bool "TCP direct data placement offload"
+	default n
+	help
+	  Direct Data Placement (DDP) offload for TCP enables ULP, such as
+	  NVMe-TCP/iSCSI, to request the NIC to place TCP payload data
+	  of a command response directly into kernel pages.
+
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e0774471f56d..ad32985876da 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@
 #include <net/xfrm.h>
 #include <net/mpls.h>
 #include <net/mptcp.h>
+#include <net/tcp_ddp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -6059,9 +6060,15 @@ EXPORT_SYMBOL(pskb_extract);
  */
 void skb_condense(struct sk_buff *skb)
 {
+	bool is_ddp = false;
+
+#ifdef CONFIG_TCP_DDP
+	is_ddp = skb->sk && inet_csk(skb->sk) &&
+		 inet_csk(skb->sk)->icsk_ulp_ddp_data;
+#endif
 	if (skb->data_len) {
 		if (skb->data_len > skb->end - skb->tail ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) || is_ddp)
 			return;
 
 		/* Nice, we can free page frag(s) right now */
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 24036e3055a1..a2ff7a4a6bbf 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -68,6 +68,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_HW_TLS_RX_BIT] =	 "tls-hw-rx-offload",
 	[NETIF_F_GRO_FRAGLIST_BIT] =	 "rx-gro-list",
 	[NETIF_F_HW_MACSEC_BIT] =	 "macsec-hw-offload",
+	[NETIF_F_HW_TCP_DDP_BIT] =	 "tcp-ddp-offload",
 };
 
 const char
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
  2020-09-30 16:20 ` [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
  2020-09-30 16:20 ` [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 21:51   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 04/10] net/tls: expose get_netdev_for_sock Boris Pismenny
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

This commit itroduces support for CRC offload to direct data placement
ULP on the receive side. Both DDP and CRC share a common API to
initialize the offload for a TCP socket. But otherwise, both can
be executed independently.

On the receive side, CRC offload requires a new SKB bit that
indicates that no CRC error was encountered while processing this packet.
If all packets of a ULP message have this bit set, then the CRC
verification for the message can be skipped, as hardware already checked
it.

The following patches will set and use this bit to perform NVME-TCP
CRC offload.

A subsequent series, will add NVMe-TCP transmit side CRC support.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/netdev_features.h | 2 ++
 include/linux/skbuff.h          | 4 ++++
 net/Kconfig                     | 8 ++++++++
 net/ethtool/common.c            | 1 +
 net/ipv4/tcp_input.c            | 7 +++++++
 net/ipv4/tcp_ipv4.c             | 3 +++
 net/ipv4/tcp_offload.c          | 3 +++
 7 files changed, 28 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index a0074b244372..27001f5c0be1 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -85,6 +85,7 @@ enum {
 
 	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
 	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
+	NETIF_F_HW_TCP_DDP_CRC_BIT,	/* TCP DDP CRC offload */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -159,6 +160,7 @@ enum {
 #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
 #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
 #define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
+#define NETIF_F_HW_TCP_DDP_CRC	__NETIF_F(HW_TCP_DDP_CRC)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04a18e01b362..0530d849ebf2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -858,6 +858,10 @@ struct sk_buff {
 #ifdef CONFIG_TLS_DEVICE
 	__u8			decrypted:1;
 #endif
+#ifdef CONFIG_TCP_DDP_CRC
+	__u8                    ddp_crc:1;
+#endif
+
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/net/Kconfig b/net/Kconfig
index 346e9fa7a6ec..7891b4380e56 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -468,6 +468,14 @@ config TCP_DDP
 	  NVMe-TCP/iSCSI, to request the NIC to place TCP payload data
 	  of a command response directly into kernel pages.
 
+config TCP_DDP_CRC
+	bool "TCP direct data placement CRC offload"
+	default n
+	help
+	  Direct Data Placement (DDP) CRC32C offload for TCP enables ULP, such as
+	  NVMe-TCP/iSCSI, to request the NIC to calculate/verify the data digest
+	  of commands as they go through the NIC. Thus avoiding the costly
+	  per-byte overhead.
 
 endif   # if NET
 
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index a2ff7a4a6bbf..4941670dd25d 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -69,6 +69,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_GRO_FRAGLIST_BIT] =	 "rx-gro-list",
 	[NETIF_F_HW_MACSEC_BIT] =	 "macsec-hw-offload",
 	[NETIF_F_HW_TCP_DDP_BIT] =	 "tcp-ddp-offload",
+	[NETIF_F_HW_TCP_DDP_CRC_BIT] =	 "tcp-ddp-crc-offload",
 };
 
 const char
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f7b3e37d2198..f8e1e8ebef2c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5099,6 +5099,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 #ifdef CONFIG_TLS_DEVICE
 		nskb->decrypted = skb->decrypted;
+#endif
+#ifdef CONFIG_TCP_DDP_CRC
+		nskb->ddp_crc = skb->ddp_crc;
 #endif
 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
 		if (list)
@@ -5132,6 +5135,10 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 #ifdef CONFIG_TLS_DEVICE
 				if (skb->decrypted != nskb->decrypted)
 					goto end;
+#endif
+#ifdef CONFIG_TCP_DDP_CRC
+				if (skb->ddp_crc != nskb->ddp_crc)
+					goto end;
 #endif
 			}
 		}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ace48b2790ff..2693029e6ee7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1791,6 +1791,9 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
 #ifdef CONFIG_TLS_DEVICE
 	    tail->decrypted != skb->decrypted ||
+#endif
+#ifdef CONFIG_TCP_DDP_CRC
+	    tail->ddp_crc != skb->ddp_crc ||
 #endif
 	    thtail->doff != th->doff ||
 	    memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index e09147ac9a99..39f5f0bcf181 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -262,6 +262,9 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 #ifdef CONFIG_TLS_DEVICE
 	flush |= p->decrypted ^ skb->decrypted;
 #endif
+#ifdef CONFIG_TCP_DDP_CRC
+	flush |= p->ddp_crc ^ skb->ddp_crc;
+#endif
 
 	if (flush || skb_gro_receive(p, skb)) {
 		mss = 1;
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 04/10] net/tls: expose get_netdev_for_sock
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (2 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 21:56   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path Boris Pismenny
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev

get_netdev_for_sock is a utility that is used to obtain
the net_device structure from a connected socket.

Later patches will use this for nvme-tcp DDP and DDP CRC offloads.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
---
 include/net/sock.h   | 17 +++++++++++++++++
 net/tls/tls_device.c | 20 ++------------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a5c6ae78df77..e0a92187f9ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2703,4 +2703,21 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs);
 
 int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len);
 
+/* Assume that the socket is already connected */
+static inline struct net_device *get_netdev_for_sock(struct sock *sk, bool hold)
+{
+	struct dst_entry *dst = sk_dst_get(sk);
+	struct net_device *netdev = NULL;
+
+	if (likely(dst)) {
+		netdev = dst->dev;
+		if (hold)
+			dev_hold(netdev);
+	}
+
+	dst_release(dst);
+
+	return netdev;
+}
+
 #endif	/* _SOCK_H */
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index b74e2741f74f..4df0de65db04 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -106,22 +106,6 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
 	spin_unlock_irqrestore(&tls_device_lock, flags);
 }
 
-/* We assume that the socket is already connected */
-static struct net_device *get_netdev_for_sock(struct sock *sk)
-{
-	struct dst_entry *dst = sk_dst_get(sk);
-	struct net_device *netdev = NULL;
-
-	if (likely(dst)) {
-		netdev = dst->dev;
-		dev_hold(netdev);
-	}
-
-	dst_release(dst);
-
-	return netdev;
-}
-
 static void destroy_record(struct tls_record_info *record)
 {
 	int i;
@@ -1086,7 +1070,7 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
 	if (skb)
 		TCP_SKB_CB(skb)->eor = 1;
 
-	netdev = get_netdev_for_sock(sk);
+	netdev = get_netdev_for_sock(sk, true);
 	if (!netdev) {
 		pr_err_ratelimited("%s: netdev not found\n", __func__);
 		rc = -EINVAL;
@@ -1162,7 +1146,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	if (ctx->crypto_recv.info.version != TLS_1_2_VERSION)
 		return -EOPNOTSUPP;
 
-	netdev = get_netdev_for_sock(sk);
+	netdev = get_netdev_for_sock(sk, true);
 	if (!netdev) {
 		pr_err_ratelimited("%s: netdev not found\n", __func__);
 		return -EINVAL;
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (3 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 04/10] net/tls: expose get_netdev_for_sock Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 22:19   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path Boris Pismenny
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

This commit introduces direct data placement offload to NVME
TCP. There is a context per queue, which is established after the
handshake
using the tcp_ddp_sk_add/del NDOs.

Additionally, a resynchronization routine is used to assist
hardware recovery from TCP OOO, and continue the offload.
Resynchronization operates as follows:
1. TCP OOO causes the NIC HW to stop the offload
2. NIC HW identifies a PDU header at some TCP sequence number,
and asks NVMe-TCP to confirm it.
This request is delivered from the NIC driver to NVMe-TCP by first
finding the socket for the packet that triggered the request, and
then fiding the nvme_tcp_queue that is used by this routine.
Finally, the request is recorded in the nvme_tcp_queue.
3. When NVMe-TCP observes the requested TCP sequence, it will compare
it with the PDU header TCP sequence, and report the result to the
NIC driver (tcp_ddp_resync), which will update the HW,
and resume offload when all is successful.

Furthermore, we let the offloading driver advertise what is the max hw
sectors/segments via tcp_ddp_limits.

A follow-up patch introduces the data-path changes required for this
offload.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c  | 188 +++++++++++++++++++++++++++++++++++++++
 include/linux/nvme-tcp.h |   2 +
 2 files changed, 190 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8f4f29f18b8c..06711ac095f2 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_OFFLOADS     = 3,
 };
 
 enum nvme_tcp_recv_state {
@@ -110,6 +111,8 @@ struct nvme_tcp_queue {
 	void (*state_change)(struct sock *);
 	void (*data_ready)(struct sock *);
 	void (*write_space)(struct sock *);
+
+	atomic64_t  resync_req;
 };
 
 struct nvme_tcp_ctrl {
@@ -129,6 +132,8 @@ struct nvme_tcp_ctrl {
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
 	u32			io_queues[HCTX_MAX_TYPES];
+
+	struct net_device       *offloading_netdev;
 };
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
@@ -223,6 +228,159 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 	return nvme_tcp_pdu_data_left(req) <= len;
 }
 
+#ifdef CONFIG_TCP_DDP
+
+bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops __read_mostly = {
+	.resync_request		= nvme_tcp_resync_request,
+};
+
+static
+int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
+			    struct nvme_tcp_config *config)
+{
+	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
+	struct tcp_ddp_config *ddp_config = (struct tcp_ddp_config *)config;
+	int ret;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	if (!(netdev->features & NETIF_F_HW_TCP_DDP)) {
+		dev_put(netdev);
+		return -EINVAL;
+	}
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev,
+						 queue->sock->sk,
+						 ddp_config);
+	if (!ret)
+		inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
+	else
+		dev_put(netdev);
+	return ret;
+}
+
+static
+void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
+{
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return;
+	}
+
+	netdev->tcp_ddp_ops->tcp_ddp_sk_del(netdev, queue->sock->sk);
+
+	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
+	dev_put(netdev); /* put the queue_init get_netdev_for_sock() */
+}
+
+static
+int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
+			    struct tcp_ddp_limits *limits)
+{
+	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
+	int ret = 0;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	if (netdev->features & NETIF_F_HW_TCP_DDP &&
+	    netdev->tcp_ddp_ops &&
+	    netdev->tcp_ddp_ops->tcp_ddp_limits)
+			ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, limits);
+	else
+			ret = -EOPNOTSUPP;
+
+	if (!ret) {
+		queue->ctrl->offloading_netdev = netdev;
+		pr_info("%s netdev %s offload limits: max_ddp_sgl_len %d\n",
+			__func__, netdev->name, limits->max_ddp_sgl_len);
+		queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
+		queue->ctrl->ctrl.max_hw_sectors =
+			limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
+	} else {
+		queue->ctrl->offloading_netdev = NULL;
+	}
+
+	dev_put(netdev);
+
+	return ret;
+}
+
+static
+void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
+			      unsigned int pdu_seq)
+{
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	u64 resync_val;
+	u32 resync_seq;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return;
+	}
+
+	resync_val = atomic64_read(&queue->resync_req);
+	if ((resync_val & TCP_DDP_RESYNC_REQ) == 0)
+		return;
+
+	resync_seq = resync_val >> 32;
+	if (before(pdu_seq, resync_seq))
+		return;
+
+	if (atomic64_cmpxchg(&queue->resync_req, resync_val, (resync_val - 1)))
+		netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk, pdu_seq);
+}
+
+bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
+{
+	struct nvme_tcp_queue *queue = sk->sk_user_data;
+
+	atomic64_set(&queue->resync_req,
+		     (((uint64_t)seq << 32) | flags));
+
+	return true;
+}
+
+#else
+
+static
+int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
+			    struct nvme_tcp_config *config)
+{
+	return -EINVAL;
+}
+
+static
+void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
+{}
+
+static
+int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
+			    struct tcp_ddp_limits *limits)
+{
+	return -EINVAL;
+}
+
+static
+void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
+			      unsigned int pdu_seq)
+{}
+
+bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
+{
+	return false;
+}
+
+#endif
+
 static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
 		unsigned int dir)
 {
@@ -628,6 +786,11 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
 	int ret;
 
+	u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset;
+
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+		nvme_tcp_resync_response(queue, pdu_seq);
+
 	ret = skb_copy_bits(skb, *offset,
 		&pdu[queue->pdu_offset], rcv_len);
 	if (unlikely(ret))
@@ -1370,6 +1533,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	struct nvme_tcp_config config;
+	struct tcp_ddp_limits limits;
 	int ret, rcv_pdu_size;
 
 	queue->ctrl = ctrl;
@@ -1487,6 +1652,26 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 #endif
 	write_unlock_bh(&queue->sock->sk->sk_callback_lock);
 
+	if (nvme_tcp_queue_id(queue) != 0) {
+		config.cfg.type		= TCP_DDP_NVME;
+		config.pfv		= NVME_TCP_PFV_1_0;
+		config.cpda		= 0;
+		config.dgst		= queue->hdr_digest ?
+						NVME_TCP_HDR_DIGEST_ENABLE : 0;
+		config.dgst		|= queue->data_digest ?
+						NVME_TCP_DATA_DIGEST_ENABLE : 0;
+		config.queue_size	= queue->queue_size;
+		config.queue_id		= nvme_tcp_queue_id(queue);
+		config.io_cpu		= queue->io_cpu;
+
+		ret = nvme_tcp_offload_socket(queue, &config);
+		if (!ret)
+			set_bit(NVME_TCP_Q_OFFLOADS, &queue->flags);
+	} else {
+		ret = nvme_tcp_offload_limits(queue, &limits);
+	}
+	/* offload is opportunistic - failure is non-critical */
+
 	return 0;
 
 err_init_connect:
@@ -1519,6 +1704,9 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 	nvme_tcp_restore_sock_calls(queue);
 	cancel_work_sync(&queue->io_work);
+
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+		nvme_tcp_unoffload_socket(queue);
 }
 
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
index 959e0bd9a913..65df64c34ecd 100644
--- a/include/linux/nvme-tcp.h
+++ b/include/linux/nvme-tcp.h
@@ -8,6 +8,8 @@
 #define _LINUX_NVME_TCP_H
 
 #include <linux/nvme.h>
+#include <net/sock.h>
+#include <net/tcp_ddp.h>
 
 #define NVME_TCP_DISC_PORT	8009
 #define NVME_TCP_ADMIN_CCSZ	SZ_8K
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (4 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 22:29   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

Introduce the NVMe-TCP DDP data-path offload.
Using this interface, the NIC hardware will scatter TCP payload directly
to the BIO pages according to the command_id in the PDU.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.

The data-path interface contains two routines: tcp_ddp_setup/teardown.
The setup provides the mapping from command_id to the request buffers,
while the teardown removes this mapping.

For efficiency, we introduce an asynchronous nvme completion, which is
split between NVMe-TCP and the NIC driver as follows:
NVMe-TCP performs the specific completion, while NIC driver performs the
generic mq_blk completion.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 121 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 117 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 06711ac095f2..7bd97f856677 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -56,6 +56,11 @@ struct nvme_tcp_request {
 	size_t			offset;
 	size_t			data_sent;
 	enum nvme_tcp_send_state state;
+
+	bool			offloaded;
+	struct tcp_ddp_io	ddp;
+	__le16			status;
+	union nvme_result	result;
 };
 
 enum nvme_tcp_queue_flags {
@@ -231,10 +236,76 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 #ifdef CONFIG_TCP_DDP
 
 bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
 const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops __read_mostly = {
+
 	.resync_request		= nvme_tcp_resync_request,
+	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
 };
 
+static
+int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+			  uint16_t command_id,
+			  struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	int ret;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
+						    &req->ddp, rq);
+	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+	req->offloaded = false;
+	return ret;
+}
+
+void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{
+	struct request *rq = ddp_ctx;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (!nvme_try_complete_req(rq, cpu_to_le16(req->status << 1), req->result))
+		nvme_complete_rq(rq);
+}
+
+static
+int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+		       uint16_t command_id,
+		       struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct net_device *netdev = queue->ctrl->offloading_netdev;
+	int ret;
+
+	req->offloaded = false;
+
+	if (unlikely(!netdev)) {
+		pr_info_ratelimited("%s: netdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	req->ddp.command_id = command_id;
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ret = sg_alloc_table_chained(&req->ddp.sg_table,
+		blk_rq_nr_phys_segments(rq), req->ddp.sg_table.sgl,
+		SG_CHUNK_SIZE);
+	if (ret)
+		return -ENOMEM;
+	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
+
+	ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
+						 queue->sock->sk,
+						 &req->ddp);
+	if (!ret)
+		req->offloaded = true;
+	return ret;
+}
+
 static
 int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
 			    struct nvme_tcp_config *config)
@@ -351,6 +422,25 @@ bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
 
 #else
 
+static
+int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+		       uint16_t command_id,
+		       struct request *rq)
+{
+	return -EINVAL;
+}
+
+static
+int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+			  uint16_t command_id,
+			  struct request *rq)
+{
+	return -EINVAL;
+}
+
+void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{}
+
 static
 int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
 			    struct nvme_tcp_config *config)
@@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		struct nvme_completion *cqe)
 {
+	struct nvme_tcp_request *req;
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
@@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		return -EINVAL;
 	}
 
-	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
-		nvme_complete_rq(rq);
+	req = blk_mq_rq_to_pdu(rq);
+	if (req->offloaded) {
+		req->status = cqe->status;
+		req->result = cqe->result;
+		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
+	} else {
+		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
+			nvme_complete_rq(rq);
+	}
 	queue->nr_cqe++;
 
 	return 0;
@@ -836,9 +934,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 static inline void nvme_tcp_end_request(struct request *rq, u16 status)
 {
 	union nvme_result res = {};
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 
-	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
-		nvme_complete_rq(rq);
+	if (req->offloaded) {
+		req->status = cpu_to_le16(status << 1);
+		req->result = res;
+		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
+	} else {
+		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
+			nvme_complete_rq(rq);
+	}
 }
 
 static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
@@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	bool inline_data = nvme_tcp_has_inline_data(req);
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	int len = sizeof(*pdu) + hdgst - req->offset;
+	struct request *rq = blk_mq_rq_from_pdu(req);
 	int flags = MSG_DONTWAIT;
 	int ret;
 
@@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	else
 		flags |= MSG_EOR;
 
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
+	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
+		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
+
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
@@ -2448,6 +2560,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	req->data_len = blk_rq_nr_phys_segments(rq) ?
 				blk_rq_payload_bytes(rq) : 0;
 	req->curr_bio = rq->bio;
+	req->offloaded = false;
 
 	if (rq_data_dir(rq) == WRITE &&
 	    req->data_len <= nvme_tcp_inline_data_size(queue))
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (5 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 22:44   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

From: Yoray Zack <yorayz@mellanox.com>

crc offload of the nvme capsule. Check if all the skb bits
are on, and if not recalculate the crc in SW and check it.

This patch reworks the receive-side crc calculation to always
run at the end, so as to keep a single flow for both offload
and non-offload. This change simplifies the code, but it may degrade
performance for non-offload crc calculation.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 66 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 7bd97f856677..9a620d1dacb4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -94,6 +94,7 @@ struct nvme_tcp_queue {
 	size_t			data_remaining;
 	size_t			ddgst_remaining;
 	unsigned int		nr_cqe;
+	bool			crc_valid;
 
 	/* send state */
 	struct nvme_tcp_request *request;
@@ -233,6 +234,41 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
 	return nvme_tcp_pdu_data_left(req) <= len;
 }
 
+static inline bool nvme_tcp_device_ddgst_ok(struct nvme_tcp_queue *queue)
+{
+	return queue->crc_valid;
+}
+
+static inline void nvme_tcp_device_ddgst_update(struct nvme_tcp_queue *queue,
+						struct sk_buff *skb)
+{
+	if (queue->crc_valid)
+#ifdef CONFIG_TCP_DDP_CRC
+		queue->crc_valid = skb->ddp_crc;
+#else
+		queue->crc_valid = false;
+#endif
+}
+
+static void nvme_tcp_crc_recalculate(struct nvme_tcp_queue *queue,
+				     struct nvme_tcp_data_pdu *pdu)
+{
+	struct nvme_tcp_request *req;
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq)
+		return;
+	req = blk_mq_rq_to_pdu(rq);
+	crypto_ahash_init(queue->rcv_hash);
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	/* req->ddp.sg_table is allocated and filled in nvme_tcp_setup_ddp */
+	ahash_request_set_crypt(queue->rcv_hash, req->ddp.sg_table.sgl, NULL,
+				le32_to_cpu(pdu->data_length));
+	crypto_ahash_update(queue->rcv_hash);
+}
+
+
 #ifdef CONFIG_TCP_DDP
 
 bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
@@ -706,6 +742,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
 	queue->pdu_offset = 0;
 	queue->data_remaining = -1;
 	queue->ddgst_remaining = 0;
+	queue->crc_valid = true;
 }
 
 static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
@@ -955,6 +992,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct nvme_tcp_request *req;
 	struct request *rq;
 
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+		nvme_tcp_device_ddgst_update(queue, skb);
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
@@ -992,7 +1031,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 		recv_len = min_t(size_t, recv_len,
 				iov_iter_count(&req->iter));
 
-		if (queue->data_digest)
+		if (queue->data_digest && !test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
 			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
 				&req->iter, recv_len, queue->rcv_hash);
 		else
@@ -1012,7 +1051,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 
 	if (!queue->data_remaining) {
 		if (queue->data_digest) {
-			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
 			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
 		} else {
 			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
@@ -1033,8 +1071,11 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	char *ddgst = (char *)&queue->recv_ddgst;
 	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
 	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
+	bool ddgst_offload_fail;
 	int ret;
 
+	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
+		nvme_tcp_device_ddgst_update(queue, skb);
 	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
 	if (unlikely(ret))
 		return ret;
@@ -1045,12 +1086,21 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	if (queue->ddgst_remaining)
 		return 0;
 
-	if (queue->recv_ddgst != queue->exp_ddgst) {
-		dev_err(queue->ctrl->ctrl.device,
-			"data digest error: recv %#x expected %#x\n",
-			le32_to_cpu(queue->recv_ddgst),
-			le32_to_cpu(queue->exp_ddgst));
-		return -EIO;
+	ddgst_offload_fail = !nvme_tcp_device_ddgst_ok(queue);
+	if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) ||
+	    ddgst_offload_fail) {
+		if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
+		    ddgst_offload_fail)
+			nvme_tcp_crc_recalculate(queue, pdu);
+
+		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+		if (queue->recv_ddgst != queue->exp_ddgst) {
+			dev_err(queue->ctrl->ctrl.device,
+				"data digest error: recv %#x expected %#x\n",
+				le32_to_cpu(queue->recv_ddgst),
+				le32_to_cpu(queue->exp_ddgst));
+			return -EIO;
+		}
 	}
 
 	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (6 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-08 22:47   ` Sagi Grimberg
  2020-09-30 16:20 ` [PATCH net-next RFC v1 09/10] net/mlx5e: Add NVMEoTCP offload Boris Pismenny
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

From: Or Gerlitz <ogerlitz@mellanox.com>

For ddp setup/teardown and resync, the offloading logic
uses HW resources at the NIC driver such as SQ and CQ.

These resources are destroyed when the netdevice does down
and hence we must stop using them before the NIC driver
destroyes them.

Use netdevice notifier for that matter -- offloaded connections
are stopped before the stack continues to call the NIC driver
close ndo.

We use the existing recovery flow which has the advantage
of resuming the offload once the connection is re-set.

Since the recovery flow runs in a separate/dedicated WQ
we need to wait in the notifier code for an ACK that all
offloaded queues were stopped which means that the teardown
queue offload ndo was called and the NIC doesn't have any
resources related to that connection any more.

This also buys us proper handling for the UNREGISTER event
b/c our offloading starts in the UP state, and down is always
there between up to unregister.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9a620d1dacb4..7569b47f0414 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -144,6 +144,7 @@ struct nvme_tcp_ctrl {
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
 static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct notifier_block nvme_tcp_netdevice_nb;
 static struct workqueue_struct *nvme_tcp_wq;
 static const struct blk_mq_ops nvme_tcp_mq_ops;
 static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
@@ -412,8 +413,6 @@ int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
 		queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
 		queue->ctrl->ctrl.max_hw_sectors =
 			limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
-	} else {
-		queue->ctrl->offloading_netdev = NULL;
 	}
 
 	dev_put(netdev);
@@ -1992,6 +1991,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 {
 	int ret;
 
+	to_tcp_ctrl(ctrl)->offloading_netdev = NULL;
+
 	ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
 	if (ret)
 		return ret;
@@ -2885,6 +2886,26 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 	return ERR_PTR(ret);
 }
 
+static int nvme_tcp_netdev_event(struct notifier_block *this,
+				 unsigned long event, void *ptr)
+{
+	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+	struct nvme_tcp_ctrl *ctrl;
+
+	switch (event) {
+	case NETDEV_GOING_DOWN:
+		mutex_lock(&nvme_tcp_ctrl_mutex);
+		list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+			if (ndev != ctrl->offloading_netdev)
+				continue;
+			nvme_tcp_error_recovery(&ctrl->ctrl);
+		}
+		mutex_unlock(&nvme_tcp_ctrl_mutex);
+		flush_workqueue(nvme_reset_wq);
+	}
+	return NOTIFY_DONE;
+}
+
 static struct nvmf_transport_ops nvme_tcp_transport = {
 	.name		= "tcp",
 	.module		= THIS_MODULE,
@@ -2899,13 +2920,26 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 
 static int __init nvme_tcp_init_module(void)
 {
+	int ret;
+
 	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
 			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 	if (!nvme_tcp_wq)
 		return -ENOMEM;
 
+	nvme_tcp_netdevice_nb.notifier_call = nvme_tcp_netdev_event;
+	ret = register_netdevice_notifier(&nvme_tcp_netdevice_nb);
+	if (ret) {
+		pr_err("failed to register netdev notifier\n");
+		goto out_err_reg_notifier;
+	}
+
 	nvmf_register_transport(&nvme_tcp_transport);
 	return 0;
+
+out_err_reg_notifier:
+	destroy_workqueue(nvme_tcp_wq);
+	return ret;
 }
 
 static void __exit nvme_tcp_cleanup_module(void)
@@ -2913,6 +2947,7 @@ static void __exit nvme_tcp_cleanup_module(void)
 	struct nvme_tcp_ctrl *ctrl;
 
 	nvmf_unregister_transport(&nvme_tcp_transport);
+	unregister_netdevice_notifier(&nvme_tcp_netdevice_nb);
 
 	mutex_lock(&nvme_tcp_ctrl_mutex);
 	list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 09/10] net/mlx5e: Add NVMEoTCP offload
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (7 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-09-30 16:20 ` [PATCH net-next RFC v1 10/10] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
  2020-10-09  0:08 ` [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Sagi Grimberg
  10 siblings, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

Add NVMEoTCP offload to mlx5.

This patch implements the NVME-TCP offload added by previous commits.
Similarly to other layer-5 offload, the offload is mapped to a TIR, and
updated using WQEs.

- Use 128B CQEs when NVME-TCP offload is enabled
- Implement asynchronous ddp invalidaation by completing the nvme-tcp
  request only when the invalidate UMR is done
- Use KLM UMRs to implement ddp
- Use a dedicated icosq for all NVME-TCP work. This SQ is unique in the
  sense that it is driven directly by the NVME-TCP layer to submit and
  invalidate ddp requests.
- Add statistics for offload packets/bytes, ddp setup/teardown, and
  queue init/teardown

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  11 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |   4 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  13 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |   9 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |  10 +
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 894 ++++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    | 116 +++
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |  79 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  73 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  25 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  26 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  16 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +
 include/linux/mlx5/device.h                   |   8 +
 include/linux/mlx5/mlx5_ifc.h                 | 121 ++-
 include/linux/mlx5/qp.h                       |   1 +
 19 files changed, 1419 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 99f1ec3b2575..20fc7795f7c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -201,3 +201,14 @@ config MLX5_SW_STEERING
 	default y
 	help
 	Build support for software-managed steering in the NIC.
+
+config MLX5_EN_NVMEOTCP
+	bool "NVMEoTCP accelaration"
+	depends on MLX5_CORE_EN
+	depends on TCP_DDP
+	depends on TCP_DDP_CRC
+	default y
+	help
+	Build support for NVMEoTCP accelaration in the NIC.
+	Note: Support for hardware with this capability needs to be selected
+	for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9826a041e407..9dd6b41c2486 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -84,3 +84,5 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_ste.o steering/dr_send.o \
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
+
+mlx5_core-$(CONFIG_MLX5_EN_NVMEOTCP) += en_accel/fs_tcp.o en_accel/nvmeotcp.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5368e06cd71c..a8c0fc98b394 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -209,7 +209,10 @@ struct mlx5e_umr_wqe {
 	struct mlx5_wqe_ctrl_seg       ctrl;
 	struct mlx5_wqe_umr_ctrl_seg   uctrl;
 	struct mlx5_mkey_seg           mkc;
-	struct mlx5_mtt                inline_mtts[0];
+	union {
+		struct mlx5_mtt        inline_mtts[0];
+		struct mlx5_klm        inline_klms[0];
+	};
 };
 
 extern const char mlx5e_self_tests[][ETH_GSTRING_LEN];
@@ -642,6 +645,9 @@ struct mlx5e_channel {
 	struct mlx5e_xdpsq         rq_xdpsq;
 	struct mlx5e_txqsq         sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_icosq         icosq;   /* internal control operations */
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	struct mlx5e_icosq         nvmeotcpsq;   /* nvmeotcp umrs  */
+#endif
 	bool                       xdp;
 	struct napi_struct         napi;
 	struct device             *pdev;
@@ -815,6 +821,9 @@ struct mlx5e_priv {
 #endif
 #ifdef CONFIG_MLX5_EN_TLS
 	struct mlx5e_tls          *tls;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	struct mlx5e_nvmeotcp      *nvmeotcp;
 #endif
 	struct devlink_health_reporter *tx_reporter;
 	struct devlink_health_reporter *rx_reporter;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index a87273e801b2..7d280c35f538 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -16,6 +16,7 @@ struct mlx5e_cq_param {
 	struct mlx5_wq_param       wq;
 	u16                        eq_ix;
 	u8                         cq_period_mode;
+	bool                       force_cqe128;
 };
 
 struct mlx5e_rq_param {
@@ -38,6 +39,9 @@ struct mlx5e_channel_param {
 	struct mlx5e_sq_param      xdp_sq;
 	struct mlx5e_sq_param      icosq;
 	struct mlx5e_sq_param      async_icosq;
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	struct mlx5e_sq_param      nvmeotcpsq;
+#endif
 };
 
 static inline bool mlx5e_qid_get_ch_if_in_group(struct mlx5e_params *params,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 07ee1d236ab3..1aec1900bee9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -32,6 +32,11 @@ enum mlx5e_icosq_wqe_type {
 	MLX5E_ICOSQ_WQE_SET_PSV_TLS,
 	MLX5E_ICOSQ_WQE_GET_PSV_TLS,
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	MLX5E_ICOSQ_WQE_UMR_NVME_TCP,
+	MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE,
+	MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP,
+#endif
 };
 
 /* General */
@@ -173,6 +178,14 @@ struct mlx5e_icosq_wqe_info {
 		struct {
 			struct mlx5e_ktls_rx_resync_buf *buf;
 		} tls_get_params;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+		struct {
+			struct mlx5e_nvmeotcp_queue *queue;
+		} nvmeotcp_q;
+		struct {
+			struct nvmeotcp_queue_entry *entry;
+		} nvmeotcp_qe;
 #endif
 	};
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index 2ea1cdc1ca54..2fa6f9286ed9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -39,6 +39,7 @@
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls.h"
 #include "en_accel/tls_rxtx.h"
+#include "en_accel/nvmeotcp.h"
 #include "en.h"
 #include "en/txrx.h"
 
@@ -162,11 +163,17 @@ static inline void mlx5e_accel_tx_finish(struct mlx5e_txqsq *sq,
 
 static inline int mlx5e_accel_init_rx(struct mlx5e_priv *priv)
 {
-	return mlx5e_ktls_init_rx(priv);
+	int tls, nvmeotcp;
+
+	tls = mlx5e_ktls_init_rx(priv);
+	nvmeotcp = mlx5e_nvmeotcp_init_rx(priv);
+
+	return tls && nvmeotcp;
 }
 
 static inline void mlx5e_accel_cleanup_rx(struct mlx5e_priv *priv)
 {
+	mlx5e_nvmeotcp_cleanup_rx(priv);
 	mlx5e_ktls_cleanup_rx(priv);
 }
 #endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
index 97f1594cee11..feded6c8cca1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c
@@ -14,6 +14,7 @@ enum accel_fs_tcp_type {
 struct mlx5e_accel_fs_tcp {
 	struct mlx5e_flow_table tables[ACCEL_FS_TCP_NUM_TYPES];
 	struct mlx5_flow_handle *default_rules[ACCEL_FS_TCP_NUM_TYPES];
+	refcount_t		ref_count;
 };
 
 static enum mlx5e_traffic_types fs_accel2tt(enum accel_fs_tcp_type i)
@@ -335,6 +336,7 @@ static int accel_fs_tcp_enable(struct mlx5e_priv *priv)
 			return err;
 		}
 	}
+	refcount_set(&priv->fs.accel_tcp->ref_count, 1);
 	return 0;
 }
 
@@ -358,6 +360,9 @@ void mlx5e_accel_fs_tcp_destroy(struct mlx5e_priv *priv)
 	if (!priv->fs.accel_tcp)
 		return;
 
+	if (!refcount_dec_and_test(&priv->fs.accel_tcp->ref_count))
+		return;
+
 	accel_fs_tcp_disable(priv);
 
 	for (i = 0; i < ACCEL_FS_TCP_NUM_TYPES; i++)
@@ -374,6 +379,11 @@ int mlx5e_accel_fs_tcp_create(struct mlx5e_priv *priv)
 	if (!MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version))
 		return -EOPNOTSUPP;
 
+	if (priv->fs.accel_tcp) {
+		refcount_inc(&priv->fs.accel_tcp->ref_count);
+		return 0;
+	}
+
 	priv->fs.accel_tcp = kzalloc(sizeof(*priv->fs.accel_tcp), GFP_KERNEL);
 	if (!priv->fs.accel_tcp)
 		return -ENOMEM;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
new file mode 100644
index 000000000000..2dc7d3ad093c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
@@ -0,0 +1,894 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+
+#include <linux/netdevice.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/idr.h>
+#include <linux/nvme-tcp.h>
+#include "en_accel/nvmeotcp.h"
+#include "en_accel/nvmeotcp_utils.h"
+#include "en_accel/fs_tcp.h"
+#include "en/txrx.h"
+
+#define MAX_NVMEOTCP_QUEUES	(512)
+#define MIN_NVMEOTCP_QUEUES	(1)
+
+static const struct rhashtable_params rhash_queues = {
+	.key_len = sizeof(int),
+	.key_offset = offsetof(struct mlx5e_nvmeotcp_queue, id),
+	.head_offset = offsetof(struct mlx5e_nvmeotcp_queue, hash),
+	.automatic_shrinking = true,
+	.min_size = 1,
+	.max_size = MAX_NVMEOTCP_QUEUES,
+};
+
+#define MLX5_NVME_TCP_MAX_SEGMENTS 128
+
+static u32 mlx5e_get_max_sgl(struct mlx5_core_dev *mdev)
+{
+	return min_t(u32,
+		     MLX5_NVME_TCP_MAX_SEGMENTS,
+		     1 << MLX5_CAP_GEN(mdev, log_max_klm_list_size));
+}
+
+static void mlx5e_nvmeotcp_destroy_tir(struct mlx5e_priv *priv, int tirn)
+{
+	mlx5_core_destroy_tir(priv->mdev, tirn);
+}
+
+static inline u32
+mlx5e_get_channel_ix_from_io_cpu(struct mlx5e_priv *priv, u32 io_cpu)
+{
+	int num_channels = priv->channels.params.num_channels;
+	u32 channel_ix = io_cpu;
+
+	if (channel_ix >= num_channels)
+		channel_ix = channel_ix % num_channels;
+
+	return channel_ix;
+}
+
+static int mlx5e_nvmeotcp_create_tir(struct mlx5e_priv *priv,
+				     struct sock *sk,
+				     struct nvme_tcp_config *config,
+				     struct mlx5e_nvmeotcp_queue *queue)
+{
+	u32 rqtn = priv->direct_tir[queue->channel_ix].rqt.rqtn;
+	int err, inlen;
+	void *tirc;
+	u32 tirn;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_tir_in);
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+	tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
+	MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
+	MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_RX_HASH_FN_INVERTED_XOR8);
+	MLX5_SET(tirc, tirc, indirect_table, rqtn);
+	MLX5_SET(tirc, tirc, transport_domain, priv->mdev->mlx5e_res.td.tdn);
+	MLX5_SET(tirc, tirc, nvmeotcp_zero_copy_en, 1);
+	MLX5_SET(tirc, tirc, nvmeotcp_tag_buffer_table_id,
+		 queue->tag_buf_table_id);
+	MLX5_SET(tirc, tirc, self_lb_block,
+		 MLX5_TIRC_SELF_LB_BLOCK_BLOCK_UNICAST |
+		 MLX5_TIRC_SELF_LB_BLOCK_BLOCK_MULTICAST);
+	err = mlx5_core_create_tir(priv->mdev, in, &tirn);
+
+	if (!err)
+		queue->tirn = tirn;
+
+	kvfree(in);
+	return err;
+}
+
+static
+int mlx5e_create_nvmeotcp_tag_buf_table(struct mlx5_core_dev *mdev,
+					struct mlx5e_nvmeotcp_queue *queue,
+					u8 log_table_size)
+{
+	u32 in[MLX5_ST_SZ_DW(create_nvmeotcp_tag_buf_table_in)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)];
+	u64 general_obj_types;
+	void *obj;
+	int err;
+
+	obj = MLX5_ADDR_OF(create_nvmeotcp_tag_buf_table_in, in,
+			   nvmeotcp_tag_buf_table_obj);
+
+	general_obj_types = MLX5_CAP_GEN_64(mdev, general_obj_types);
+	if (!(general_obj_types &
+	      MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE))
+		return -EINVAL;
+
+	MLX5_SET(general_obj_in_cmd_hdr, in, opcode,
+		 MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type,
+		 MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE);
+	MLX5_SET(nvmeotcp_tag_buf_table_obj, obj,
+		 log_tag_buffer_table_size, log_table_size);
+
+	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+	if (!err)
+		queue->tag_buf_table_id = MLX5_GET(general_obj_out_cmd_hdr,
+						   out, obj_id);
+	return err;
+}
+
+static
+void mlx5_destroy_nvmeotcp_tag_buf_table(struct mlx5_core_dev *mdev, u32 uid)
+{
+	u32 in[MLX5_ST_SZ_DW(general_obj_in_cmd_hdr)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)];
+
+	MLX5_SET(general_obj_in_cmd_hdr, in, opcode,
+		 MLX5_CMD_OP_DESTROY_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type,
+		 MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, uid);
+
+	mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+#define KLM_ALIGNMENT 4
+#define MLX5E_NVMEOTCP_KLM_UMR_WQE_SZ(sgl_len)\
+	(sizeof(struct mlx5e_umr_wqe) +\
+	(sizeof(struct mlx5_klm) * (sgl_len)))
+
+#define MLX5E_NVMEOTCP_KLM_UMR_WQEBBS(sgl_len)\
+	(DIV_ROUND_UP(MLX5E_NVMEOTCP_KLM_UMR_WQE_SZ(sgl_len), MLX5_SEND_WQE_BB))
+
+#define NVMEOTCP_KLM_UMR_DS_CNT(sgl_len)\
+	DIV_ROUND_UP(MLX5E_NVMEOTCP_KLM_UMR_WQE_SZ(sgl_len), MLX5_SEND_WQE_DS)
+
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_TIR_PARAMS 0x2
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS 0x2
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_UMR 0x0
+
+#define MAX_KLM_ENTRIES_PER_WQE(wqe_size)\
+	(((wqe_size) - sizeof(struct mlx5e_umr_wqe)) / sizeof(struct mlx5_klm))
+
+#define KLM_ENTRIES_PER_WQE(wqe_size)\
+	(MAX_KLM_ENTRIES_PER_WQE(wqe_size) -\
+			(MAX_KLM_ENTRIES_PER_WQE(wqe_size) % KLM_ALIGNMENT))
+#define STATIC_PARAMS_DS_CNT \
+	DIV_ROUND_UP(MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ, MLX5_SEND_WQE_DS)
+#define PROGRESS_PARAMS_DS_CNT \
+	DIV_ROUND_UP(MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ, MLX5_SEND_WQE_DS)
+enum wqe_type {
+	KLM_UMR = 0,
+	BSF_KLM_UMR = 1,
+	SET_PSV_UMR = 2,
+	BSF_UMR = 3,
+	KLM_INV_UMR = 4,
+};
+
+static void
+fill_nvmeotcp_klm_wqe(struct mlx5e_nvmeotcp_queue *queue,
+		      struct mlx5e_umr_wqe *wqe, u16 ccid, u32 klm_entries,
+		      u16 klm_offset, enum wqe_type klm_type)
+{
+	struct scatterlist *sgl_mkey;
+	u32 lkey, i;
+
+	if (klm_type == BSF_KLM_UMR) {
+		for (i = 0; i < klm_entries; i++) {
+			lkey = queue->ccid_table[i + klm_offset].klm_mkey.key;
+			wqe->inline_klms[i].bcount = cpu_to_be32(1);
+			wqe->inline_klms[i].key	   = cpu_to_be32(lkey);
+			wqe->inline_klms[i].va	   = 0;
+		}
+	} else {
+		lkey = queue->priv->mdev->mlx5e_res.mkey.key;
+		for (i = 0; i < klm_entries; i++) {
+			sgl_mkey = &queue->ccid_table[ccid].sgl[i + klm_offset];
+			wqe->inline_klms[i].bcount =
+					cpu_to_be32(sgl_mkey->length);
+			wqe->inline_klms[i].key	   = cpu_to_be32(lkey);
+			wqe->inline_klms[i].va	   =
+					cpu_to_be64(sgl_mkey->dma_address);
+		}
+	}
+}
+
+static void
+build_nvmeotcp_klm_umr(struct mlx5e_nvmeotcp_queue *queue,
+		       struct mlx5e_umr_wqe *wqe, u16 ccid, int klm_entries,
+		       u32 klm_offset, u32 len, enum wqe_type klm_type)
+{
+	u32 id = (klm_type == KLM_UMR) ? queue->ccid_table[ccid].klm_mkey.key :
+		(queue->tirn << MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT);
+	u8 opc_mod = (klm_type == KLM_UMR) ? MLX5_CTRL_SEGMENT_OPC_MOD_UMR_UMR :
+		MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS;
+	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
+	struct mlx5_mkey_seg          *mkc  = &wqe->mkc;
+	u32 sqn = queue->sq->sqn;
+	u16 pc = queue->sq->pc;
+
+	cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+					     MLX5_OPCODE_UMR | (opc_mod) << 24);
+	cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				   NVMEOTCP_KLM_UMR_DS_CNT(ALIGN(klm_entries, KLM_ALIGNMENT)));
+	cseg->general_id = cpu_to_be32(id);
+
+	if (!klm_entries) { /* this is invalidate */
+		ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
+		ucseg->flags = MLX5_UMR_INLINE;
+		mkc->status = MLX5_MKEY_STATUS_FREE;
+		return;
+	}
+
+	if (klm_type == KLM_UMR && !klm_offset) {
+		ucseg->mkey_mask |= cpu_to_be64(MLX5_MKEY_MASK_XLT_OCT_SIZE);
+		wqe->mkc.xlt_oct_size = cpu_to_be32(ALIGN(len, KLM_ALIGNMENT));
+	}
+
+	ucseg->flags = MLX5_UMR_INLINE | MLX5_UMR_TRANSLATION_OFFSET_EN;
+	ucseg->xlt_octowords = cpu_to_be16(ALIGN(klm_entries, KLM_ALIGNMENT));
+	ucseg->xlt_offset = cpu_to_be16(klm_offset);
+	fill_nvmeotcp_klm_wqe(queue, wqe, ccid, klm_entries,
+			      klm_offset, klm_type);
+}
+
+static void
+fill_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue,
+			      struct mlx5_seg_nvmeotcp_progress_params *params,
+			      u32 seq)
+{
+	void *ctx = params->ctx;
+
+	MLX5_SET(nvmeotcp_progress_params, ctx,
+		 next_pdu_tcp_sn, seq);
+	MLX5_SET(nvmeotcp_progress_params, ctx, valid, 1);
+	MLX5_SET(nvmeotcp_progress_params, ctx, pdu_tracker_state,
+		 MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_START);
+}
+
+void
+build_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue,
+			       struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe,
+			       u32 seq)
+{
+	struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+	u32 sqn = queue->sq->sqn;
+	u16 pc = queue->sq->pc;
+	u8 opc_mod;
+
+	memset(wqe, 0, MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ);
+	opc_mod = MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_PROGRESS_PARAMS;
+	cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+					     MLX5_OPCODE_SET_PSV | (opc_mod << 24));
+	cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				   PROGRESS_PARAMS_DS_CNT);
+	cseg->general_id = cpu_to_be32(queue->tirn <<
+				       MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT);
+	fill_nvmeotcp_progress_params(queue, &wqe->params, seq);
+}
+
+static void
+fill_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue,
+			    struct mlx5_seg_nvmeotcp_static_params *params,
+			    u32 resync_seq)
+{
+	struct mlx5_core_dev *mdev = queue->priv->mdev;
+	void *ctx = params->ctx;
+
+	MLX5_SET(transport_static_params, ctx, const_1, 1);
+	MLX5_SET(transport_static_params, ctx, const_2, 2);
+	MLX5_SET(transport_static_params, ctx, acc_type,
+		 MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP);
+	MLX5_SET(transport_static_params, ctx, nvme_resync_tcp_sn, resync_seq);
+	MLX5_SET(transport_static_params, ctx, pda, queue->pda);
+	MLX5_SET(transport_static_params, ctx, ddgst_en, queue->dgst);
+	MLX5_SET(transport_static_params, ctx, ddgst_offload_en,
+		 queue->dgst && MLX5_CAP_DEV_NVMEOTCP(mdev, crc_rx));
+	MLX5_SET(transport_static_params, ctx, hddgst_en, 0);
+	MLX5_SET(transport_static_params, ctx, hdgst_offload_en, 0);
+	MLX5_SET(transport_static_params, ctx, ti,
+		 MLX5_TRANSPORT_STATIC_PARAMS_TI_INITIATOR);
+	MLX5_SET(transport_static_params, ctx, zero_copy_en, 1);
+}
+
+void
+build_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue,
+			     struct mlx5e_set_nvmeotcp_static_params_wqe *wqe,
+			     u32 resync_seq)
+{
+	u8 opc_mod = MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS;
+	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
+	u32 sqn = queue->sq->sqn;
+	u16 pc = queue->sq->pc;
+
+	memset(wqe, 0, MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ);
+
+	cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+					     MLX5_OPCODE_UMR | (opc_mod) << 24);
+	cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				   STATIC_PARAMS_DS_CNT);
+	cseg->imm = cpu_to_be32(queue->tirn << MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT);
+
+	ucseg->flags = MLX5_UMR_INLINE;
+	ucseg->bsf_octowords =
+		cpu_to_be16(MLX5E_NVMEOTCP_STATIC_PARAMS_OCTWORD_SIZE);
+	fill_nvmeotcp_static_params(queue, &wqe->params, resync_seq);
+}
+
+static void
+mlx5e_nvmeotcp_fill_wi(struct mlx5e_nvmeotcp_queue *nvmeotcp_queue,
+		       struct mlx5e_icosq *sq, u32 wqe_bbs,
+		       u16 pi, u16 ccid, enum wqe_type type)
+{
+	struct mlx5e_icosq_wqe_info *wi = &sq->db.wqe_info[pi];
+
+	wi->num_wqebbs = wqe_bbs;
+
+	switch (type) {
+	case SET_PSV_UMR:
+		wi->wqe_type = MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP;
+		break;
+	case KLM_INV_UMR:
+		wi->wqe_type = MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE;
+		break;
+	default:
+		wi->wqe_type = MLX5E_ICOSQ_WQE_UMR_NVME_TCP;
+		break;
+	}
+
+	if (type == KLM_INV_UMR)
+		wi->nvmeotcp_qe.entry = &nvmeotcp_queue->ccid_table[ccid];
+	else if (type == SET_PSV_UMR)
+		wi->nvmeotcp_q.queue = nvmeotcp_queue;
+
+}
+
+static void
+mlx5e_nvmeotcp_post_static_params_wqe(struct mlx5e_nvmeotcp_queue *queue,
+				      u32 resync_seq)
+{
+	struct mlx5e_set_nvmeotcp_static_params_wqe *wqe;
+	struct mlx5e_icosq *sq = queue->sq;
+	u16 pi, wqe_bbs;
+
+	wqe_bbs = MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS;
+	pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs);
+	wqe = MLX5E_NVMEOTCP_FETCH_STATIC_PARAMS_WQE(sq, pi);
+	mlx5e_nvmeotcp_fill_wi(NULL, sq, wqe_bbs, pi, 0, BSF_UMR);
+	build_nvmeotcp_static_params(queue, wqe, resync_seq);
+	sq->pc += wqe_bbs;
+	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &wqe->ctrl);
+}
+
+static void
+mlx5e_nvmeotcp_post_progress_params_wqe(struct mlx5e_nvmeotcp_queue *queue,
+					u32 seq)
+{
+	struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe;
+	struct mlx5e_icosq *sq = queue->sq;
+	u16 pi, wqe_bbs;
+
+	wqe_bbs = MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQEBBS;
+	pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs);
+	wqe = MLX5E_NVMEOTCP_FETCH_PROGRESS_PARAMS_WQE(sq, pi);
+	mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi, 0, SET_PSV_UMR);
+	build_nvmeotcp_progress_params(queue, wqe, seq);
+	sq->pc += wqe_bbs;
+	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &wqe->ctrl);
+}
+
+static void
+post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue,
+	     enum wqe_type wqe_type,
+	     u16 ccid,
+	     u32 klm_length,
+	     u32 *klm_offset)
+{
+	struct mlx5e_icosq *sq = queue->sq;
+	u32 wqe_bbs, cur_klm_entries;
+	struct mlx5e_umr_wqe *wqe;
+	u16 pi, wqe_sz;
+
+	cur_klm_entries = min_t(int, queue->max_klms_per_wqe,
+				klm_length - *klm_offset);
+	wqe_sz = MLX5E_NVMEOTCP_KLM_UMR_WQE_SZ(ALIGN(cur_klm_entries, KLM_ALIGNMENT));
+	wqe_bbs = DIV_ROUND_UP(wqe_sz, MLX5_SEND_WQE_BB);
+	pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs);
+	wqe = MLX5E_NVMEOTCP_FETCH_KLM_WQE(sq, pi);
+	mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi, ccid,
+			       klm_length ? KLM_UMR : KLM_INV_UMR);
+	build_nvmeotcp_klm_umr(queue, wqe, ccid, cur_klm_entries, *klm_offset,
+			       klm_length, wqe_type);
+	*klm_offset += cur_klm_entries;
+	sq->pc += wqe_bbs;
+	sq->doorbell_cseg = &wqe->ctrl;
+}
+
+static int
+mlx5e_nvmeotcp_post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue,
+			    enum wqe_type wqe_type,
+			    u16 ccid,
+			    u32 klm_length)
+{
+	u32 klm_offset = 0, wqes, wqe_sz, max_wqe_bbs, i, room;
+	struct mlx5e_icosq *sq = queue->sq;
+
+	/* TODO: set stricter wqe_sz; using max for now */
+	if (klm_length == 0) {
+		wqes = 1;
+		wqe_sz = MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS;
+	} else {
+		wqes = DIV_ROUND_UP(klm_length, queue->max_klms_per_wqe);
+		wqe_sz = MLX5E_NVMEOTCP_KLM_UMR_WQE_SZ(queue->max_klms_per_wqe);
+	}
+
+	max_wqe_bbs = DIV_ROUND_UP(wqe_sz, MLX5_SEND_WQE_BB);
+
+	room = mlx5e_stop_room_for_wqe(max_wqe_bbs) * wqes;
+	if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, room)))
+		return -ENOSPC;
+
+	for (i = 0; i < wqes; i++)
+		post_klm_wqe(queue, wqe_type, ccid, klm_length, &klm_offset);
+
+	mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, sq->doorbell_cseg);
+	return 0;
+}
+
+static int mlx5e_create_nvmeotcp_mkey(struct mlx5_core_dev *mdev,
+				      u8 access_mode,
+				      u32 translation_octword_size,
+				      struct mlx5_core_mkey *mkey)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
+	void *mkc;
+	u32 *in;
+	int err;
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, free, 1);
+	MLX5_SET(mkc, mkc, translations_octword_size, translation_octword_size);
+	MLX5_SET(mkc, mkc, umr_en, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, access_mode_1_0, access_mode);
+
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
+	MLX5_SET(mkc, mkc, length64, 1);
+
+	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
+
+	kvfree(in);
+	return err;
+}
+
+static int
+mlx5e_nvmeotcp_offload_limits(struct net_device *netdev,
+			      struct tcp_ddp_limits *limits)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+
+	limits->max_ddp_sgl_len = mlx5e_get_max_sgl(mdev);
+	return 0;
+}
+
+#define OCTWORD_SHIFT 4
+#define MAX_DS_VALUE 63
+static int
+mlx5e_nvmeotcp_queue_init(struct net_device *netdev,
+			  struct sock *sk,
+			  struct tcp_ddp_config *tconfig)
+{
+	struct nvme_tcp_config *config = (struct nvme_tcp_config *)tconfig;
+	u8 log_queue_size = order_base_2(config->queue_size);
+	int max_sgls, max_wqe_sz_cap, queue_id, tirn, i, err;
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5e_rq_stats *stats;
+
+	if (tconfig->type != TCP_DDP_NVME) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (config->queue_size >
+	    BIT(MLX5_CAP_DEV_NVMEOTCP(mdev, log_max_nvmeotcp_tag_buffer_size))) {
+		err = -EINVAL;
+		goto out;
+	}
+
+
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	queue_id = ida_simple_get(&priv->nvmeotcp->queue_ids,
+				  MIN_NVMEOTCP_QUEUES, MAX_NVMEOTCP_QUEUES,
+				  GFP_KERNEL);
+	if (queue_id < 0) {
+		err = -ENOSPC;
+		goto free_queue;
+	}
+
+	err = mlx5e_create_nvmeotcp_tag_buf_table(mdev, queue, log_queue_size);
+	if (err)
+		goto remove_queue_id;
+
+	queue->tcp_ddp_ctx.type = TCP_DDP_NVME;
+	queue->sk = sk;
+	queue->id = queue_id;
+	queue->dgst = config->dgst;
+	queue->pda = config->cpda;
+	queue->channel_ix = mlx5e_get_channel_ix_from_io_cpu(priv,
+							     config->io_cpu);
+	queue->sq = &priv->channels.c[queue->channel_ix]->nvmeotcpsq;
+	queue->size = config->queue_size;
+	max_wqe_sz_cap  = min_t(int, MAX_DS_VALUE * MLX5_SEND_WQE_DS,
+				MLX5_CAP_GEN(mdev, max_wqe_sz_sq) << OCTWORD_SHIFT);
+	queue->max_klms_per_wqe = KLM_ENTRIES_PER_WQE(max_wqe_sz_cap);
+	queue->priv = priv;
+	init_completion(&queue->done);
+	queue->context_ready = false;
+
+	/* initializes queue->tirn */
+	err = mlx5e_nvmeotcp_create_tir(priv, sk, config, queue);
+	if (err)
+		goto destroy_tag_buffer_table;
+
+	queue->ccid_table = kcalloc(queue->size,
+				    sizeof(struct nvmeotcp_queue_entry),
+				    GFP_KERNEL);
+	if (!queue->ccid_table) {
+		err = -ENOMEM;
+		goto destroy_tir;
+	}
+
+	err = rhashtable_insert_fast(&priv->nvmeotcp->queue_hash, &queue->hash,
+				     rhash_queues);
+	if (err)
+		goto free_ccid_table;
+
+	mlx5e_nvmeotcp_post_static_params_wqe(queue, 0);
+	mlx5e_nvmeotcp_post_progress_params_wqe(queue, tcp_sk(sk)->copied_seq);
+	max_sgls = mlx5e_get_max_sgl(mdev);
+	for (i = 0; i < config->queue_size; i++) {
+		err = mlx5e_create_nvmeotcp_mkey(mdev,
+						 MLX5_MKC_ACCESS_MODE_KLMS,
+						 max_sgls,
+						 &queue->ccid_table[i].klm_mkey);
+		if (err)
+			goto free_sgl;
+	}
+
+	err = mlx5e_nvmeotcp_post_klm_wqe(queue, BSF_KLM_UMR, 0, queue->size);
+	if (err)
+		goto free_sgl;
+
+	WARN_ON(!wait_for_completion_timeout(&queue->done, 1800));
+
+	tirn = queue->tirn;
+	if (queue->context_ready)
+		queue->fh = mlx5e_accel_fs_add_sk(priv, sk, tirn, queue_id);
+
+	if (IS_ERR_OR_NULL(queue->fh)) {
+		err = -EINVAL;
+		goto free_sgl;
+	}
+
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_queue_init++;
+	write_lock_bh(&sk->sk_callback_lock);
+	rcu_assign_pointer(inet_csk(sk)->icsk_ulp_ddp_data, queue);
+	write_unlock_bh(&sk->sk_callback_lock);
+	refcount_set(&queue->ref_count, 1);
+	static_branch_inc(&skip_copy_enabled);
+	return err;
+
+free_sgl:
+	while (i--)
+		mlx5_core_destroy_mkey(mdev, &queue->ccid_table[i].klm_mkey);
+	rhashtable_remove_fast(&priv->nvmeotcp->queue_hash,
+			       &queue->hash, rhash_queues);
+free_ccid_table:
+	kfree(queue->ccid_table);
+destroy_tir:
+	mlx5e_nvmeotcp_destroy_tir(priv, queue->tirn);
+destroy_tag_buffer_table:
+	mlx5_destroy_nvmeotcp_tag_buf_table(mdev, queue->tag_buf_table_id);
+remove_queue_id:
+	ida_simple_remove(&priv->nvmeotcp->queue_ids, queue_id);
+free_queue:
+	kfree(queue);
+out:
+	stats->nvmeotcp_queue_init_fail++;
+	return err;
+}
+
+static void
+mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev,
+			      struct sock *sk)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5e_rq_stats *stats;
+	int i;
+
+	queue = (struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_queue_teardown++;
+
+	WARN_ON(refcount_read(&queue->ref_count) != 1);
+	mlx5e_accel_fs_del_sk(queue->fh);
+
+	for (i = 0; i < queue->size; i++)
+		mlx5_core_destroy_mkey(mdev, &queue->ccid_table[i].klm_mkey);
+
+	rhashtable_remove_fast(&priv->nvmeotcp->queue_hash, &queue->hash,
+			       rhash_queues);
+	kfree(queue->ccid_table);
+	mlx5e_nvmeotcp_destroy_tir(priv, queue->tirn);
+	mlx5_destroy_nvmeotcp_tag_buf_table(mdev, queue->tag_buf_table_id);
+	ida_simple_remove(&priv->nvmeotcp->queue_ids, queue->id);
+	static_branch_dec(&skip_copy_enabled);
+	write_lock_bh(&sk->sk_callback_lock);
+	rcu_assign_pointer(inet_csk(sk)->icsk_ulp_ddp_data, NULL);
+	write_unlock_bh(&sk->sk_callback_lock);
+	mlx5e_nvmeotcp_put_queue(queue);
+}
+
+static int
+mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev,
+			 struct sock *sk,
+			 struct tcp_ddp_io *ddp)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct scatterlist *sg = ddp->sg_table.sgl;
+	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5e_rq_stats *stats;
+	struct mlx5_core_dev *mdev;
+	int count = 0;
+
+	queue = (struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+
+	mdev = queue->priv->mdev;
+	count = dma_map_sg(mdev->device, ddp->sg_table.sgl, ddp->nents,
+			   DMA_FROM_DEVICE);
+
+	if (WARN_ON(count > mlx5e_get_max_sgl(mdev)))
+		return -ENOSPC;
+
+	queue->ccid_table[ddp->command_id].ddp = ddp;
+	queue->ccid_table[ddp->command_id].sgl = sg;
+	queue->ccid_table[ddp->command_id].ccid_gen++;
+	queue->ccid_table[ddp->command_id].sgl_length = count;
+
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_ddp_setup++;
+	if (unlikely(mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, count)))
+		stats->nvmeotcp_ddp_setup_fail++;
+
+	return 0;
+}
+
+void mlx5e_nvmeotcp_ddp_inv_done(struct mlx5e_icosq_wqe_info *wi)
+{
+	struct nvmeotcp_queue_entry *q_entry = wi->nvmeotcp_qe.entry;
+	struct mlx5e_nvmeotcp_queue *queue = q_entry->queue;
+	struct mlx5_core_dev *mdev = queue->priv->mdev;
+	struct tcp_ddp_io *ddp = q_entry->ddp;
+	const struct tcp_ddp_ulp_ops *ulp_ops;
+
+	dma_unmap_sg(mdev->device, ddp->sg_table.sgl,
+		     q_entry->sgl_length, DMA_FROM_DEVICE);
+
+	q_entry->sgl_length = 0;
+
+	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;
+	if (ulp_ops && ulp_ops->resync_request)
+		ulp_ops->ddp_teardown_done(q_entry->ddp_ctx);
+}
+
+void mlx5e_nvmeotcp_ctx_comp(struct mlx5e_icosq_wqe_info *wi)
+{
+	struct mlx5e_nvmeotcp_queue *queue = wi->nvmeotcp_q.queue;
+
+	if (unlikely(!queue))
+		return;
+
+	queue->context_ready = true;
+
+	complete(&queue->done);
+}
+
+static int
+mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev,
+			    struct sock *sk,
+			    struct tcp_ddp_io *ddp,
+			    void *ddp_ctx)
+{
+	struct mlx5e_nvmeotcp_queue *queue =
+		(struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct nvmeotcp_queue_entry *q_entry;
+	struct mlx5e_rq_stats *stats;
+
+	q_entry  = &queue->ccid_table[ddp->command_id];
+	WARN_ON(q_entry->sgl_length == 0);
+
+	q_entry->ddp_ctx = ddp_ctx;
+	q_entry->queue = queue;
+
+	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, 0);
+	stats = &priv->channel_stats[queue->channel_ix].rq;
+	stats->nvmeotcp_ddp_teardown++;
+
+	return 0;
+}
+
+static void
+mlx5e_nvmeotcp_dev_resync(struct net_device *netdev,
+			  struct sock *sk, u32 seq)
+{
+	struct mlx5e_nvmeotcp_queue *queue =
+				(struct mlx5e_nvmeotcp_queue *)tcp_ddp_get_ctx(sk);
+
+	mlx5e_nvmeotcp_post_static_params_wqe(queue, seq);
+}
+
+static const struct tcp_ddp_dev_ops mlx5e_nvmeotcp_ops = {
+	.tcp_ddp_limits = mlx5e_nvmeotcp_offload_limits,
+	.tcp_ddp_sk_add = mlx5e_nvmeotcp_queue_init,
+	.tcp_ddp_sk_del = mlx5e_nvmeotcp_queue_teardown,
+	.tcp_ddp_setup = mlx5e_nvmeotcp_ddp_setup,
+	.tcp_ddp_teardown = mlx5e_nvmeotcp_ddp_teardown,
+	.tcp_ddp_resync = mlx5e_nvmeotcp_dev_resync,
+};
+
+struct mlx5e_nvmeotcp_queue *
+mlx5e_nvmeotcp_get_queue(struct mlx5e_nvmeotcp *nvmeotcp, int id)
+{
+	struct mlx5e_nvmeotcp_queue *queue;
+
+	rcu_read_lock();
+	queue = rhashtable_lookup_fast(&nvmeotcp->queue_hash,
+				       &id, rhash_queues);
+	if (queue && !IS_ERR(queue))
+		if (!refcount_inc_not_zero(&queue->ref_count))
+			queue = NULL;
+	rcu_read_unlock();
+	return queue;
+}
+
+void mlx5e_nvmeotcp_put_queue(struct mlx5e_nvmeotcp_queue *queue)
+{
+	if (refcount_dec_and_test(&queue->ref_count))
+		kfree(queue);
+}
+
+int set_feature_nvme_tcp(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	int err = 0;
+
+	mutex_lock(&priv->state_lock);
+	if (enable)
+		err = mlx5e_accel_fs_tcp_create(priv);
+	else
+		mlx5e_accel_fs_tcp_destroy(priv);
+	mutex_unlock(&priv->state_lock);
+	if (err)
+		return err;
+
+	priv->nvmeotcp->enable = enable;
+	err = mlx5e_safe_reopen_channels(priv);
+	return err;
+}
+
+int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	int err = 0;
+
+	mutex_lock(&priv->state_lock);
+	if (enable)
+		err = mlx5e_accel_fs_tcp_create(priv);
+	else
+		mlx5e_accel_fs_tcp_destroy(priv);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+void mlx5e_nvmeotcp_build_netdev(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	if (!MLX5_CAP_GEN(priv->mdev, nvmeotcp))
+		return;
+
+	if (MLX5_CAP_DEV_NVMEOTCP(priv->mdev, zerocopy)) {
+		netdev->features |= NETIF_F_HW_TCP_DDP;
+		netdev->hw_features |= NETIF_F_HW_TCP_DDP;
+	}
+
+	if (MLX5_CAP_DEV_NVMEOTCP(priv->mdev, crc_rx)) {
+		netdev->features |= NETIF_F_HW_TCP_DDP_CRC;
+		netdev->hw_features |= NETIF_F_HW_TCP_DDP_CRC;
+	}
+
+	netdev->tcp_ddp_ops = &mlx5e_nvmeotcp_ops;
+	priv->nvmeotcp->enable = true;
+}
+
+int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv)
+{
+	int ret = 0;
+
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP) {
+		ret = mlx5e_accel_fs_tcp_create(priv);
+		if (ret)
+			return ret;
+	}
+
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP_CRC)
+		ret = mlx5e_accel_fs_tcp_create(priv);
+
+	return ret;
+}
+
+void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv)
+{
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP)
+		mlx5e_accel_fs_tcp_destroy(priv);
+
+	if (priv->netdev->features & NETIF_F_HW_TCP_DDP_CRC)
+		mlx5e_accel_fs_tcp_destroy(priv);
+}
+
+int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv)
+{
+	struct mlx5e_nvmeotcp *nvmeotcp = kzalloc(sizeof(*nvmeotcp), GFP_KERNEL);
+	int ret = 0;
+
+	if (!nvmeotcp)
+		return -ENOMEM;
+
+	ida_init(&nvmeotcp->queue_ids);
+	ret = rhashtable_init(&nvmeotcp->queue_hash, &rhash_queues);
+	if (ret)
+		goto err_ida;
+
+	priv->nvmeotcp = nvmeotcp;
+	goto out;
+
+err_ida:
+	ida_destroy(&nvmeotcp->queue_ids);
+	kfree(nvmeotcp);
+out:
+	return ret;
+}
+
+void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv)
+{
+	struct mlx5e_nvmeotcp *nvmeotcp = priv->nvmeotcp;
+
+	if (!nvmeotcp)
+		return;
+
+	rhashtable_destroy(&nvmeotcp->queue_hash);
+	ida_destroy(&nvmeotcp->queue_ids);
+	kfree(nvmeotcp);
+	priv->nvmeotcp = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
new file mode 100644
index 000000000000..3b84dd9f49f6
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+#ifndef __MLX5E_NVMEOTCP_H__
+#define __MLX5E_NVMEOTCP_H__
+
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+
+#include "linux/nvme-tcp.h"
+#include "en.h"
+
+struct nvmeotcp_queue_entry {
+	struct mlx5e_nvmeotcp_queue 	*queue;
+	u32				sgl_length;
+	struct mlx5_core_mkey		klm_mkey;
+	struct scatterlist		*sgl;
+	u32				ccid_gen;
+
+	/* for the ddp invalidate done callback */
+	void	 		*ddp_ctx;
+	struct tcp_ddp_io	*ddp;
+};
+
+/**
+ *	struct mlx5e_nvmeotcp_queue - MLX5 metadata for NVMEoTCP queue
+ *	@fh: Flow handle representing the 5-tuple steering for this flow
+ *	@tirn: Destination TIR number created for NVMEoTCP offload
+ *	@id: Flow tag ID used to identify this queue
+ *	@size: NVMEoTCP queue depth
+ *	@sq: Send queue used for sending control messages
+ *	@ccid_table: Table holding metadata for each CC
+ *	@tag_buf_table_id: Tag buffer table for CCIDs
+ *	@hash: Hash table of queues mapped by @id
+ *	@ref_count: Reference count for this structure
+ *	@ccoff: Offset within the current CC
+ *	@pda: Padding alignment
+ *	@ccid_gen: Generation ID for the CCID, used to avoid conflicts in DDP
+ *	@max_klms_per_wqe: Number of KLMs per DDP operation
+ *	@channel_ix: Channel IX for this nvmeotcp_queue
+ *	@sk: The socket used by the NVMe-TCP queue
+ *	@ccid: ID of the current CC
+ *	@ccsglidx: Index within the scatter-gather list (SGL) of the current CC
+ *	@ccoff_inner: Current offset within the @ccsglidx element
+ *	@priv: mlx5e netdev priv
+ *	@inv_done: invalidate callback of the nvme tcp driver
+ */
+struct mlx5e_nvmeotcp_queue {
+	struct tcp_ddp_ctx		tcp_ddp_ctx;
+	struct mlx5_flow_handle		*fh;
+	int				tirn;
+	int				id;
+	u32				size;
+	struct mlx5e_icosq		*sq;
+	struct nvmeotcp_queue_entry	*ccid_table;
+	u32				tag_buf_table_id;
+	struct rhash_head		hash;
+	refcount_t			ref_count;
+	bool				dgst;
+	int				pda;
+	u32				ccid_gen;
+	u32				max_klms_per_wqe;
+	u32				channel_ix;
+	struct sock			*sk;
+
+	/* current ccid fields */
+	off_t				ccoff;
+	int				ccid;
+	int				ccsglidx;
+	int				ccoff_inner;
+
+	/* for ddp invalidate flow */
+	struct mlx5e_priv		*priv;
+
+	/* for flow_steering flow */
+	struct completion		done;
+	bool				context_ready;
+};
+
+struct mlx5e_nvmeotcp {
+	struct ida				queue_ids;
+	struct rhashtable			queue_hash;
+	bool 					enable;
+};
+
+void mlx5e_nvmeotcp_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv);
+int set_feature_nvme_tcp(struct net_device *netdev, bool enable);
+int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable);
+void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv);
+
+int mlx5e_nvmeotcp_get_count(struct mlx5e_priv *priv);
+int mlx5e_nvmeotcp_get_strings(struct mlx5e_priv *priv, uint8_t *data);
+struct mlx5e_nvmeotcp_queue *
+mlx5e_nvmeotcp_get_queue(struct mlx5e_nvmeotcp *nvmeotcp, int id);
+void mlx5e_nvmeotcp_put_queue(struct mlx5e_nvmeotcp_queue *queue);
+
+void mlx5e_nvmeotcp_ddp_inv_done(struct mlx5e_icosq_wqe_info *wi);
+void mlx5e_nvmeotcp_ctx_comp(struct mlx5e_icosq_wqe_info *wi);
+
+int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv);
+void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv);
+#else
+
+static inline void mlx5e_nvmeotcp_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv) { }
+
+static inline int mlx5e_nvmeotcp_get_count(struct mlx5e_priv *priv) { return 0; }
+static inline int mlx5e_nvmeotcp_get_strings(struct mlx5e_priv *priv, uint8_t *data) { return 0; }
+static inline int set_feature_nvme_tcp(struct net_device *netdev, bool enable) { return 0; }
+static inline int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable) { return 0; }
+
+static inline int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv) { }
+
+#endif
+#endif /* __MLX5E_NVMEOTCP_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
new file mode 100644
index 000000000000..e76bea9fd8c8
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+
+#ifndef __MLX5E_NVMEOTCP_UTILS_H__
+#define __MLX5E_NVMEOTCP_UTILS_H__
+
+#include "en.h"
+#include "en_accel/nvmeotcp.h"
+
+enum {
+	MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_START     = 0,
+	MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_TRACKING  = 1,
+	MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_SEARCHING = 2,
+};
+
+struct mlx5_seg_nvmeotcp_static_params {
+	u8     ctx[MLX5_ST_SZ_BYTES(transport_static_params)];
+};
+
+struct mlx5_seg_nvmeotcp_progress_params {
+	u8     ctx[MLX5_ST_SZ_BYTES(nvmeotcp_progress_params)];
+};
+
+struct mlx5e_set_nvmeotcp_static_params_wqe {
+	struct mlx5_wqe_ctrl_seg          ctrl;
+	struct mlx5_wqe_umr_ctrl_seg      uctrl;
+	struct mlx5_mkey_seg              mkc;
+	struct mlx5_seg_nvmeotcp_static_params params;
+};
+
+struct mlx5e_set_nvmeotcp_progress_params_wqe {
+	struct mlx5_wqe_ctrl_seg            ctrl;
+	struct mlx5_seg_nvmeotcp_progress_params params;
+};
+
+struct mlx5e_get_psv_wqe {
+	struct mlx5_wqe_ctrl_seg ctrl;
+	struct mlx5_seg_get_psv  psv;
+};
+
+///////////////////////////////////////////
+#define MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ \
+	(sizeof(struct mlx5e_set_nvmeotcp_static_params_wqe))
+
+#define MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ \
+	(sizeof(struct mlx5e_set_nvmeotcp_progress_params_wqe))
+#define MLX5E_NVMEOTCP_STATIC_PARAMS_OCTWORD_SIZE \
+	(MLX5_ST_SZ_BYTES(transport_static_params) / MLX5_SEND_WQE_DS)
+
+#define MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS \
+	(DIV_ROUND_UP(MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ, MLX5_SEND_WQE_BB))
+#define MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQEBBS \
+	(DIV_ROUND_UP(MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ, MLX5_SEND_WQE_BB))
+
+#define MLX5E_NVMEOTCP_FETCH_STATIC_PARAMS_WQE(sq, pi) \
+	((struct mlx5e_set_nvmeotcp_static_params_wqe *)\
+	 mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_set_nvmeotcp_static_params_wqe)))
+
+#define MLX5E_NVMEOTCP_FETCH_PROGRESS_PARAMS_WQE(sq, pi) \
+	((struct mlx5e_set_nvmeotcp_progress_params_wqe *)\
+	 mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_set_nvmeotcp_progress_params_wqe)))
+
+#define MLX5E_NVMEOTCP_FETCH_KLM_WQE(sq, pi) \
+	((struct mlx5e_umr_wqe *)\
+	 mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_umr_wqe)))
+
+#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_PROGRESS_PARAMS 0x4
+
+void
+build_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue,
+			       struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe,
+			       u32 seq);
+
+void
+build_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue,
+			     struct mlx5e_set_nvmeotcp_static_params_wqe *wqe,
+			     u32 resync_seq);
+
+#endif /* __MLX5E_NVMEOTCP_UTILS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 961cdce37cc4..9813ef699fab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -47,6 +47,7 @@
 #include "en_accel/ipsec.h"
 #include "en_accel/en_accel.h"
 #include "en_accel/tls.h"
+#include "en_accel/nvmeotcp.h"
 #include "accel/ipsec.h"
 #include "accel/tls.h"
 #include "lib/vxlan.h"
@@ -1800,9 +1801,15 @@ static int mlx5e_open_queues(struct mlx5e_channel *c,
 	if (err)
 		goto err_close_async_icosq_cq;
 
-	err = mlx5e_open_tx_cqs(c, params, cparam);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	err = mlx5e_open_cq(c, icocq_moder, &cparam->nvmeotcpsq.cqp, &c->nvmeotcpsq.cq);
 	if (err)
 		goto err_close_icosq_cq;
+#endif
+
+	err = mlx5e_open_tx_cqs(c, params, cparam);
+	if (err)
+		goto err_close_nvmeotcpsq_cq;
 
 	err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam->xdp_sq.cqp, &c->xdpsq.cq);
 	if (err)
@@ -1829,9 +1836,15 @@ static int mlx5e_open_queues(struct mlx5e_channel *c,
 	if (err)
 		goto err_close_async_icosq;
 
-	err = mlx5e_open_sqs(c, params, cparam);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	err = mlx5e_open_icosq(c, params, &cparam->nvmeotcpsq, &c->nvmeotcpsq);
 	if (err)
 		goto err_close_icosq;
+#endif
+
+	err = mlx5e_open_sqs(c, params, cparam);
+	if (err)
+		goto err_close_nvmeotcpsq;
 
 	if (c->xdp) {
 		err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL,
@@ -1860,7 +1873,12 @@ static int mlx5e_open_queues(struct mlx5e_channel *c,
 err_close_sqs:
 	mlx5e_close_sqs(c);
 
+err_close_nvmeotcpsq:
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_close_icosq(&c->nvmeotcpsq);
+
 err_close_icosq:
+#endif
 	mlx5e_close_icosq(&c->icosq);
 
 err_close_async_icosq:
@@ -1881,12 +1899,16 @@ static int mlx5e_open_queues(struct mlx5e_channel *c,
 err_close_tx_cqs:
 	mlx5e_close_tx_cqs(c);
 
+err_close_nvmeotcpsq_cq:
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_close_cq(&c->nvmeotcpsq.cq);
+
 err_close_icosq_cq:
+#endif
 	mlx5e_close_cq(&c->icosq.cq);
 
 err_close_async_icosq_cq:
 	mlx5e_close_cq(&c->async_icosq.cq);
-
 	return err;
 }
 
@@ -1897,6 +1919,9 @@ static void mlx5e_close_queues(struct mlx5e_channel *c)
 	if (c->xdp)
 		mlx5e_close_xdpsq(&c->rq_xdpsq);
 	mlx5e_close_sqs(c);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_close_icosq(&c->nvmeotcpsq);
+#endif
 	mlx5e_close_icosq(&c->icosq);
 	mlx5e_close_icosq(&c->async_icosq);
 	napi_disable(&c->napi);
@@ -1905,6 +1930,9 @@ static void mlx5e_close_queues(struct mlx5e_channel *c)
 	mlx5e_close_cq(&c->rq.cq);
 	mlx5e_close_cq(&c->xdpsq.cq);
 	mlx5e_close_tx_cqs(c);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_close_cq(&c->nvmeotcpsq.cq);
+#endif
 	mlx5e_close_cq(&c->icosq.cq);
 	mlx5e_close_cq(&c->async_icosq.cq);
 }
@@ -1988,6 +2016,9 @@ static void mlx5e_activate_channel(struct mlx5e_channel *c)
 		mlx5e_activate_txqsq(&c->sq[tc]);
 	mlx5e_activate_icosq(&c->icosq);
 	mlx5e_activate_icosq(&c->async_icosq);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_activate_icosq(&c->nvmeotcpsq);
+#endif
 	mlx5e_activate_rq(&c->rq);
 
 	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
@@ -2002,6 +2033,9 @@ static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 		mlx5e_deactivate_xsk(c);
 
 	mlx5e_deactivate_rq(&c->rq);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_deactivate_icosq(&c->nvmeotcpsq);
+#endif
 	mlx5e_deactivate_icosq(&c->async_icosq);
 	mlx5e_deactivate_icosq(&c->icosq);
 	for (tc = 0; tc < c->num_tc; tc++)
@@ -2185,7 +2219,8 @@ static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv,
 	void *cqc = param->cqc;
 
 	MLX5_SET(cqc, cqc, uar_page, priv->mdev->priv.uar->index);
-	if (MLX5_CAP_GEN(priv->mdev, cqe_128_always) && cache_line_size() >= 128)
+	if (MLX5_CAP_GEN(priv->mdev, cqe_128_always) &&
+	    (cache_line_size() >= 128 || param->force_cqe128))
 		MLX5_SET(cqc, cqc, cqe_sz, CQE_STRIDE_128_PAD);
 }
 
@@ -2199,6 +2234,11 @@ void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
 	void *cqc = param->cqc;
 	u8 log_cq_size;
 
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	/* nvme-tcp offload mandates 128 byte cqes */
+	param->force_cqe128 |= priv->nvmeotcp->enable;
+#endif
+
 	switch (params->rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params, xsk) +
@@ -2307,6 +2347,9 @@ static void mlx5e_build_channel_param(struct mlx5e_priv *priv,
 	mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
 	mlx5e_build_icosq_param(priv, icosq_log_wq_sz, &cparam->icosq);
 	mlx5e_build_icosq_param(priv, async_icosq_log_wq_sz, &cparam->async_icosq);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_build_icosq_param(priv, params->log_sq_size, &cparam->nvmeotcpsq);
+#endif
 }
 
 int mlx5e_open_channels(struct mlx5e_priv *priv,
@@ -3851,6 +3894,10 @@ int mlx5e_set_features(struct net_device *netdev, netdev_features_t features)
 	err |= MLX5E_HANDLE_FEATURE(NETIF_F_NTUPLE, set_feature_arfs);
 #endif
 	err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TLS_RX, mlx5e_ktls_set_feature_rx);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TCP_DDP, set_feature_nvme_tcp);
+	err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TCP_DDP_CRC, set_feature_nvme_tcp_crc);
+#endif
 
 	if (err) {
 		netdev->features = oper_features;
@@ -3887,6 +3934,19 @@ static netdev_features_t mlx5e_fix_features(struct net_device *netdev,
 		features &= ~NETIF_F_RXHASH;
 		if (netdev->features & NETIF_F_RXHASH)
 			netdev_warn(netdev, "Disabling rxhash, not supported when CQE compress is active\n");
+
+		features &= ~NETIF_F_HW_TCP_DDP;
+		if (netdev->features & NETIF_F_HW_TCP_DDP)
+			netdev_warn(netdev, "Disabling tcp-ddp offload, not supported when CQE compress is active\n");
+	}
+
+	if (netdev->features & NETIF_F_LRO) {
+		features &= ~NETIF_F_HW_TCP_DDP;
+		if (netdev->features & NETIF_F_HW_TCP_DDP)
+			netdev_warn(netdev, "Disabling tcp-ddp offload, not supported when LRO is active\n");
+		features &= ~NETIF_F_HW_TCP_DDP_CRC;
+		if (netdev->features & NETIF_F_HW_TCP_DDP_CRC)
+			netdev_warn(netdev, "Disabling tcp-ddp-crc offload, not supported when LRO is active\n");
 	}
 
 	mutex_unlock(&priv->state_lock);
@@ -4940,6 +5000,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	mlx5e_set_netdev_dev_addr(netdev);
 	mlx5e_ipsec_build_netdev(priv);
 	mlx5e_tls_build_netdev(priv);
+	mlx5e_nvmeotcp_build_netdev(priv);
 }
 
 void mlx5e_create_q_counters(struct mlx5e_priv *priv)
@@ -5004,6 +5065,9 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
 	err = mlx5e_tls_init(priv);
 	if (err)
 		mlx5_core_err(mdev, "TLS initialization failed, %d\n", err);
+	err = mlx5e_nvmeotcp_init(priv);
+	if (err)
+		mlx5_core_err(mdev, "NVMEoTCP initialization failed, %d\n", err);
 	mlx5e_build_nic_netdev(netdev);
 	err = mlx5e_devlink_port_register(priv);
 	if (err)
@@ -5017,6 +5081,7 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
 	mlx5e_health_destroy_reporters(priv);
 	mlx5e_devlink_port_unregister(priv);
+	mlx5e_nvmeotcp_cleanup(priv);
 	mlx5e_tls_cleanup(priv);
 	mlx5e_ipsec_cleanup(priv);
 	mlx5e_netdev_cleanup(priv->netdev, priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 599f5b5ebc97..ac99dbb3573a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -47,6 +47,7 @@
 #include "fpga/ipsec.h"
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
+#include "en_accel/nvmeotcp.h"
 #include "lib/clock.h"
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
@@ -617,16 +618,26 @@ void mlx5e_free_icosq_descs(struct mlx5e_icosq *sq)
 		ci = mlx5_wq_cyc_ctr2ix(&sq->wq, sqcc);
 		wi = &sq->db.wqe_info[ci];
 		sqcc += wi->num_wqebbs;
-#ifdef CONFIG_MLX5_EN_TLS
 		switch (wi->wqe_type) {
+#ifdef CONFIG_MLX5_EN_TLS
 		case MLX5E_ICOSQ_WQE_SET_PSV_TLS:
 			mlx5e_ktls_handle_ctx_completion(wi);
 			break;
 		case MLX5E_ICOSQ_WQE_GET_PSV_TLS:
 			mlx5e_ktls_handle_get_psv_completion(wi, sq);
 			break;
-		}
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+		case MLX5E_ICOSQ_WQE_UMR_NVME_TCP:
+			break;
+		case MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE:
+			mlx5e_nvmeotcp_ddp_inv_done(wi);
+			break;
+		case MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP:
+			mlx5e_nvmeotcp_ctx_comp(wi);
+			break;
+#endif
+		}
 	}
 	sq->cc = sqcc;
 }
@@ -695,6 +706,16 @@ int mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 			case MLX5E_ICOSQ_WQE_GET_PSV_TLS:
 				mlx5e_ktls_handle_get_psv_completion(wi, sq);
 				break;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+			case MLX5E_ICOSQ_WQE_UMR_NVME_TCP:
+				break;
+			case MLX5E_ICOSQ_WQE_UMR_NVME_TCP_INVALIDATE:
+				mlx5e_nvmeotcp_ddp_inv_done(wi);
+				break;
+			case MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP:
+				mlx5e_nvmeotcp_ctx_comp(wi);
+				break;
 #endif
 			default:
 				netdev_WARN_ONCE(cq->channel->netdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 78f6a6f0a7e0..25d203d64bb2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -34,6 +34,7 @@
 #include "en.h"
 #include "en_accel/tls.h"
 #include "en_accel/en_accel.h"
+#include "en_accel/nvmeotcp.h"
 
 static unsigned int stats_grps_num(struct mlx5e_priv *priv)
 {
@@ -189,6 +190,14 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_resync_res_ok) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_resync_res_skip) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_err) },
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_init) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_init_fail) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_teardown) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup_fail) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_teardown) },
 #endif
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_events) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_poll) },
@@ -314,6 +323,14 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 		s->rx_tls_resync_res_ok     += rq_stats->tls_resync_res_ok;
 		s->rx_tls_resync_res_skip   += rq_stats->tls_resync_res_skip;
 		s->rx_tls_err               += rq_stats->tls_err;
+#endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+		s->rx_nvmeotcp_queue_init      += rq_stats->nvmeotcp_queue_init;
+		s->rx_nvmeotcp_queue_init_fail += rq_stats->nvmeotcp_queue_init_fail;
+		s->rx_nvmeotcp_queue_teardown  += rq_stats->nvmeotcp_queue_teardown;
+		s->rx_nvmeotcp_ddp_setup       += rq_stats->nvmeotcp_ddp_setup;
+		s->rx_nvmeotcp_ddp_setup_fail  += rq_stats->nvmeotcp_ddp_setup_fail;
+		s->rx_nvmeotcp_ddp_teardown    += rq_stats->nvmeotcp_ddp_teardown;
 #endif
 		s->ch_events      += ch_stats->events;
 		s->ch_poll        += ch_stats->poll;
@@ -390,6 +407,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 			s->tx_tls_drop_no_sync_data += sq_stats->tls_drop_no_sync_data;
 			s->tx_tls_drop_bypass_req   += sq_stats->tls_drop_bypass_req;
 #endif
+
 			s->tx_cqes		+= sq_stats->cqes;
 
 			/* https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92657 */
@@ -1559,6 +1577,14 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_resync_res_skip) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_err) },
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_init) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_init_fail) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_teardown) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup_fail) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_teardown) },
+#endif
 };
 
 static const struct counter_desc sq_stats_desc[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 162daaadb0d8..5c1c0ad88ff4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -175,6 +175,14 @@ struct mlx5e_sw_stats {
 	u64 rx_congst_umr;
 	u64 rx_arfs_err;
 	u64 rx_recover;
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	u64 rx_nvmeotcp_queue_init;
+	u64 rx_nvmeotcp_queue_init_fail;
+	u64 rx_nvmeotcp_queue_teardown;
+	u64 rx_nvmeotcp_ddp_setup;
+	u64 rx_nvmeotcp_ddp_setup_fail;
+	u64 rx_nvmeotcp_ddp_teardown;
+#endif
 	u64 ch_events;
 	u64 ch_poll;
 	u64 ch_arm;
@@ -338,6 +346,14 @@ struct mlx5e_rq_stats {
 	u64 tls_resync_res_skip;
 	u64 tls_err;
 #endif
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	u64 nvmeotcp_queue_init;
+	u64 nvmeotcp_queue_init_fail;
+	u64 nvmeotcp_queue_teardown;
+	u64 nvmeotcp_ddp_setup;
+	u64 nvmeotcp_ddp_setup_fail;
+	u64 nvmeotcp_ddp_teardown;
+#endif
 };
 
 struct mlx5e_sq_stats {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index d5868670f8a5..c2bd2e8d5508 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -158,6 +158,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 		 * queueing more WQEs and overflowing the async ICOSQ.
 		 */
 		clear_bit(MLX5E_SQ_STATE_PENDING_XSK_TX, &c->async_icosq.state);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_poll_ico_cq(&c->nvmeotcpsq.cq);
+#endif
 
 	busy |= INDIRECT_CALL_2(rq->post_wqes,
 				mlx5e_post_rx_mpwqes,
@@ -196,6 +199,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	mlx5e_cq_arm(&rq->cq);
 	mlx5e_cq_arm(&c->icosq.cq);
 	mlx5e_cq_arm(&c->async_icosq.cq);
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+	mlx5e_cq_arm(&c->nvmeotcpsq.cq);
+#endif
 	mlx5e_cq_arm(&c->xdpsq.cq);
 
 	if (xsk_open) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index 02558ac2ace6..5e7544ccae91 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -256,6 +256,12 @@ int mlx5_query_hca_caps(struct mlx5_core_dev *dev)
 			return err;
 	}
 
+	if (MLX5_CAP_GEN(dev, nvmeotcp)) {
+		err = mlx5_core_get_caps(dev, MLX5_CAP_DEV_NVMEOTCP);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 81ca5989009b..afadf4cf6d7a 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -263,6 +263,7 @@ enum {
 enum {
 	MLX5_MKEY_MASK_LEN		= 1ull << 0,
 	MLX5_MKEY_MASK_PAGE_SIZE	= 1ull << 1,
+	MLX5_MKEY_MASK_XLT_OCT_SIZE	= 1ull << 2,
 	MLX5_MKEY_MASK_START_ADDR	= 1ull << 6,
 	MLX5_MKEY_MASK_PD		= 1ull << 7,
 	MLX5_MKEY_MASK_EN_RINVAL	= 1ull << 8,
@@ -1153,6 +1154,7 @@ enum mlx5_cap_type {
 	MLX5_CAP_VDPA_EMULATION = 0x13,
 	MLX5_CAP_DEV_EVENT = 0x14,
 	MLX5_CAP_IPSEC,
+	MLX5_CAP_DEV_NVMEOTCP = 0x19,
 	/* NUM OF CAP Types */
 	MLX5_CAP_NUM
 };
@@ -1373,6 +1375,12 @@ enum mlx5_qcam_feature_groups {
 #define MLX5_CAP_IPSEC(mdev, cap)\
 	MLX5_GET(ipsec_cap, (mdev)->caps.hca_cur[MLX5_CAP_IPSEC], cap)
 
+#define MLX5_CAP_DEV_NVMEOTCP(mdev, cap)\
+	MLX5_GET(nvmeotcp_cap, mdev->caps.hca_cur[MLX5_CAP_DEV_NVMEOTCP], cap)
+
+#define MLX5_CAP64_NVMEOTCP(mdev, cap)\
+	MLX5_GET64(nvmeotcp_cap, mdev->caps.hca_cur[MLX5_CAP_DEV_NVMEOTCP], cap)
+
 enum {
 	MLX5_CMD_STAT_OK			= 0x0,
 	MLX5_CMD_STAT_INT_ERR			= 0x1,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index de1ffb4804d6..a85d1c4b3ff0 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1231,7 +1231,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         log_max_srq_sz[0x8];
 	u8         log_max_qp_sz[0x8];
 	u8         event_cap[0x1];
-	u8         reserved_at_91[0x7];
+	u8         reserved_at_91[0x5];
+	u8         nvmeotcp[0x1];
+	u8         reserved_at_97[0x1];
 	u8         prio_tag_required[0x1];
 	u8         reserved_at_99[0x2];
 	u8         log_max_qp[0x5];
@@ -1519,7 +1521,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 
 	u8         general_obj_types[0x40];
 
-	u8         reserved_at_440[0x20];
+	u8         reserved_at_440[0x8];
+	u8         create_qp_start_hint[0x18];
 
 	u8         reserved_at_460[0x3];
 	u8         log_max_uctx[0x5];
@@ -2969,6 +2972,21 @@ struct mlx5_ifc_roce_addr_layout_bits {
 	u8         reserved_at_e0[0x20];
 };
 
+struct mlx5_ifc_nvmeotcp_cap_bits {
+	u8    zerocopy[0x1];
+	u8    crc_rx[0x1];
+	u8    crc_tx[0x1];
+	u8    reserved_at_3[0x15];
+	u8    version[0x8];
+
+	u8    reserved_at_20[0x13];
+	u8    log_max_nvmeotcp_tag_buffer_table[0x5];
+	u8    reserved_at_38[0x3];
+	u8    log_max_nvmeotcp_tag_buffer_size[0x5];
+
+	u8    reserved_at_40[0x7c0];
+};
+
 union mlx5_ifc_hca_cap_union_bits {
 	struct mlx5_ifc_cmd_hca_cap_bits cmd_hca_cap;
 	struct mlx5_ifc_odp_cap_bits odp_cap;
@@ -2985,6 +3003,7 @@ union mlx5_ifc_hca_cap_union_bits {
 	struct mlx5_ifc_tls_cap_bits tls_cap;
 	struct mlx5_ifc_device_mem_cap_bits device_mem_cap;
 	struct mlx5_ifc_virtio_emulation_cap_bits virtio_emulation_cap;
+	struct mlx5_ifc_nvmeotcp_cap_bits nvmeotcp_cap;
 	u8         reserved_at_0[0x8000];
 };
 
@@ -3179,7 +3198,9 @@ struct mlx5_ifc_tirc_bits {
 
 	u8         disp_type[0x4];
 	u8         tls_en[0x1];
-	u8         reserved_at_25[0x1b];
+	u8         nvmeotcp_zero_copy_en[0x1];
+	u8         nvmeotcp_crc_en[0x1];
+	u8         reserved_at_27[0x19];
 
 	u8         reserved_at_40[0x40];
 
@@ -3210,7 +3231,8 @@ struct mlx5_ifc_tirc_bits {
 
 	struct mlx5_ifc_rx_hash_field_select_bits rx_hash_field_selector_inner;
 
-	u8         reserved_at_2c0[0x4c0];
+	u8         nvmeotcp_tag_buffer_table_id[0x20];
+	u8         reserved_at_2e0[0x4a0];
 };
 
 enum {
@@ -10655,11 +10677,13 @@ struct mlx5_ifc_affiliated_event_header_bits {
 enum {
 	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = BIT(0xc),
 	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE = BIT(0x21),
 };
 
 enum {
 	MLX5_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 0xc,
 	MLX5_GENERAL_OBJECT_TYPES_IPSEC = 0x13,
+	MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE = 0x21
 };
 
 enum {
@@ -10734,6 +10758,20 @@ struct mlx5_ifc_create_encryption_key_in_bits {
 	struct mlx5_ifc_encryption_key_obj_bits encryption_key_object;
 };
 
+struct mlx5_ifc_nvmeotcp_tag_buf_table_obj_bits {
+	u8    modify_field_select[0x40];
+
+	u8    reserved_at_20[0x20];
+
+	u8    reserved_at_40[0x1b];
+	u8    log_tag_buffer_table_size[0x5];
+};
+
+struct mlx5_ifc_create_nvmeotcp_tag_buf_table_in_bits {
+	struct mlx5_ifc_general_obj_in_cmd_hdr_bits general_obj_in_cmd_hdr;
+	struct mlx5_ifc_nvmeotcp_tag_buf_table_obj_bits nvmeotcp_tag_buf_table_obj;
+};
+
 enum {
 	MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_KEY_SIZE_128 = 0x0,
 	MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_KEY_SIZE_256 = 0x1,
@@ -10744,6 +10782,18 @@ enum {
 	MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_TYPE_IPSEC = 0x2,
 };
 
+enum {
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_XTS               = 0x0,
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_TLS               = 0x1,
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP           = 0x2,
+	MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP_WITH_TLS  = 0x3,
+};
+
+enum {
+	MLX5_TRANSPORT_STATIC_PARAMS_TI_INITIATOR  = 0x0,
+	MLX5_TRANSPORT_STATIC_PARAMS_TI_TARGET     = 0x1,
+};
+
 struct mlx5_ifc_tls_static_params_bits {
 	u8         const_2[0x2];
 	u8         tls_version[0x4];
@@ -10784,4 +10834,67 @@ enum {
 	MLX5_MTT_PERM_RW	= MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE,
 };
 
+struct mlx5_ifc_nvmeotcp_progress_params_bits {
+	u8    valid[0x1];
+	u8    reserved_at_1[0x7];
+	u8    pd[0x18];
+
+	u8    next_pdu_tcp_sn[0x20];
+
+	u8    hw_resync_tcp_sn[0x20];
+
+	u8    pdu_tracker_state[0x2];
+	u8    offloading_state[0x2];
+	u8    reserved_at_64[0xc];
+	u8    cccid_ttag[0x10];
+};
+
+struct mlx5_ifc_transport_static_params_bits {
+	u8    const_2[0x2];
+	u8    tls_version[0x4];
+	u8    const_1[0x2];
+	u8    reserved_at_8[0x14];
+	u8    acc_type[0x4];
+
+	u8    reserved_at_20[0x20];
+
+	u8    initial_record_number[0x40];
+
+	u8    resync_tcp_sn[0x20];
+
+	u8    gcm_iv[0x20];
+
+	u8    implicit_iv[0x40];
+
+	u8    reserved_at_100[0x8];
+	u8    dek_index[0x18];
+
+	u8    reserved_at_120[0x15];
+	u8    ti[0x1];
+	u8    zero_copy_en[0x1];
+	u8    ddgst_offload_en[0x1];
+	u8    hdgst_offload_en[0x1];
+	u8    ddgst_en[0x1];
+	u8    hddgst_en[0x1];
+	u8    pda[0x5];
+
+	u8    nvme_resync_tcp_sn[0x20];
+
+	u8    reserved_at_160[0xa0];
+};
+
+struct mlx5_ifc_nvmeotcp_sgl_entry_bits {
+	u8    address[0x40];
+
+	u8    byte_count[0x20];
+};
+
+struct mlx5_ifc_nvmeotcp_umr_params_bits {
+	u8    ccid[0x20];
+
+	u8    reserved_at_20[0x10];
+	u8    sgl_length[0x10];
+
+	struct mlx5_ifc_nvmeotcp_sgl_entry_bits sgl[0];
+};
 #endif /* MLX5_IFC_H */
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 36492a1342cf..8b62d3f4a868 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -220,6 +220,7 @@ struct mlx5_wqe_ctrl_seg {
 #define MLX5_WQE_CTRL_OPCODE_MASK 0xff
 #define MLX5_WQE_CTRL_WQE_INDEX_MASK 0x00ffff00
 #define MLX5_WQE_CTRL_WQE_INDEX_SHIFT 8
+#define MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT 8
 
 enum {
 	MLX5_ETH_WQE_L3_INNER_CSUM      = 1 << 4,
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next RFC v1 10/10] net/mlx5e: NVMEoTCP, data-path for DDP offload
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (8 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 09/10] net/mlx5e: Add NVMEoTCP offload Boris Pismenny
@ 2020-09-30 16:20 ` Boris Pismenny
  2020-10-09  0:08 ` [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Sagi Grimberg
  10 siblings, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-09-30 16:20 UTC (permalink / raw)
  To: kuba, davem, saeedm, hch, sagi, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

NVMEoTCP direct data placement constructs an SKB from each CQE, while
pointing at NVME buffers.

This enables the offload, as the NVMe-TCP layer will skip the copy when
src == dst.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   1 +
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        | 256 ++++++++++++++++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |  25 ++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  51 +++-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  12 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |   8 +
 include/linux/mlx5/device.h                   |  30 +-
 10 files changed, 379 insertions(+), 8 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9dd6b41c2486..89ffb1dae75c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -85,4 +85,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
 
-mlx5_core-$(CONFIG_MLX5_EN_NVMEOTCP) += en_accel/fs_tcp.o en_accel/nvmeotcp.o
+mlx5_core-$(CONFIG_MLX5_EN_NVMEOTCP) += en_accel/fs_tcp.o en_accel/nvmeotcp.o en_accel/nvmeotcp_rxtx.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index a8c0fc98b394..47611401a55d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -536,6 +536,7 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+			       struct mlx5_cqe64 *cqe,
 			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 8e7b877d8a12..9a6fbd1b1c34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -25,6 +25,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, void *data,
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index 7f88ccf67fdd..112c5b3ec165 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -11,6 +11,7 @@
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
new file mode 100644
index 000000000000..93b8ab497460
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+
+#include "en_accel/nvmeotcp_rxtx.h"
+#include "en_accel/nvmeotcp.h"
+#include <linux/mlx5/mlx5_ifc.h>
+
+#define	MLX5E_TC_FLOW_ID_MASK  0x00ffffff
+static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue,
+				   struct mlx5e_cqe128 *cqe128)
+{
+	const struct tcp_ddp_ulp_ops *ulp_ops;
+	struct mlx5e_rq_stats *stats;
+	u32 seq;
+
+	seq = be32_to_cpu(cqe128->resync_tcp_sn);
+	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;
+	if (ulp_ops && ulp_ops->resync_request)
+		ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ);
+
+	stats = queue->priv->channels.c[queue->channel_ix]->rq.stats;
+	stats->nvmeotcp_resync++;
+}
+
+static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue *queue)
+{
+	struct nvmeotcp_queue_entry *nqe = &queue->ccid_table[queue->ccid];
+
+	queue->ccoff += nqe->sgl[queue->ccsglidx].length;
+	queue->ccoff_inner = 0;
+	queue->ccsglidx++;
+}
+
+static inline void
+mlx5e_nvmeotcp_add_skb_frag(struct net_device *netdev, struct sk_buff *skb,
+			    struct mlx5e_nvmeotcp_queue *queue,
+			    struct nvmeotcp_queue_entry *nqe, u32 fragsz)
+{
+	dma_sync_single_for_cpu(&netdev->dev,
+				nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,
+				fragsz, DMA_FROM_DEVICE);
+	page_ref_inc(compound_head(sg_page(&(nqe->sgl[queue->ccsglidx]))));
+	// XXX: consider reducing the truesize, as no new memory is consumed
+	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+			sg_page(&(nqe->sgl[queue->ccsglidx])),
+			nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,
+			fragsz,
+			fragsz);
+}
+
+int mlx5_nvmeotcp_get_headlen(struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+{
+	struct mlx5e_cqe128 *cqe128;
+
+	if (!cqe_is_nvmeotcp_zc(cqe) || cqe_is_nvmeotcp_resync(cqe))
+		return cqe_bcnt;
+
+	cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
+	return be16_to_cpu(cqe128->hlen);
+}
+
+static struct sk_buff*
+mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue,
+				 struct sk_buff *skb, skb_frag_t *org_frags,
+				 int org_nr_frags, int frag_index)
+{
+	struct mlx5e_priv *priv = queue->priv;
+	struct mlx5e_rq_stats *stats;
+
+	while (org_nr_frags != frag_index) {
+		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
+			dev_kfree_skb_any(skb);
+			stats = priv->channels.c[queue->channel_ix]->rq.stats;
+			stats->nvmeotcp_drop++;
+			return NULL;
+		}
+		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+				skb_frag_page(&org_frags[frag_index]),
+				skb_frag_off(&org_frags[frag_index]),
+				skb_frag_size(&org_frags[frag_index]),
+				skb_frag_size(&org_frags[frag_index]));
+		page_ref_inc(skb_frag_page(&org_frags[frag_index]));
+		frag_index++;
+	}
+	return skb;
+}
+
+static struct sk_buff*
+mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff *skb,
+		       int offset, int len)
+{
+	struct mlx5e_priv *priv = queue->priv;
+	struct mlx5e_rq_stats *stats;
+
+	if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
+		dev_kfree_skb_any(skb);
+		stats = priv->channels.c[queue->channel_ix]->rq.stats;
+		stats->nvmeotcp_drop++;
+		return NULL;
+	}
+	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+			virt_to_page(skb->data),
+			offset,
+			len,
+			len);
+	page_ref_inc(virt_to_page(skb->data));
+	return skb;
+}
+
+static void mlx5_nvmeotcp_trim_nonlinear(struct sk_buff *skb,
+					 skb_frag_t *org_frags,
+					 int *frag_index,
+					 int remaining)
+{
+	unsigned int frag_size;
+	int nr_frags;
+
+	/* skip @remaining bytes in frags */
+	*frag_index = 0;
+	while (remaining) {
+		frag_size = skb_frag_size(&skb_shinfo(skb)->frags[*frag_index]);
+		if (frag_size > remaining) {
+			skb_frag_off_add(&skb_shinfo(skb)->frags[*frag_index],
+					 remaining);
+			skb_frag_size_sub(&skb_shinfo(skb)->frags[*frag_index],
+					  remaining);
+			remaining = 0;
+		} else {
+			remaining -= frag_size;
+			skb_frag_unref(skb, *frag_index);
+			*frag_index += 1;
+		}
+	}
+
+	/* save original frags for the tail and unref */
+	nr_frags = skb_shinfo(skb)->nr_frags;
+	memcpy(&org_frags[*frag_index], &skb_shinfo(skb)->frags[*frag_index],
+	       (nr_frags - *frag_index) * sizeof(skb_frag_t));
+	while (--nr_frags >= *frag_index)
+		skb_frag_unref(skb, nr_frags);
+
+	/* remove frags from skb */
+	skb_shinfo(skb)->nr_frags = 0;
+	skb->len -= skb->data_len;
+	skb->truesize -= skb->data_len;
+	skb->data_len = 0;
+}
+
+struct sk_buff*
+mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
+			     bool linear)
+{
+	int ccoff, cclen, hlen, ccid, remaining, fragsz, to_copy = 0;
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	skb_frag_t org_frags[MAX_SKB_FRAGS];
+	struct mlx5e_nvmeotcp_queue *queue;
+	struct nvmeotcp_queue_entry *nqe;
+	struct mlx5e_rq_stats *stats;
+	int org_nr_frags, frag_index;
+	struct mlx5e_cqe128 *cqe128;
+	u32 queue_id;
+
+	queue_id = (be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK);
+	queue = mlx5e_nvmeotcp_get_queue(priv->nvmeotcp, queue_id);
+	if (unlikely(!queue)) {
+		dev_kfree_skb_any(skb);
+		return NULL;
+	}
+
+	cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
+	if (cqe_is_nvmeotcp_resync(cqe)) {
+		nvmeotcp_update_resync(queue, cqe128);
+		mlx5e_nvmeotcp_put_queue(queue);
+		return skb;
+	}
+
+	stats = priv->channels.c[queue->channel_ix]->rq.stats;
+
+	/* cc ddp from cqe */
+	ccid = be16_to_cpu(cqe128->ccid);
+	ccoff = be32_to_cpu(cqe128->ccoff);
+	cclen = be16_to_cpu(cqe128->cclen);
+	hlen  = be16_to_cpu(cqe128->hlen);
+
+	/* carve a hole in the skb for DDP data */
+	if (linear) {
+		skb_trim(skb, hlen);
+	} else {
+		org_nr_frags = skb_shinfo(skb)->nr_frags;
+		mlx5_nvmeotcp_trim_nonlinear(skb, org_frags, &frag_index,
+					     cclen);
+	}
+
+	nqe = &queue->ccid_table[ccid];
+
+	/* packet starts new ccid? */
+	if (queue->ccid != ccid || queue->ccid_gen != nqe->ccid_gen) {
+		queue->ccid = ccid;
+		queue->ccoff = 0;
+		queue->ccoff_inner = 0;
+		queue->ccsglidx = 0;
+		queue->ccid_gen = nqe->ccid_gen;
+	}
+
+	/* skip inside cc until the ccoff in the cqe */
+	while (queue->ccoff + queue->ccoff_inner < ccoff) {
+		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;
+		fragsz = min_t(off_t, remaining,
+			       ccoff - (queue->ccoff + queue->ccoff_inner));
+
+		if (fragsz == remaining)
+			mlx5e_nvmeotcp_advance_sgl_iter(queue);
+		else
+			queue->ccoff_inner += fragsz;
+	}
+
+	/* adjust the skb according to the cqe cc */
+	while (to_copy < cclen) {
+		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {
+			dev_kfree_skb_any(skb);
+			stats->nvmeotcp_drop++;
+			mlx5e_nvmeotcp_put_queue(queue);
+			return NULL;
+		}
+
+		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;
+		fragsz = min_t(int, remaining, cclen - to_copy);
+
+		mlx5e_nvmeotcp_add_skb_frag(netdev, skb, queue, nqe, fragsz);
+		to_copy += fragsz;
+		if (fragsz == remaining)
+			mlx5e_nvmeotcp_advance_sgl_iter(queue);
+		else
+			queue->ccoff_inner += fragsz;
+	}
+
+	if (cqe_bcnt > hlen + cclen) {
+		remaining = cqe_bcnt - hlen - cclen;
+		if (linear)
+			skb = mlx5_nvmeotcp_add_tail(queue, skb,
+						     offset_in_page(skb->data) +
+								hlen + cclen,
+						     remaining);
+		else
+			skb = mlx5_nvmeotcp_add_tail_nonlinear(queue, skb,
+							       org_frags,
+							       org_nr_frags,
+							       frag_index);
+	}
+
+	stats->nvmeotcp_offload_packets++;
+	stats->nvmeotcp_offload_bytes += cclen;
+	mlx5e_nvmeotcp_put_queue(queue);
+	return skb;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
new file mode 100644
index 000000000000..85af1650633c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies.
+
+#ifndef __MLX5E_NVMEOTCP_RXTX_H__
+#define __MLX5E_NVMEOTCP_RXTX_H__
+
+#ifdef CONFIG_MLX5_EN_NVMEOTCP
+
+#include <linux/skbuff.h>
+#include "en.h"
+
+struct sk_buff*
+mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt, bool linear);
+
+int mlx5_nvmeotcp_get_headlen(struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
+#else
+int mlx5_nvmeotcp_get_headlen(struct mlx5_cqe64 *cqe, u32 cqe_bcnt) { return cqe_bcnt; }
+struct sk_buff*
+mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt, bool linear) { return skb; }
+
+#endif /* CONFIG_MLX5_EN_NVMEOTCP */
+
+#endif /* __MLX5E_NVMEOTCP_RXTX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index ac99dbb3573a..b60b4be152e4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -48,6 +48,7 @@
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
 #include "en_accel/nvmeotcp.h"
+#include "en_accel/nvmeotcp_rxtx.h"
 #include "lib/clock.h"
 #include "en/xdp.h"
 #include "en/xsk/rx.h"
@@ -57,9 +58,11 @@
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				struct mlx5_cqe64 *cqe,
 				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				   struct mlx5_cqe64 *cqe,
 				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
 static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -1076,6 +1079,10 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	if (unlikely(mlx5_ipsec_is_rx_flow(cqe)))
 		mlx5e_ipsec_offload_handle_rx_skb(netdev, skb, cqe);
 
+#if defined(CONFIG_TCP_DDP_CRC) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	skb->ddp_crc = cqe_is_nvmeotcp_crcvalid(cqe);
+#endif
+
 	if (lro_num_seg > 1) {
 		mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
 		skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
@@ -1189,16 +1196,28 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	/* queue up for recycling/reuse */
 	page_ref_inc(di->page);
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, true);
+#endif
+
 	return skb;
 }
 
+static u16 mlx5e_get_headlen_hint(struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+{
+	return  min_t(u32, MLX5E_RX_MAX_HEAD,
+		      mlx5_nvmeotcp_get_headlen(cqe, cqe_bcnt));
+}
+
 static struct sk_buff *
 mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			     struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
+	u16 headlen = mlx5e_get_headlen_hint(cqe, cqe_bcnt);
 	struct mlx5e_wqe_frag_info *head_wi = wi;
-	u16 headlen      = min_t(u32, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	u16 frag_headlen = headlen;
 	u16 byte_cnt     = cqe_bcnt - headlen;
 	struct sk_buff *skb;
@@ -1207,7 +1226,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	 * might spread among multiple pages.
 	 */
 	skb = napi_alloc_skb(rq->cq.napi,
-			     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+			     ALIGN(headlen, sizeof(long)));
 	if (unlikely(!skb)) {
 		rq->stats->buff_alloc_err++;
 		return NULL;
@@ -1233,6 +1252,12 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	skb->tail += headlen;
 	skb->len  += headlen;
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, false);
+#endif
+
 	return skb;
 }
 
@@ -1386,7 +1411,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -1417,17 +1442,18 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_rep = {
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				   struct mlx5_cqe64 *cqe,
 				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
 {
-	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
+	u16 headlen = mlx5e_get_headlen_hint(cqe, cqe_bcnt);
 	u32 frag_offset    = head_offset + headlen;
 	u32 byte_cnt       = cqe_bcnt - headlen;
 	struct mlx5e_dma_info *head_di = di;
 	struct sk_buff *skb;
 
 	skb = napi_alloc_skb(rq->cq.napi,
-			     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+			     ALIGN(headlen, sizeof(long)));
 	if (unlikely(!skb)) {
 		rq->stats->buff_alloc_err++;
 		return NULL;
@@ -1458,11 +1484,18 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	skb->tail += headlen;
 	skb->len  += headlen;
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, false);
+#endif
+
 	return skb;
 }
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
+				struct mlx5_cqe64 *cqe,
 				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1504,6 +1537,12 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	/* queue up for recycling/reuse */
 	page_ref_inc(di->page);
 
+#if defined(CONFIG_TCP_DDP) && defined(CONFIG_MLX5_EN_NVMEOTCP)
+	if (cqe_is_nvmeotcp_zc(cqe))
+		skb = mlx5e_nvmeotcp_handle_rx_skb(rq->netdev, skb, cqe,
+						   cqe_bcnt, true);
+#endif
+
 	return skb;
 }
 
@@ -1542,7 +1581,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 25d203d64bb2..8fe28694d7cf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -198,6 +198,10 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup_fail) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_teardown) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_drop) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_resync) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_offload_packets) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_offload_bytes) },
 #endif
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_events) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_poll) },
@@ -331,6 +335,10 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 		s->rx_nvmeotcp_ddp_setup       += rq_stats->nvmeotcp_ddp_setup;
 		s->rx_nvmeotcp_ddp_setup_fail  += rq_stats->nvmeotcp_ddp_setup_fail;
 		s->rx_nvmeotcp_ddp_teardown    += rq_stats->nvmeotcp_ddp_teardown;
+		s->rx_nvmeotcp_drop            += rq_stats->nvmeotcp_drop;
+		s->rx_nvmeotcp_resync          += rq_stats->nvmeotcp_resync;
+		s->rx_nvmeotcp_offload_packets += rq_stats->nvmeotcp_offload_packets;
+		s->rx_nvmeotcp_offload_bytes   += rq_stats->nvmeotcp_offload_bytes;
 #endif
 		s->ch_events      += ch_stats->events;
 		s->ch_poll        += ch_stats->poll;
@@ -1584,6 +1592,10 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup_fail) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_teardown) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_drop) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_resync) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_offload_packets) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_offload_bytes) },
 #endif
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 5c1c0ad88ff4..be1574e61945 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -182,6 +182,10 @@ struct mlx5e_sw_stats {
 	u64 rx_nvmeotcp_ddp_setup;
 	u64 rx_nvmeotcp_ddp_setup_fail;
 	u64 rx_nvmeotcp_ddp_teardown;
+	u64 rx_nvmeotcp_drop;
+	u64 rx_nvmeotcp_resync;
+	u64 rx_nvmeotcp_offload_packets;
+	u64 rx_nvmeotcp_offload_bytes;
 #endif
 	u64 ch_events;
 	u64 ch_poll;
@@ -353,6 +357,10 @@ struct mlx5e_rq_stats {
 	u64 nvmeotcp_ddp_setup;
 	u64 nvmeotcp_ddp_setup_fail;
 	u64 nvmeotcp_ddp_teardown;
+	u64 nvmeotcp_drop;
+	u64 nvmeotcp_resync;
+	u64 nvmeotcp_offload_packets;
+	u64 nvmeotcp_offload_bytes;
 #endif
 };
 
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index afadf4cf6d7a..c1a75f727ade 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -779,7 +779,7 @@ struct mlx5_err_cqe {
 
 struct mlx5_cqe64 {
 	u8		tls_outer_l3_tunneled;
-	u8		rsvd0;
+	u8		nvmetcp;
 	__be16		wqe_id;
 	u8		lro_tcppsh_abort_dupack;
 	u8		lro_min_ttl;
@@ -812,6 +812,19 @@ struct mlx5_cqe64 {
 	u8		op_own;
 };
 
+struct mlx5e_cqe128 {
+	__be16		cclen;
+	__be16		hlen;
+	union {
+		__be32		resync_tcp_sn;
+		__be32		ccoff;
+	};
+	__be16		ccid;
+	__be16		rsvd8;
+	u8		rsvd12[52];
+	struct mlx5_cqe64 cqe64;
+};
+
 struct mlx5_mini_cqe8 {
 	union {
 		__be32 rx_hash_result;
@@ -842,6 +855,21 @@ enum {
 
 #define MLX5_MINI_CQE_ARRAY_SIZE 8
 
+static inline bool cqe_is_nvmeotcp_resync(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 6) & 0x1);
+}
+
+static inline bool cqe_is_nvmeotcp_crcvalid(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 5) & 0x1);
+}
+
+static inline bool cqe_is_nvmeotcp_zc(struct mlx5_cqe64 *cqe)
+{
+	return ((cqe->nvmetcp >> 4) & 0x1);
+}
+
 static inline u8 mlx5_get_cqe_format(struct mlx5_cqe64 *cqe)
 {
 	return (cqe->op_own >> 2) & 0x3;
-- 
2.24.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload
  2020-09-30 16:20 ` [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload Boris Pismenny
@ 2020-10-08 21:47   ` Sagi Grimberg
  2020-10-11 14:44     ` Boris Pismenny
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 21:47 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


> + * tcp_ddp.h
> + *	Author:	Boris Pismenny <borisp@mellanox.com>
> + *	Copyright (C) 2020 Mellanox Technologies.
> + */
> +#ifndef _TCP_DDP_H
> +#define _TCP_DDP_H
> +
> +#include <linux/blkdev.h>

Why is blkdev.h needed?

> +#include <linux/netdevice.h>
> +#include <net/inet_connection_sock.h>
> +#include <net/sock.h>
> +
> +/* limits returned by the offload driver, zero means don't care */
> +struct tcp_ddp_limits {
> +	int	 max_ddp_sgl_len;
> +};
> +
> +enum tcp_ddp_type {
> +	TCP_DDP_NVME = 1,
> +};
> +
> +struct tcp_ddp_config {
> +	enum tcp_ddp_type    type;
> +	unsigned char        buf[];

A little kdoc may help here as its not exactly clear what is
buf used for (at this point at least)...

> +};
> +
> +struct nvme_tcp_config {

struct nvme_tcp_ddp_config

> +	struct tcp_ddp_config   cfg;
> +
> +	u16			pfv;
> +	u8			cpda;
> +	u8			dgst;
> +	int			queue_size;
> +	int			queue_id;
> +	int			io_cpu;
> +};
> +

Other than that this looks good to me.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp
  2020-09-30 16:20 ` [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
@ 2020-10-08 21:51   ` Sagi Grimberg
  2020-10-11 14:58     ` Boris Pismenny
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 21:51 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


> This commit itroduces support for CRC offload to direct data placement
               introduces
> ULP on the receive side. Both DDP and CRC share a common API to
> initialize the offload for a TCP socket. But otherwise, both can
> be executed independently.

 From the API it not clear that the offload engine does crc32c, do you
see this extended to other crc types in the future?

Other than this the patch looks good to me,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 04/10] net/tls: expose get_netdev_for_sock
  2020-09-30 16:20 ` [PATCH net-next RFC v1 04/10] net/tls: expose get_netdev_for_sock Boris Pismenny
@ 2020-10-08 21:56   ` Sagi Grimberg
  0 siblings, 0 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 21:56 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
  2020-09-30 16:20 ` [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path Boris Pismenny
@ 2020-10-08 22:19   ` Sagi Grimberg
  2020-10-19 18:28     ` Boris Pismenny
       [not found]     ` <PH0PR18MB3845430DDF572E0DD4832D06CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
  0 siblings, 2 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 22:19 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz



On 9/30/20 9:20 AM, Boris Pismenny wrote:
> This commit introduces direct data placement offload to NVME
> TCP. There is a context per queue, which is established after the
> handshake
> using the tcp_ddp_sk_add/del NDOs.
> 
> Additionally, a resynchronization routine is used to assist
> hardware recovery from TCP OOO, and continue the offload.
> Resynchronization operates as follows:
> 1. TCP OOO causes the NIC HW to stop the offload
> 2. NIC HW identifies a PDU header at some TCP sequence number,
> and asks NVMe-TCP to confirm it.
> This request is delivered from the NIC driver to NVMe-TCP by first
> finding the socket for the packet that triggered the request, and
> then fiding the nvme_tcp_queue that is used by this routine.
> Finally, the request is recorded in the nvme_tcp_queue.
> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
> it with the PDU header TCP sequence, and report the result to the
> NIC driver (tcp_ddp_resync), which will update the HW,
> and resume offload when all is successful.
> 
> Furthermore, we let the offloading driver advertise what is the max hw
> sectors/segments via tcp_ddp_limits.
> 
> A follow-up patch introduces the data-path changes required for this
> offload.
> 
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> ---
>   drivers/nvme/host/tcp.c  | 188 +++++++++++++++++++++++++++++++++++++++
>   include/linux/nvme-tcp.h |   2 +
>   2 files changed, 190 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 8f4f29f18b8c..06711ac095f2 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
>   	NVME_TCP_Q_ALLOCATED	= 0,
>   	NVME_TCP_Q_LIVE		= 1,
>   	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_OFFLOADS     = 3,
>   };
>   
>   enum nvme_tcp_recv_state {
> @@ -110,6 +111,8 @@ struct nvme_tcp_queue {
>   	void (*state_change)(struct sock *);
>   	void (*data_ready)(struct sock *);
>   	void (*write_space)(struct sock *);
> +
> +	atomic64_t  resync_req;
>   };
>   
>   struct nvme_tcp_ctrl {
> @@ -129,6 +132,8 @@ struct nvme_tcp_ctrl {
>   	struct delayed_work	connect_work;
>   	struct nvme_tcp_request async_req;
>   	u32			io_queues[HCTX_MAX_TYPES];
> +
> +	struct net_device       *offloading_netdev;
>   };
>   
>   static LIST_HEAD(nvme_tcp_ctrl_list);
> @@ -223,6 +228,159 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
>   	return nvme_tcp_pdu_data_left(req) <= len;
>   }
>   
> +#ifdef CONFIG_TCP_DDP
> +
> +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> +const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops __read_mostly = {
> +	.resync_request		= nvme_tcp_resync_request,
> +};
> +
> +static
> +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
> +			    struct nvme_tcp_config *config)
> +{
> +	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
> +	struct tcp_ddp_config *ddp_config = (struct tcp_ddp_config *)config;
> +	int ret;
> +
> +	if (unlikely(!netdev)) {

Let's remove unlikely from non datapath routines, its slightly
confusing.

> +		pr_info_ratelimited("%s: netdev not found\n", __func__);

dev_info_ratelimited with queue->ctrl->ctrl.device ?
Also, lets remove __func__. This usually is not very helpful.

> +		return -EINVAL;
> +	}
> +
> +	if (!(netdev->features & NETIF_F_HW_TCP_DDP)) {
> +		dev_put(netdev);
> +		return -EINVAL;

EINVAL or ENODEV?

> +	}
> +
> +	ret = netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev,
> +						 queue->sock->sk,
> +						 ddp_config);
> +	if (!ret)
> +		inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
> +	else
> +		dev_put(netdev);
> +	return ret;
> +}
> +
> +static
> +void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
> +{
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);

Same here.

> +		return;
> +	}
> +
> +	netdev->tcp_ddp_ops->tcp_ddp_sk_del(netdev, queue->sock->sk);
> +
> +	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;

Just a general question, why is this needed?

> +	dev_put(netdev); /* put the queue_init get_netdev_for_sock() */
> +}
> +
> +static
> +int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
> +			    struct tcp_ddp_limits *limits)
> +{
> +	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
> +	int ret = 0;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
> +		return -EINVAL;

Same here

> +	}
> +
> +	if (netdev->features & NETIF_F_HW_TCP_DDP &&
> +	    netdev->tcp_ddp_ops &&
> +	    netdev->tcp_ddp_ops->tcp_ddp_limits)
> +			ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, limits);
> +	else
> +			ret = -EOPNOTSUPP;
> +
> +	if (!ret) {
> +		queue->ctrl->offloading_netdev = netdev;
> +		pr_info("%s netdev %s offload limits: max_ddp_sgl_len %d\n",
> +			__func__, netdev->name, limits->max_ddp_sgl_len);

dev_info, and given that it per-queue, please make it dev_dbg.

> +		queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
> +		queue->ctrl->ctrl.max_hw_sectors =
> +			limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
> +	} else {
> +		queue->ctrl->offloading_netdev = NULL;

Maybe nullify in the controller setup intead?

> +	}
> +
> +	dev_put(netdev);
> +
> +	return ret;
> +}
> +
> +static
> +void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
> +			      unsigned int pdu_seq)
> +{
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	u64 resync_val;
> +	u32 resync_seq;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
> +		return;

What happens now, fallback to SW? Maybe dev_warn then..
will the SW keep seeing these responses after one failed?

> +	}
> +
> +	resync_val = atomic64_read(&queue->resync_req);
> +	if ((resync_val & TCP_DDP_RESYNC_REQ) == 0)
> +		return;
> +
> +	resync_seq = resync_val >> 32;
> +	if (before(pdu_seq, resync_seq))
> +		return;

I think it will be better to pass the skb to this func and keep the
pdu_seq contained locally.

> +
> +	if (atomic64_cmpxchg(&queue->resync_req, resync_val, (resync_val - 1)))
> +		netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk, pdu_seq);

A small comment on this manipulation may help the reader.

> +}
> +
> +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
> +{
> +	struct nvme_tcp_queue *queue = sk->sk_user_data;
> +
> +	atomic64_set(&queue->resync_req,
> +		     (((uint64_t)seq << 32) | flags));
> +
> +	return true;
> +}
> +
> +#else
> +
> +static
> +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
> +			    struct nvme_tcp_config *config)
> +{
> +	return -EINVAL;
> +}
> +
> +static
> +void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
> +{}
> +
> +static
> +int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
> +			    struct tcp_ddp_limits *limits)
> +{
> +	return -EINVAL;
> +}
> +
> +static
> +void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
> +			      unsigned int pdu_seq)
> +{}
> +
> +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
> +{
> +	return false;
> +}
> +
> +#endif
> +
>   static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
>   		unsigned int dir)
>   {
> @@ -628,6 +786,11 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
>   	int ret;
>   
> +	u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset;
> +
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> +		nvme_tcp_resync_response(queue, pdu_seq);

Here, just pass in (queue, skb)

> +
>   	ret = skb_copy_bits(skb, *offset,
>   		&pdu[queue->pdu_offset], rcv_len);
>   	if (unlikely(ret))
> @@ -1370,6 +1533,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>   {
>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>   	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
> +	struct nvme_tcp_config config;
> +	struct tcp_ddp_limits limits;
>   	int ret, rcv_pdu_size;
>   
>   	queue->ctrl = ctrl;
> @@ -1487,6 +1652,26 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>   #endif
>   	write_unlock_bh(&queue->sock->sk->sk_callback_lock);
>   
> +	if (nvme_tcp_queue_id(queue) != 0) {

	if (!nvme_tcp_admin_queue(queue)) {

> +		config.cfg.type		= TCP_DDP_NVME;
> +		config.pfv		= NVME_TCP_PFV_1_0;
> +		config.cpda		= 0;
> +		config.dgst		= queue->hdr_digest ?
> +						NVME_TCP_HDR_DIGEST_ENABLE : 0;
> +		config.dgst		|= queue->data_digest ?
> +						NVME_TCP_DATA_DIGEST_ENABLE : 0;
> +		config.queue_size	= queue->queue_size;
> +		config.queue_id		= nvme_tcp_queue_id(queue);
> +		config.io_cpu		= queue->io_cpu;

Can the config initialization move to nvme_tcp_offload_socket?

> +
> +		ret = nvme_tcp_offload_socket(queue, &config);
> +		if (!ret)
> +			set_bit(NVME_TCP_Q_OFFLOADS, &queue->flags);
> +	} else {
> +		ret = nvme_tcp_offload_limits(queue, &limits);
> +	}

I'm thinking that instead of this conditional, we want to place
nvme nvme_tcp_alloc_admin_queue in nvme_tcp_alloc_admin_queue, and
also move nvme_tcp_alloc_admin_queue to __nvme_tcp_alloc_io_queues
loop.

> +	/* offload is opportunistic - failure is non-critical */

Than make it void...

> +
>   	return 0;
>   
>   err_init_connect:
> @@ -1519,6 +1704,9 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>   	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>   	nvme_tcp_restore_sock_calls(queue);
>   	cancel_work_sync(&queue->io_work);
> +
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> +		nvme_tcp_unoffload_socket(queue);

Why not in nvme_tcp_free_queue, symmetric to the alloc?

>   }
>   
>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
> index 959e0bd9a913..65df64c34ecd 100644
> --- a/include/linux/nvme-tcp.h
> +++ b/include/linux/nvme-tcp.h
> @@ -8,6 +8,8 @@
>   #define _LINUX_NVME_TCP_H
>   
>   #include <linux/nvme.h>
> +#include <net/sock.h>
> +#include <net/tcp_ddp.h>

Why is this needed? I think we want to place this in tcp.c no?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path
  2020-09-30 16:20 ` [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path Boris Pismenny
@ 2020-10-08 22:29   ` Sagi Grimberg
  2020-10-08 23:00     ` Sagi Grimberg
  2020-11-08  9:44     ` Boris Pismenny
  0 siblings, 2 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 22:29 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


>   bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> +void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
>   const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops __read_mostly = {
> +
>   	.resync_request		= nvme_tcp_resync_request,
> +	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
>   };
>   
> +static
> +int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> +			  uint16_t command_id,
> +			  struct request *rq)
> +{
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	int ret;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);

dev_info_ratelimited

> +		return -EINVAL;
> +	}
> +
> +	ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
> +						    &req->ddp, rq);
> +	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
> +	req->offloaded = false;
> +	return ret;
> +}
> +
> +void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> +{
> +	struct request *rq = ddp_ctx;
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +
> +	if (!nvme_try_complete_req(rq, cpu_to_le16(req->status << 1), req->result))
> +		nvme_complete_rq(rq);
> +}
> +
> +static
> +int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> +		       uint16_t command_id,
> +		       struct request *rq)
> +{
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	int ret;
> +
> +	req->offloaded = false;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	req->ddp.command_id = command_id;
> +	req->ddp.sg_table.sgl = req->ddp.first_sgl;
> +	ret = sg_alloc_table_chained(&req->ddp.sg_table,
> +		blk_rq_nr_phys_segments(rq), req->ddp.sg_table.sgl,
> +		SG_CHUNK_SIZE);
> +	if (ret)
> +		return -ENOMEM;

newline here

> +	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
> +
> +	ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
> +						 queue->sock->sk,
> +						 &req->ddp);
> +	if (!ret)
> +		req->offloaded = true;
> +	return ret;
> +}
> +
>   static
>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>   			    struct nvme_tcp_config *config)
> @@ -351,6 +422,25 @@ bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>   
>   #else
>   
> +static
> +int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> +		       uint16_t command_id,
> +		       struct request *rq)
> +{
> +	return -EINVAL;
> +}
> +
> +static
> +int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> +			  uint16_t command_id,
> +			  struct request *rq)
> +{
> +	return -EINVAL;
> +}
> +
> +void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> +{}
> +
>   static
>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>   			    struct nvme_tcp_config *config)
> @@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>   		struct nvme_completion *cqe)
>   {
> +	struct nvme_tcp_request *req;
>   	struct request *rq;
>   
>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
> @@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>   		return -EINVAL;
>   	}
>   
> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> -		nvme_complete_rq(rq);
> +	req = blk_mq_rq_to_pdu(rq);
> +	if (req->offloaded) {
> +		req->status = cqe->status;
> +		req->result = cqe->result;
> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
> +	} else {
> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> +			nvme_complete_rq(rq);
> +	}
>   	queue->nr_cqe++;
>   
>   	return 0;
> @@ -836,9 +934,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   static inline void nvme_tcp_end_request(struct request *rq, u16 status)
>   {
>   	union nvme_result res = {};
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_tcp_queue *queue = req->queue;
> +	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>   
> -	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> -		nvme_complete_rq(rq);
> +	if (req->offloaded) {
> +		req->status = cpu_to_le16(status << 1);
> +		req->result = res;
> +		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
> +	} else {
> +		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> +			nvme_complete_rq(rq);
> +	}
>   }
>   
>   static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> @@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>   	bool inline_data = nvme_tcp_has_inline_data(req);
>   	u8 hdgst = nvme_tcp_hdgst_len(queue);
>   	int len = sizeof(*pdu) + hdgst - req->offset;
> +	struct request *rq = blk_mq_rq_from_pdu(req);
>   	int flags = MSG_DONTWAIT;
>   	int ret;
>   
> @@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>   	else
>   		flags |= MSG_EOR;
>   
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
> +	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
> +		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);

I'd assume that this is something we want to setup in
nvme_tcp_setup_cmd_pdu. Why do it here?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule
  2020-09-30 16:20 ` [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
@ 2020-10-08 22:44   ` Sagi Grimberg
       [not found]     ` <PH0PR18MB3845764B48FD24C87FA34304CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
  2020-11-08 14:46     ` Boris Pismenny
  0 siblings, 2 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 22:44 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


> crc offload of the nvme capsule. Check if all the skb bits
> are on, and if not recalculate the crc in SW and check it.

Can you clarify in the patch description that this is only
for pdu data digest and not header digest?

> 
> This patch reworks the receive-side crc calculation to always
> run at the end, so as to keep a single flow for both offload
> and non-offload. This change simplifies the code, but it may degrade
> performance for non-offload crc calculation.

??

 From my scan it doeesn't look like you do that.. Am I missing something?
Can you explain?

> 
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> ---
>   drivers/nvme/host/tcp.c | 66 ++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 58 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 7bd97f856677..9a620d1dacb4 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -94,6 +94,7 @@ struct nvme_tcp_queue {
>   	size_t			data_remaining;
>   	size_t			ddgst_remaining;
>   	unsigned int		nr_cqe;
> +	bool			crc_valid;
>   
>   	/* send state */
>   	struct nvme_tcp_request *request;
> @@ -233,6 +234,41 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
>   	return nvme_tcp_pdu_data_left(req) <= len;
>   }
>   
> +static inline bool nvme_tcp_device_ddgst_ok(struct nvme_tcp_queue *queue)

Maybe call it nvme_tcp_ddp_ddgst_ok?

> +{
> +	return queue->crc_valid;
> +}
> +
> +static inline void nvme_tcp_device_ddgst_update(struct nvme_tcp_queue *queue,
> +						struct sk_buff *skb)

Maybe call it nvme_tcp_ddp_ddgst_update?

> +{
> +	if (queue->crc_valid)
> +#ifdef CONFIG_TCP_DDP_CRC
> +		queue->crc_valid = skb->ddp_crc;
> +#else
> +		queue->crc_valid = false;
> +#endif
> +}
> +
> +static void nvme_tcp_crc_recalculate(struct nvme_tcp_queue *queue,
> +				     struct nvme_tcp_data_pdu *pdu)

Maybe call it nvme_tcp_ddp_ddgst_recalc?

> +{
> +	struct nvme_tcp_request *req;
> +	struct request *rq;
> +
> +	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
> +	if (!rq)
> +		return;
> +	req = blk_mq_rq_to_pdu(rq);
> +	crypto_ahash_init(queue->rcv_hash);
> +	req->ddp.sg_table.sgl = req->ddp.first_sgl;
> +	/* req->ddp.sg_table is allocated and filled in nvme_tcp_setup_ddp */
> +	ahash_request_set_crypt(queue->rcv_hash, req->ddp.sg_table.sgl, NULL,
> +				le32_to_cpu(pdu->data_length));
> +	crypto_ahash_update(queue->rcv_hash);
> +}
> +
> +
>   #ifdef CONFIG_TCP_DDP
>   
>   bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> @@ -706,6 +742,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>   	queue->pdu_offset = 0;
>   	queue->data_remaining = -1;
>   	queue->ddgst_remaining = 0;
> +	queue->crc_valid = true;
>   }
>   
>   static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> @@ -955,6 +992,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   	struct nvme_tcp_request *req;
>   	struct request *rq;
>   
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> +		nvme_tcp_device_ddgst_update(queue, skb);

Is the queue->data_digest condition missing here?

>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>   	if (!rq) {
>   		dev_err(queue->ctrl->ctrl.device,
> @@ -992,7 +1031,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   		recv_len = min_t(size_t, recv_len,
>   				iov_iter_count(&req->iter));
>   
> -		if (queue->data_digest)
> +		if (queue->data_digest && !test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>   			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
>   				&req->iter, recv_len, queue->rcv_hash);

This is the skb copy and hash, not clear why you say that you move this
to the end...

>   		else
> @@ -1012,7 +1051,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   
>   	if (!queue->data_remaining) {
>   		if (queue->data_digest) {
> -			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);

If I instead do:
			if (!test_bit(NVME_TCP_Q_OFFLOADS,
                                       &queue->flags))
				nvme_tcp_ddgst_final(queue->rcv_hash,
						     &queue->exp_ddgst);

Does that help the mess in nvme_tcp_recv_ddgst?

>   			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>   		} else {
>   			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> @@ -1033,8 +1071,11 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>   	char *ddgst = (char *)&queue->recv_ddgst;
>   	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
>   	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
> +	bool ddgst_offload_fail;
>   	int ret;
>   
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> +		nvme_tcp_device_ddgst_update(queue, skb);
>   	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
>   	if (unlikely(ret))
>   		return ret;
> @@ -1045,12 +1086,21 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>   	if (queue->ddgst_remaining)
>   		return 0;
>   
> -	if (queue->recv_ddgst != queue->exp_ddgst) {
> -		dev_err(queue->ctrl->ctrl.device,
> -			"data digest error: recv %#x expected %#x\n",
> -			le32_to_cpu(queue->recv_ddgst),
> -			le32_to_cpu(queue->exp_ddgst));
> -		return -EIO;
> +	ddgst_offload_fail = !nvme_tcp_device_ddgst_ok(queue);
> +	if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) ||
> +	    ddgst_offload_fail) {
> +		if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
> +		    ddgst_offload_fail)
> +			nvme_tcp_crc_recalculate(queue, pdu);
> +
> +		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
> +		if (queue->recv_ddgst != queue->exp_ddgst) {
> +			dev_err(queue->ctrl->ctrl.device,
> +				"data digest error: recv %#x expected %#x\n",
> +				le32_to_cpu(queue->recv_ddgst),
> +				le32_to_cpu(queue->exp_ddgst));
> +			return -EIO;

This gets convoluted here...

> +		}
>   	}
>   
>   	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events
  2020-09-30 16:20 ` [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
@ 2020-10-08 22:47   ` Sagi Grimberg
  2020-10-11  6:54     ` Or Gerlitz
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 22:47 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz



On 9/30/20 9:20 AM, Boris Pismenny wrote:
> From: Or Gerlitz <ogerlitz@mellanox.com>
> 
> For ddp setup/teardown and resync, the offloading logic
> uses HW resources at the NIC driver such as SQ and CQ.
> 
> These resources are destroyed when the netdevice does down
> and hence we must stop using them before the NIC driver
> destroyes them.
> 
> Use netdevice notifier for that matter -- offloaded connections
> are stopped before the stack continues to call the NIC driver
> close ndo.
> 
> We use the existing recovery flow which has the advantage
> of resuming the offload once the connection is re-set.
> 
> Since the recovery flow runs in a separate/dedicated WQ
> we need to wait in the notifier code for an ACK that all
> offloaded queues were stopped which means that the teardown
> queue offload ndo was called and the NIC doesn't have any
> resources related to that connection any more.
> 
> This also buys us proper handling for the UNREGISTER event
> b/c our offloading starts in the UP state, and down is always
> there between up to unregister.
> 
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> ---
>   drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
>   1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 9a620d1dacb4..7569b47f0414 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -144,6 +144,7 @@ struct nvme_tcp_ctrl {
>   
>   static LIST_HEAD(nvme_tcp_ctrl_list);
>   static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
> +static struct notifier_block nvme_tcp_netdevice_nb;
>   static struct workqueue_struct *nvme_tcp_wq;
>   static const struct blk_mq_ops nvme_tcp_mq_ops;
>   static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
> @@ -412,8 +413,6 @@ int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
>   		queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
>   		queue->ctrl->ctrl.max_hw_sectors =
>   			limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
> -	} else {
> -		queue->ctrl->offloading_netdev = NULL;

Squash this change to the patch that introduced it.

>   	}
>   
>   	dev_put(netdev);
> @@ -1992,6 +1991,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>   {
>   	int ret;
>   
> +	to_tcp_ctrl(ctrl)->offloading_netdev = NULL;
> +
>   	ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
>   	if (ret)
>   		return ret;
> @@ -2885,6 +2886,26 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
>   	return ERR_PTR(ret);
>   }
>   
> +static int nvme_tcp_netdev_event(struct notifier_block *this,
> +				 unsigned long event, void *ptr)
> +{
> +	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
> +	struct nvme_tcp_ctrl *ctrl;
> +
> +	switch (event) {
> +	case NETDEV_GOING_DOWN:
> +		mutex_lock(&nvme_tcp_ctrl_mutex);
> +		list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
> +			if (ndev != ctrl->offloading_netdev)
> +				continue;
> +			nvme_tcp_error_recovery(&ctrl->ctrl);
> +		}
> +		mutex_unlock(&nvme_tcp_ctrl_mutex);
> +		flush_workqueue(nvme_reset_wq);

Worth a small comment that this we want the err_work to complete
here. So if someone changes workqueues he may see this.

> +	}
> +	return NOTIFY_DONE;
> +}
> +
>   static struct nvmf_transport_ops nvme_tcp_transport = {
>   	.name		= "tcp",
>   	.module		= THIS_MODULE,

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path
  2020-10-08 22:29   ` Sagi Grimberg
@ 2020-10-08 23:00     ` Sagi Grimberg
  2020-11-08 13:59       ` Boris Pismenny
  2020-11-08  9:44     ` Boris Pismenny
  1 sibling, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 23:00 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


>>   static
>>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>>                   struct nvme_tcp_config *config)
>> @@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct 
>> nvme_ctrl *ctrl)
>>   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>           struct nvme_completion *cqe)
>>   {
>> +    struct nvme_tcp_request *req;
>>       struct request *rq;
>>       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
>> @@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct 
>> nvme_tcp_queue *queue,
>>           return -EINVAL;
>>       }
>> -    if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>> -        nvme_complete_rq(rq);
>> +    req = blk_mq_rq_to_pdu(rq);
>> +    if (req->offloaded) {
>> +        req->status = cqe->status;
>> +        req->result = cqe->result;
>> +        nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
>> +    } else {
>> +        if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>> +            nvme_complete_rq(rq);
>> +    }

Oh forgot to ask,

We have places in the driver that we may complete (cancel) one
or more requests from the error recovery or timeout flow. We
first prevent future incoming RX on the socket such that we
can safely cancel requests. This may break with the deferred
completion in ddp_teardown_done.

If I have a request that is waiting for ddp_teardown_done do
I have a way to tell the HW to never call ddp_teardown_done
on a specific socket?

If so the place to is in nvme_tcp_stop_queue.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst
  2020-09-30 16:20 ` [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
@ 2020-10-08 23:05   ` Sagi Grimberg
  0 siblings, 0 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-08 23:05 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

You probably want Al to have a look at this..

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads
  2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
                   ` (9 preceding siblings ...)
  2020-09-30 16:20 ` [PATCH net-next RFC v1 10/10] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
@ 2020-10-09  0:08 ` Sagi Grimberg
  10 siblings, 0 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-10-09  0:08 UTC (permalink / raw)
  To: Boris Pismenny, kuba, davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: boris.pismenny, linux-nvme, netdev



On 9/30/20 9:20 AM, Boris Pismenny wrote:
> This series adds support for nvme-tcp receive offloads
> which do not mandate the offload of the network stack to the device.
> Instead, these work together with TCP to offload:
> 1. copy from SKB to the block layer buffers
> 2. CRC verification for received PDU
> 
> The series implements these as a generic offload infrastructure for storage
> protocols, which we call TCP Direct Data Placement (TCP_DDP) and TCP DDP CRC,
> respectively. We use this infrastructure to implement NVMe-TCP offload for copy
> and CRC. Future implementations can reuse the same infrastructure for other
> protcols such as iSCSI.
> 
> Note:
> These offloads are similar in nature to the packet-based NIC TLS offloads,
> which are already upstream (see net/tls/tls_device.c).
> You can read more about TLS offload here:
> https://www.kernel.org/doc/html/latest/networking/tls-offload.html
> 
> Initialization and teardown:
> =========================================
> The offload for IO queues is initialized after the handshake of the
> NVMe-TCP protocol is finished by calling `nvme_tcp_offload_socket`
> with the tcp socket of the nvme_tcp_queue:
> This operation sets all relevant hardware contexts in
> hardware. If it fails, then the IO queue proceeds as usually with no offload.
> If it succeeds then `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may be
> called to perform copy offload, and crc offload will be used.
> This initialization does not change the normal operation of nvme-tcp in any
> way besides adding the option to call the above mentioned NDO operations.
> 
> For the admin queue, nvme-tcp does not initialize the offload.
> Instead, nvme-tcp calls the driver to configure limits for the controller,
> such as max_hw_sectors and max_segments; these must be limited to accomodate
> potential HW resource limits, and to improve performance.
> 
> If some error occured, and the IO queue must be closed or reconnected, then
> offload is teardown and initialized again. Additionally, we handle netdev
> down events via the existing error recovery flow.
> 
> Copy offload works as follows:
> =========================================
> The nvme-tcp layer calls the NIC drive to map block layer buffers to ccid using
> `nvme_tcp_setup_ddp` before sending the read request. When the repsonse is
> received, then the NIC HW will write the PDU payload directly into the
> designated buffer, and build an SKB such that it points into the destination
> buffer; this SKB represents the entire packet received on the wire, but it
> points to the block layer buffers. Once nvme-tcp attempts to copy data from
> this SKB to the block layer buffer it can skip the copy by checking in the
> copying function (memcpy_to_page):
> if (src == dst) -> skip copy
> Finally, when the PDU has been processed to completion, the nvme-tcp layer
> releases the NIC HW context be calling `nvme_tcp_teardown_ddp` which
> asynchronously unmaps the buffers from NIC HW.
> 
> As the last change is to a sensative function, we are careful to place it under
> static_key which is only enabled when this functionality is actually used for
> nvme-tcp copy offload.
> 
> Asynchronous completion:
> =========================================
> The NIC must release its mapping between command IDs and the target buffers.
> This mapping is released when NVMe-TCP calls the NIC
> driver (`nvme_tcp_offload_socket`).
> As completing IOs is performance criticial, we introduce asynchronous
> completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC, which will later
> call NVMe-TCP to complete the IO (`nvme_tcp_ddp_teardown_done`).
> 
> An alternative approach is to move all the functions related to coping from
> SKBs to the block layer buffers inside the nvme-tcp code - about 200 LOC.
> 
> CRC offload works as follows:
> =========================================
> After offload is initialized, we use the SKB's ddp_crc bit to indicate that:
> "there was no problem with the verification of all CRC fields in this packet's
> payload". The bit is set to zero if there was an error, or if HW skipped
> offload for some reason. If *any* SKB in a PDU has (ddp_crc != 1), then software
> must compute the CRC, and check it. We perform this check, and
> accompanying software fallback at the end of the processing of a received PDU.
> 
> SKB changes:
> =========================================
> The CRC offload requires an additional bit in the SKB, which is useful for
> preventing the coalescing of SKB with different crc offload values. This bit
> is similar in concept to the "decrypted" bit.
> 
> Performance:
> =========================================
> The expected performance gain from this offload varies with the block size.
> We perform a CPU cycles breakdown of the copy/CRC operations in nvme-tcp
> fio random read workloads:
> For 4K blocks we see up to 11% improvement for a 100% read fio workload,
> while for 128K blocks we see upto 52%. If we run nvme-tcp, and skip these
> operations, then we observe a gain of about 1.1x and 2x respectively.

Nice!

> 
> Resynchronization:
> =========================================
> The resynchronization flow is performed to reset the hardware tracking of
> NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
> the driver, regarding a possible location of a PDU header. Followed by
> a response from the nvme-tcp driver.
> 
> This flow is rare, and it should happen only after packet loss or
> reordering events that involve nvme-tcp PDU headers.
> 
> The patches are organized as follows:
> =========================================
> Patch 1         the iov_iter change to skip copy if (src == dst)
> Patches 2-3     the infrastructure for all TCP DDP
>                  and TCP DDP CRC offloads, respectively.
> Patch 4         exposes the get_netdev_for_sock function from TLS
> Patch 5         NVMe-TCP changes to call NIC driver on queue init/teardown
> Patches 6       NVMe-TCP changes to call NIC driver on IO operation
>                  setup/teardown, and support async completions.
> Patches 7       NVMe-TCP changes to support CRC offload on receive.
>                  Also, this patch moves CRC calculation to the end of PDU
>                  in case offload requires software fallback.
> Patches 8       NVMe-TCP handling of netdev events: stop the offload if
>                  netdev is going down
> Patches 9-10    implement support for NVMe-TCP copy and CRC offload in
>                  the mlx5 NIC driver
> 
> Testing:
> =========================================
> This series was tested using fio with various configurations of IO sizes,
> depths, MTUs, and with both the SPDK and kernel NVMe-TCP targets.
> 
> Future work:
> =========================================
> A follow-up series will introduce support for transmit side CRC. Then,
> we will work on adding support for TLS in NVMe-TCP and combining the
> two offloads.

Boris, Or and Yoray

Thanks for submitting this work. Overall this looks good to me.
The model here is not messy at all which is not trivial when it comes to 
tcp offloads.

Gave you comments in the patches themselves but overall this looks
good!

Would love to see TLS work from you moving forward.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events
  2020-10-08 22:47   ` Sagi Grimberg
@ 2020-10-11  6:54     ` Or Gerlitz
  0 siblings, 0 replies; 35+ messages in thread
From: Or Gerlitz @ 2020-10-11  6:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Yoray Zack, Boris Pismenny, Ben Ben-Ishay, boris.pismenny,
	linux-nvme, David Miller, axboe, Eric Dumazet, Alexander Viro,
	Linux Netdev List, kbusch, Jakub Kicinski, Or Gerlitz,
	Saeed Mahameed, Christoph Hellwig

On Fri, Oct 9, 2020 at 1:50 AM Sagi Grimberg <sagi@grimberg.me> wrote:

> > @@ -412,8 +413,6 @@ int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
> >               queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
> >               queue->ctrl->ctrl.max_hw_sectors =
> >                       limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
> > -     } else {
> > -             queue->ctrl->offloading_netdev = NULL;
>
> Squash this change to the patch that introduced it.

OK, will look on that and I guess it should be fine to make this as
you suggested


> > +     case NETDEV_GOING_DOWN:
> > +             mutex_lock(&nvme_tcp_ctrl_mutex);
> > +             list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
> > +                     if (ndev != ctrl->offloading_netdev)
> > +                             continue;
> > +                     nvme_tcp_error_recovery(&ctrl->ctrl);
> > +             }
> > +             mutex_unlock(&nvme_tcp_ctrl_mutex);
> > +             flush_workqueue(nvme_reset_wq);
>
> Worth a small comment that this we want the err_work to complete
> here. So if someone changes workqueues he may see this.


ack

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload
  2020-10-08 21:47   ` Sagi Grimberg
@ 2020-10-11 14:44     ` Boris Pismenny
  0 siblings, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-10-11 14:44 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, Boris Pismenny, boris.pismenny,
	linux-nvme, netdev, Or Gerlitz



On 09/10/2020 0:47, Sagi Grimberg wrote:
>> + * tcp_ddp.h
>> + *	Author:	Boris Pismenny <borisp@mellanox.com>
>> + *	Copyright (C) 2020 Mellanox Technologies.
>> + */
>> +#ifndef _TCP_DDP_H
>> +#define _TCP_DDP_H
>> +
>> +#include <linux/blkdev.h>
> Why is blkdev.h needed?

That's a lefotover from a previous iteration over this code. I'll remove it for the next patch.

>
>> +#include <linux/netdevice.h>
>> +#include <net/inet_connection_sock.h>
>> +#include <net/sock.h>
>> +
>> +/* limits returned by the offload driver, zero means don't care */
>> +struct tcp_ddp_limits {
>> +	int	 max_ddp_sgl_len;
>> +};
>> +
>> +enum tcp_ddp_type {
>> +	TCP_DDP_NVME = 1,
>> +};
>> +
>> +struct tcp_ddp_config {
>> +	enum tcp_ddp_type    type;
>> +	unsigned char        buf[];
> A little kdoc may help here as its not exactly clear what is
> buf used for (at this point at least)...

Will add.

>> +};
>> +
>> +struct nvme_tcp_config {
> struct nvme_tcp_ddp_config

Sure.

>> +	struct tcp_ddp_config   cfg;
>> +
>> +	u16			pfv;
>> +	u8			cpda;
>> +	u8			dgst;
>> +	int			queue_size;
>> +	int			queue_id;
>> +	int			io_cpu;
>> +};
>> +
> Other than that this looks good to me.
Thanks Sagi!

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp
  2020-10-08 21:51   ` Sagi Grimberg
@ 2020-10-11 14:58     ` Boris Pismenny
  0 siblings, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-10-11 14:58 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, Boris Pismenny, boris.pismenny,
	linux-nvme, netdev, Or Gerlitz

On 09/10/2020 0:51, Sagi Grimberg wrote:
>> This commit itroduces support for CRC offload to direct data placement
>                introduces
>> ULP on the receive side. Both DDP and CRC share a common API to
>> initialize the offload for a TCP socket. But otherwise, both can
>> be executed independently.
>  From the API it not clear that the offload engine does crc32c, do you
> see this extended to other crc types in the future?

Yes, it is somewhat implicit, and it depends on the tcp ddp configuration. At the moment we only do nvme-tcp, and that has only CRC32C. If in the future, there would be other CRC variants, or data digest algorithms, then the code could be easily extended.

In general, any data digest over TCP can leverage this code and SKB member. Not only other CRC types can benefit from it, but even more complex data digest algorithms like SHA can use this. In essence this bit is similar to the TLS skb->decrypted bit. In TLS, skb->decrypted also indicates that the authentication has no error, exactly like ddp_crc indicates that there is no CRC32C error. The only reason we didn't use the same bit for both is that these two protocol offloads can be combined and that will benefit from two independent bits.

>
> Other than this the patch looks good to me,
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
  2020-10-08 22:19   ` Sagi Grimberg
@ 2020-10-19 18:28     ` Boris Pismenny
       [not found]     ` <PH0PR18MB3845430DDF572E0DD4832D06CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
  1 sibling, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-10-19 18:28 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz

On 09/10/2020 1:19, Sagi Grimberg wrote:
> On 9/30/20 9:20 AM, Boris Pismenny wrote:
>> This commit introduces direct data placement offload to NVME
>> TCP. There is a context per queue, which is established after the
>> handshake
>> using the tcp_ddp_sk_add/del NDOs.
>>
>> Additionally, a resynchronization routine is used to assist
>> hardware recovery from TCP OOO, and continue the offload.
>> Resynchronization operates as follows:
>> 1. TCP OOO causes the NIC HW to stop the offload
>> 2. NIC HW identifies a PDU header at some TCP sequence number,
>> and asks NVMe-TCP to confirm it.
>> This request is delivered from the NIC driver to NVMe-TCP by first
>> finding the socket for the packet that triggered the request, and
>> then fiding the nvme_tcp_queue that is used by this routine.
>> Finally, the request is recorded in the nvme_tcp_queue.
>> 3. When NVMe-TCP observes the requested TCP sequence, it will compare
>> it with the PDU header TCP sequence, and report the result to the
>> NIC driver (tcp_ddp_resync), which will update the HW,
>> and resume offload when all is successful.
>>
>> Furthermore, we let the offloading driver advertise what is the max hw
>> sectors/segments via tcp_ddp_limits.
>>
>> A follow-up patch introduces the data-path changes required for this
>> offload.
>>
>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
>> ---
>>   drivers/nvme/host/tcp.c  | 188 +++++++++++++++++++++++++++++++++++++++
>>   include/linux/nvme-tcp.h |   2 +
>>   2 files changed, 190 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 8f4f29f18b8c..06711ac095f2 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
>>   	NVME_TCP_Q_ALLOCATED	= 0,
>>   	NVME_TCP_Q_LIVE		= 1,
>>   	NVME_TCP_Q_POLLING	= 2,
>> +	NVME_TCP_Q_OFFLOADS     = 3,
>>   };
>>   
>>   enum nvme_tcp_recv_state {
>> @@ -110,6 +111,8 @@ struct nvme_tcp_queue {
>>   	void (*state_change)(struct sock *);
>>   	void (*data_ready)(struct sock *);
>>   	void (*write_space)(struct sock *);
>> +
>> +	atomic64_t  resync_req;
>>   };
>>   
>>   struct nvme_tcp_ctrl {
>> @@ -129,6 +132,8 @@ struct nvme_tcp_ctrl {
>>   	struct delayed_work	connect_work;
>>   	struct nvme_tcp_request async_req;
>>   	u32			io_queues[HCTX_MAX_TYPES];
>> +
>> +	struct net_device       *offloading_netdev;
>>   };
>>   
>>   static LIST_HEAD(nvme_tcp_ctrl_list);
>> @@ -223,6 +228,159 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
>>   	return nvme_tcp_pdu_data_left(req) <= len;
>>   }
>>   
>> +#ifdef CONFIG_TCP_DDP
>> +
>> +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
>> +const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops __read_mostly = {
>> +	.resync_request		= nvme_tcp_resync_request,
>> +};
>> +
>> +static
>> +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>> +			    struct nvme_tcp_config *config)
>> +{
>> +	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
>> +	struct tcp_ddp_config *ddp_config = (struct tcp_ddp_config *)config;
>> +	int ret;
>> +
>> +	if (unlikely(!netdev)) {
> Let's remove unlikely from non datapath routines, its slightly
> confusing.
>
>> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
> dev_info_ratelimited with queue->ctrl->ctrl.device ?
> Also, lets remove __func__. This usually is not very helpful.
>
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!(netdev->features & NETIF_F_HW_TCP_DDP)) {
>> +		dev_put(netdev);
>> +		return -EINVAL;
> EINVAL or ENODEV?
ENODEV seems more appropriate, we'll use it. Thanks
>> +	}
>> +
>> +	ret = netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev,
>> +						 queue->sock->sk,
>> +						 ddp_config);
>> +	if (!ret)
>> +		inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops;
>> +	else
>> +		dev_put(netdev);
>> +	return ret;
>> +}
>> +
>> +static
>> +void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
>> +{
>> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
>> +
>> +	if (unlikely(!netdev)) {
>> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
> Same here.
>
>> +		return;
>> +	}
>> +
>> +	netdev->tcp_ddp_ops->tcp_ddp_sk_del(netdev, queue->sock->sk);
>> +
>> +	inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL;
> Just a general question, why is this needed?
This assignment is symmetric with the nvme_tcp_offload_socket assignment. The idea was to ensure that the functions established during offload cannot be used from this moment.

>> +	dev_put(netdev); /* put the queue_init get_netdev_for_sock() */
>> +}
>> +
>> +static
>> +int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
>> +			    struct tcp_ddp_limits *limits)
>> +{
>> +	struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true);
>> +	int ret = 0;
>> +
>> +	if (unlikely(!netdev)) {
>> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
>> +		return -EINVAL;
> Same here
>
>> +	}
>> +
>> +	if (netdev->features & NETIF_F_HW_TCP_DDP &&
>> +	    netdev->tcp_ddp_ops &&
>> +	    netdev->tcp_ddp_ops->tcp_ddp_limits)
>> +			ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, limits);
>> +	else
>> +			ret = -EOPNOTSUPP;
>> +
>> +	if (!ret) {
>> +		queue->ctrl->offloading_netdev = netdev;
>> +		pr_info("%s netdev %s offload limits: max_ddp_sgl_len %d\n",
>> +			__func__, netdev->name, limits->max_ddp_sgl_len);
> dev_info, and given that it per-queue, please make it dev_dbg.
>
>> +		queue->ctrl->ctrl.max_segments = limits->max_ddp_sgl_len;
>> +		queue->ctrl->ctrl.max_hw_sectors =
>> +			limits->max_ddp_sgl_len << (ilog2(SZ_4K) - 9);
>> +	} else {
>> +		queue->ctrl->offloading_netdev = NULL;
> Maybe nullify in the controller setup intead?
It is already set to zero after allocation (i.e., kzalloc). The goal here is to ensure it is zero in case there is a reset and it was non zero due to offload being used in the past. 

>> +	}
>> +
>> +	dev_put(netdev);
>> +
>> +	return ret;
>> +}
>> +
>> +static
>> +void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
>> +			      unsigned int pdu_seq)
>> +{
>> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
>> +	u64 resync_val;
>> +	u32 resync_seq;
>> +
>> +	if (unlikely(!netdev)) {
>> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
>> +		return;
> What happens now, fallback to SW? Maybe dev_warn then..
> will the SW keep seeing these responses after one failed?
As long as there is no resync response the system falls back to software, and this message is emitted. I'll move it to below to display it only when it is relevant and not every time this function is called.

>> +	}
>> +
>> +	resync_val = atomic64_read(&queue->resync_req);
>> +	if ((resync_val & TCP_DDP_RESYNC_REQ) == 0)
>> +		return;
>> +
>> +	resync_seq = resync_val >> 32;
>> +	if (before(pdu_seq, resync_seq))
>> +		return;
> I think it will be better to pass the skb to this func and keep the
> pdu_seq contained locally.

This requires passing the offset to obtain the sequence:
u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset;

It makes the interface a bit ugly, IMO.

>> +
>> +	if (atomic64_cmpxchg(&queue->resync_req, resync_val, (resync_val - 1)))
>> +		netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk, pdu_seq);
> A small comment on this manipulation may help the reader.
>
>> +}
>> +
>> +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>> +{
>> +	struct nvme_tcp_queue *queue = sk->sk_user_data;
>> +
>> +	atomic64_set(&queue->resync_req,
>> +		     (((uint64_t)seq << 32) | flags));
>> +
>> +	return true;
>> +}
>> +
>> +#else
>> +
>> +static
>> +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>> +			    struct nvme_tcp_config *config)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +static
>> +void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
>> +{}
>> +
>> +static
>> +int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue,
>> +			    struct tcp_ddp_limits *limits)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +static
>> +void nvme_tcp_resync_response(struct nvme_tcp_queue *queue,
>> +			      unsigned int pdu_seq)
>> +{}
>> +
>> +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>> +{
>> +	return false;
>> +}
>> +
>> +#endif
>> +
>>   static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
>>   		unsigned int dir)
>>   {
>> @@ -628,6 +786,11 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>   	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
>>   	int ret;
>>   
>> +	u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset;
>> +
>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>> +		nvme_tcp_resync_response(queue, pdu_seq);
> Here, just pass in (queue, skb)

Do you mean (queue, skb, offset)?
We need the offset for the pdu_seq calculation above.

>> +
>>   	ret = skb_copy_bits(skb, *offset,
>>   		&pdu[queue->pdu_offset], rcv_len);
>>   	if (unlikely(ret))
>> @@ -1370,6 +1533,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>>   {
>>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>   	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
>> +	struct nvme_tcp_config config;
>> +	struct tcp_ddp_limits limits;
>>   	int ret, rcv_pdu_size;
>>   
>>   	queue->ctrl = ctrl;
>> @@ -1487,6 +1652,26 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>>   #endif
>>   	write_unlock_bh(&queue->sock->sk->sk_callback_lock);
>>   
>> +	if (nvme_tcp_queue_id(queue) != 0) {
> 	if (!nvme_tcp_admin_queue(queue)) {
>
>> +		config.cfg.type		= TCP_DDP_NVME;
>> +		config.pfv		= NVME_TCP_PFV_1_0;
>> +		config.cpda		= 0;
>> +		config.dgst		= queue->hdr_digest ?
>> +						NVME_TCP_HDR_DIGEST_ENABLE : 0;
>> +		config.dgst		|= queue->data_digest ?
>> +						NVME_TCP_DATA_DIGEST_ENABLE : 0;
>> +		config.queue_size	= queue->queue_size;
>> +		config.queue_id		= nvme_tcp_queue_id(queue);
>> +		config.io_cpu		= queue->io_cpu;
> Can the config initialization move to nvme_tcp_offload_socket?

Definitely. The original idea of placing it here was that the nvme-tcp handshake may influence the parameters. But, at this moment, I realize that it is orthogonal.

>> +
>> +		ret = nvme_tcp_offload_socket(queue, &config);
>> +		if (!ret)
>> +			set_bit(NVME_TCP_Q_OFFLOADS, &queue->flags);
>> +	} else {
>> +		ret = nvme_tcp_offload_limits(queue, &limits);
>> +	}
> I'm thinking that instead of this conditional, we want to place
> nvme nvme_tcp_alloc_admin_queue in nvme_tcp_alloc_admin_queue, and
> also move nvme_tcp_alloc_admin_queue to __nvme_tcp_alloc_io_queues
> loop.

We need to do it only after the socket is connected to obtain the 5-tuple and the appropriate netdev, and preferably after the protocol handshake, as there is nothing to offload there.

>> +	/* offload is opportunistic - failure is non-critical */
> Than make it void...
>
>> +
>>   	return 0;
>>   
>>   err_init_connect:
>> @@ -1519,6 +1704,9 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>>   	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>>   	nvme_tcp_restore_sock_calls(queue);
>>   	cancel_work_sync(&queue->io_work);
>> +
>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>> +		nvme_tcp_unoffload_socket(queue);
> Why not in nvme_tcp_free_queue, symmetric to the alloc?

Sure. I've tried to keep it close to the socket disconnect, which isn't symmetrical for some reason, which suggests that these are interchangable?!

>>   }
>>   
>>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
>> diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
>> index 959e0bd9a913..65df64c34ecd 100644
>> --- a/include/linux/nvme-tcp.h
>> +++ b/include/linux/nvme-tcp.h
>> @@ -8,6 +8,8 @@
>>   #define _LINUX_NVME_TCP_H
>>   
>>   #include <linux/nvme.h>
>> +#include <net/sock.h>
>> +#include <net/tcp_ddp.h>
> Why is this needed? I think we want to place this in tcp.c no?
Not needed. It's probably leftover from previous iterations on the code. Removed for the next iteration of the patchset.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
       [not found]     ` <PH0PR18MB3845430DDF572E0DD4832D06CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
@ 2020-11-08  6:51       ` Shai Malin
  2020-11-09 23:23         ` Sagi Grimberg
  0 siblings, 1 reply; 35+ messages in thread
From: Shai Malin @ 2020-11-08  6:51 UTC (permalink / raw)
  To: linux-nvme, Sagi Grimberg, Boris Pismenny, Boris Pismenny, kuba,
	davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ariel Elior, Ben Ben-Ishay, Michal Kalderon,
	boris.pismenny, linux-nvme, netdev, Or Gerlitz


On 09/10/2020 1:19, Sagi Grimberg wrote:
> On 9/30/20 9:20 AM, Boris Pismenny wrote:
> > This commit introduces direct data placement offload to NVME TCP.
> > There is a context per queue, which is established after the 
> > handshake using the tcp_ddp_sk_add/del NDOs.
> >
> > Additionally, a resynchronization routine is used to assist hardware 
> > recovery from TCP OOO, and continue the offload.
> > Resynchronization operates as follows:
> > 1. TCP OOO causes the NIC HW to stop the offload 2. NIC HW 
> > identifies a PDU header at some TCP sequence number, and asks 
> > NVMe-TCP to
> confirm
> > it.
> > This request is delivered from the NIC driver to NVMe-TCP by first 
> > finding the socket for the packet that triggered the request, and 
> > then fiding the nvme_tcp_queue that is used by this routine.
> > Finally, the request is recorded in the nvme_tcp_queue.
> > 3. When NVMe-TCP observes the requested TCP sequence, it will 
> > compare it with the PDU header TCP sequence, and report the result 
> > to the NIC driver (tcp_ddp_resync), which will update the HW, and 
> > resume offload when all is successful.
> >
> > Furthermore, we let the offloading driver advertise what is the max 
> > hw sectors/segments via tcp_ddp_limits.
> >
> > A follow-up patch introduces the data-path changes required for this 
> > offload.
> >
> > Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> > Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> > Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> > Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> > ---
> >   drivers/nvme/host/tcp.c  | 188
> +++++++++++++++++++++++++++++++++++++++
> >   include/linux/nvme-tcp.h |   2 +
> >   2 files changed, 190 insertions(+)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
> > 8f4f29f18b8c..06711ac095f2 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
> >   	NVME_TCP_Q_ALLOCATED	= 0,
> >   	NVME_TCP_Q_LIVE		= 1,
> >   	NVME_TCP_Q_POLLING	= 2,
> > +	NVME_TCP_Q_OFFLOADS     = 3,

Sagi - following our discussion and your suggestions regarding the NVMeTCP Offload ULP module that we are working on at Marvell in which a TCP_OFFLOAD transport type would be added, we are concerned that perhaps the generic term "offload" for both the transport type (for the Marvell work) and for the DDP and CRC offload queue (for the Mellanox work) may be misleading and confusing to developers and to users. Perhaps the naming should be "direct data placement", e.g. NVME_TCP_Q_DDP or NVME_TCP_Q_DIRECT?
Also, no need to quote the entire patch. Just a few lines above your response like I did here.


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule
       [not found]       ` <PH0PR18MB38458FD325BD77983D2623D4CCEB0@PH0PR18MB3845.namprd18.prod.outlook.com>
@ 2020-11-08  6:59         ` Shai Malin
  2020-11-08  7:28           ` Boris Pismenny
  0 siblings, 1 reply; 35+ messages in thread
From: Shai Malin @ 2020-11-08  6:59 UTC (permalink / raw)
  To: linux-nvme, Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm,
	hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ariel Elior, Ben Ben-Ishay, Michal Kalderon,
	boris.pismenny, linux-nvme, netdev, Or Gerlitz


On 09/10/2020 1:44, Sagi Grimberg wrote:
> On 9/30/20 7:20 PM, Boris Pismenny wrote:
> 
> > crc offload of the nvme capsule. Check if all the skb bits are on, 
> > and if not recalculate the crc in SW and check it.
> 
> Can you clarify in the patch description that this is only for pdu 
> data digest and not header digest?
> 

Not a security expert, but according to my understanding, the NVMeTCP data digest is a layer 5 CRC,  and as such it is expected to be end-to-end, meaning it is computed by layer 5 on the transmitter and verified on layer 5 on the receiver.
Any data corruption which happens in any of the lower layers, including their software processing, should be protected by this CRC. For example, if the IP or TCP stack has a bug that corrupts the NVMeTCP payload data, the CRC should protect against it. It seems that may not be the case with this offload.


> >
> > This patch reworks the receive-side crc calculation to always run at 
> > the end, so as to keep a single flow for both offload and non-offload.
> > This change simplifies the code, but it may degrade performance for 
> > non-offload crc calculation.
> 
> ??
> 
>  From my scan it doeesn't look like you do that.. Am I missing something?
> Can you explain?
> 
> >
> > Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> > Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> > Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> > Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> > ---
> >   drivers/nvme/host/tcp.c | 66
> ++++++++++++++++++++++++++++++++++++-----
> >   1 file changed, 58 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
> > 7bd97f856677..9a620d1dacb4 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -94,6 +94,7 @@ struct nvme_tcp_queue {
> >   	size_t			data_remaining;
> >   	size_t			ddgst_remaining;
> >   	unsigned int		nr_cqe;
> > +	bool			crc_valid;

I suggest to rename it to ddgst_valid.



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule
  2020-11-08  6:59         ` Shai Malin
@ 2020-11-08  7:28           ` Boris Pismenny
  0 siblings, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-11-08  7:28 UTC (permalink / raw)
  To: Shai Malin, linux-nvme, Sagi Grimberg, Boris Pismenny, kuba,
	davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ariel Elior, Ben Ben-Ishay, Michal Kalderon,
	boris.pismenny, netdev, Or Gerlitz


On 08/11/2020 8:59, Shai Malin wrote:
> On 09/10/2020 1:44, Sagi Grimberg wrote:
>> On 9/30/20 7:20 PM, Boris Pismenny wrote:
>>
>>> crc offload of the nvme capsule. Check if all the skb bits are on, 
>>> and if not recalculate the crc in SW and check it.
>> Can you clarify in the patch description that this is only for pdu 
>> data digest and not header digest?
>>
> Not a security expert, but according to my understanding, the NVMeTCP data digest is a layer 5 CRC,  and as such it is expected to be end-to-end, meaning it is computed by layer 5 on the transmitter and verified on layer 5 on the receiver.
> Any data corruption which happens in any of the lower layers, including their software processing, should be protected by this CRC. For example, if the IP or TCP stack has a bug that corrupts the NVMeTCP payload data, the CRC should protect against it. It seems that may not be the case with this offload.

If the TCP/IP stack corrupts packet data, then likely many TCP/IP consumers will be effected, and it will be fixed promptly.
Unlike with TOE, these bugs can be easily fixed/hotpatched by the community.

>
>>> This patch reworks the receive-side crc calculation to always run at 
>>> the end, so as to keep a single flow for both offload and non-offload.
>>> This change simplifies the code, but it may degrade performance for 
>>> non-offload crc calculation.
>> ??
>>
>>  From my scan it doeesn't look like you do that.. Am I missing something?
>> Can you explain?
>>
>>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>>> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
>>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>>> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
>>> ---
>>>   drivers/nvme/host/tcp.c | 66
>> ++++++++++++++++++++++++++++++++++++-----
>>>   1 file changed, 58 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
>>> 7bd97f856677..9a620d1dacb4 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -94,6 +94,7 @@ struct nvme_tcp_queue {
>>>   	size_t			data_remaining;
>>>   	size_t			ddgst_remaining;
>>>   	unsigned int		nr_cqe;
>>> +	bool			crc_valid;
> I suggest to rename it to ddgst_valid.
>
Sure

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path
  2020-10-08 22:29   ` Sagi Grimberg
  2020-10-08 23:00     ` Sagi Grimberg
@ 2020-11-08  9:44     ` Boris Pismenny
  2020-11-09 23:18       ` Sagi Grimberg
  1 sibling, 1 reply; 35+ messages in thread
From: Boris Pismenny @ 2020-11-08  9:44 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz



On 09/10/2020 1:29, Sagi Grimberg wrote:
>
>>   static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>> @@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>   	bool inline_data = nvme_tcp_has_inline_data(req);
>>   	u8 hdgst = nvme_tcp_hdgst_len(queue);
>>   	int len = sizeof(*pdu) + hdgst - req->offset;
>> +	struct request *rq = blk_mq_rq_from_pdu(req);
>>   	int flags = MSG_DONTWAIT;
>>   	int ret;
>>   
>> @@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>   	else
>>   		flags |= MSG_EOR;
>>   
>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
>> +	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
>> +		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
> I'd assume that this is something we want to setup in
> nvme_tcp_setup_cmd_pdu. Why do it here?
Our goal in placing it here is to keep both setup and teardown in the same thread.
This enables drivers to avoid locking for per-queue operations.



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path
  2020-10-08 23:00     ` Sagi Grimberg
@ 2020-11-08 13:59       ` Boris Pismenny
  0 siblings, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-11-08 13:59 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz



On 09/10/2020 2:00, Sagi Grimberg wrote:
>>>   static
>>>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>>>                   struct nvme_tcp_config *config)
>>> @@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct 
>>> nvme_ctrl *ctrl)
>>>   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>>           struct nvme_completion *cqe)
>>>   {
>>> +    struct nvme_tcp_request *req;
>>>       struct request *rq;
>>>       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
>>> @@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct 
>>> nvme_tcp_queue *queue,
>>>           return -EINVAL;
>>>       }
>>> -    if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>>> -        nvme_complete_rq(rq);
>>> +    req = blk_mq_rq_to_pdu(rq);
>>> +    if (req->offloaded) {
>>> +        req->status = cqe->status;
>>> +        req->result = cqe->result;
>>> +        nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
>>> +    } else {
>>> +        if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>>> +            nvme_complete_rq(rq);
>>> +    }
> Oh forgot to ask,
>
> We have places in the driver that we may complete (cancel) one
> or more requests from the error recovery or timeout flow. We
> first prevent future incoming RX on the socket such that we
> can safely cancel requests. This may break with the deferred
> completion in ddp_teardown_done.
>
> If I have a request that is waiting for ddp_teardown_done do
> I have a way to tell the HW to never call ddp_teardown_done
> on a specific socket?
>
> If so the place to is in nvme_tcp_stop_queue.
Interesting and indeed, it is a problem that we haven't considered.


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule
  2020-10-08 22:44   ` Sagi Grimberg
       [not found]     ` <PH0PR18MB3845764B48FD24C87FA34304CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
@ 2020-11-08 14:46     ` Boris Pismenny
  1 sibling, 0 replies; 35+ messages in thread
From: Boris Pismenny @ 2020-11-08 14:46 UTC (permalink / raw)
  To: Sagi Grimberg, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


On 09/10/2020 1:44, Sagi Grimberg wrote:
>> crc offload of the nvme capsule. Check if all the skb bits
>> are on, and if not recalculate the crc in SW and check it.
> Can you clarify in the patch description that this is only
> for pdu data digest and not header digest?

Will do

>
>> This patch reworks the receive-side crc calculation to always
>> run at the end, so as to keep a single flow for both offload
>> and non-offload. This change simplifies the code, but it may degrade
>> performance for non-offload crc calculation.
> ??
>
>  From my scan it doeesn't look like you do that.. Am I missing something?
> Can you explain?

The performance of CRC data digest in the offload's fallback path may be less good compared to CRC calculation with skb_copy_and_hash.
To be clear, the fallback path is occurs when `queue->data_digest && test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags)`, while we receive SKBs where `skb->ddp_crc = 0`

>
>>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>>   	if (!rq) {
>>   		dev_err(queue->ctrl->ctrl.device,
>> @@ -992,7 +1031,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>   		recv_len = min_t(size_t, recv_len,
>>   				iov_iter_count(&req->iter));
>>   
>> -		if (queue->data_digest)
>> +		if (queue->data_digest && !test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>>   			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
>>   				&req->iter, recv_len, queue->rcv_hash);
> This is the skb copy and hash, not clear why you say that you move this
> to the end...

See the offload fallback path below

>
>>   		else
>> @@ -1012,7 +1051,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>   
>>   	if (!queue->data_remaining) {
>>   		if (queue->data_digest) {
>> -			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
> If I instead do:
> 			if (!test_bit(NVME_TCP_Q_OFFLOADS,
>                                        &queue->flags))
> 				nvme_tcp_ddgst_final(queue->rcv_hash,
> 						     &queue->exp_ddgst);
>
> Does that help the mess in nvme_tcp_recv_ddgst?

Not really, as the code path there takes care of the fallback path, i.e. offloaded requested, but didn't succeed.

>
>>   			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>>   		} else {
>>   			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>> @@ -1033,8 +1071,11 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>   	char *ddgst = (char *)&queue->recv_ddgst;
>>   	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
>>   	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>> +	bool ddgst_offload_fail;
>>   	int ret;
>>   
>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>> +		nvme_tcp_device_ddgst_update(queue, skb);
>>   	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
>>   	if (unlikely(ret))
>>   		return ret;
>> @@ -1045,12 +1086,21 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>   	if (queue->ddgst_remaining)
>>   		return 0;
>>   
>> -	if (queue->recv_ddgst != queue->exp_ddgst) {
>> -		dev_err(queue->ctrl->ctrl.device,
>> -			"data digest error: recv %#x expected %#x\n",
>> -			le32_to_cpu(queue->recv_ddgst),
>> -			le32_to_cpu(queue->exp_ddgst));
>> -		return -EIO;
>> +	ddgst_offload_fail = !nvme_tcp_device_ddgst_ok(queue);
>> +	if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) ||
>> +	    ddgst_offload_fail) {
>> +		if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
>> +		    ddgst_offload_fail)
>> +			nvme_tcp_crc_recalculate(queue, pdu);
>> +
>> +		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>> +		if (queue->recv_ddgst != queue->exp_ddgst) {
>> +			dev_err(queue->ctrl->ctrl.device,
>> +				"data digest error: recv %#x expected %#x\n",
>> +				le32_to_cpu(queue->recv_ddgst),
>> +				le32_to_cpu(queue->exp_ddgst));
>> +			return -EIO;
> This gets convoluted here...

Will try to simplify, the general idea is that there are 3 paths with common code:
1. non-offload
2. offload failed
3. offload success
(1) and (2) share the code for finalizing checking the data digest, while (3) skips this entirely.

In other words, how about this:
```
          offload_fail = !nvme_tcp_ddp_ddgst_ok(queue);                                               
          offload = test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags);                                   
          if (!offload || offload_fail) {                                                             
                  if (offload && offload_fail) // software-fallback                                   
                          nvme_tcp_ddp_ddgst_recalc(queue, pdu);                                      
                                                                                                      
                  nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);                           
                  if (queue->recv_ddgst != queue->exp_ddgst) {                                        
                          dev_err(queue->ctrl->ctrl.device,                                           
                                  "data digest error: recv %#x expected %#x\n",                       
                                  le32_to_cpu(queue->recv_ddgst),                                     
                                  le32_to_cpu(queue->exp_ddgst));                                     
                          return -EIO;                                                                
                  }                                                                                   
          } 
```

>
>> +		}
>>   	}
>>   
>>   	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>>


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path
  2020-11-08  9:44     ` Boris Pismenny
@ 2020-11-09 23:18       ` Sagi Grimberg
  0 siblings, 0 replies; 35+ messages in thread
From: Sagi Grimberg @ 2020-11-09 23:18 UTC (permalink / raw)
  To: Boris Pismenny, Boris Pismenny, kuba, davem, saeedm, hch, axboe,
	kbusch, viro, edumazet
  Cc: Yoray Zack, Ben Ben-Ishay, boris.pismenny, linux-nvme, netdev,
	Or Gerlitz


>>>    static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>> @@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>>    	bool inline_data = nvme_tcp_has_inline_data(req);
>>>    	u8 hdgst = nvme_tcp_hdgst_len(queue);
>>>    	int len = sizeof(*pdu) + hdgst - req->offset;
>>> +	struct request *rq = blk_mq_rq_from_pdu(req);
>>>    	int flags = MSG_DONTWAIT;
>>>    	int ret;
>>>    
>>> @@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>>    	else
>>>    		flags |= MSG_EOR;
>>>    
>>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
>>> +	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
>>> +		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
>> I'd assume that this is something we want to setup in
>> nvme_tcp_setup_cmd_pdu. Why do it here?
> Our goal in placing it here is to keep both setup and teardown in the same thread.
> This enables drivers to avoid locking for per-queue operations.

I also think that it is cleaner when setting up the PDU. Do note that if
queues match 1x1 with cpu cores then any synchronization is pretty
lightweight, and if not, we have other synchronizations anyways...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
  2020-11-08  6:51       ` Shai Malin
@ 2020-11-09 23:23         ` Sagi Grimberg
  2020-11-11  5:12           ` FW: " Shai Malin
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2020-11-09 23:23 UTC (permalink / raw)
  To: Shai Malin, linux-nvme, Boris Pismenny, Boris Pismenny, kuba,
	davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ariel Elior, Ben Ben-Ishay, Michal Kalderon,
	boris.pismenny, netdev, Or Gerlitz


>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
>>> 8f4f29f18b8c..06711ac095f2 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
>>>    	NVME_TCP_Q_ALLOCATED	= 0,
>>>    	NVME_TCP_Q_LIVE		= 1,
>>>    	NVME_TCP_Q_POLLING	= 2,
>>> +	NVME_TCP_Q_OFFLOADS     = 3,
> 
> Sagi - following our discussion and your suggestions regarding the NVMeTCP Offload ULP module that we are working on at Marvell in which a TCP_OFFLOAD transport type would be added,

We still need to see how this pans out.. it's hard to predict if this is
the best approach before seeing the code. I'd suggest to share some code
so others can share their input.

> we are concerned that perhaps the generic term "offload" for both the transport type (for the Marvell work) and for the DDP and CRC offload queue (for the Mellanox work) may be misleading and confusing to developers and to users. Perhaps the naming should be "direct data placement", e.g. NVME_TCP_Q_DDP or NVME_TCP_Q_DIRECT?

We can call this NVME_TCP_Q_DDP, no issues with that.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* FW: [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
  2020-11-09 23:23         ` Sagi Grimberg
@ 2020-11-11  5:12           ` Shai Malin
  2020-11-11  5:43             ` Shai Malin
  0 siblings, 1 reply; 35+ messages in thread
From: Shai Malin @ 2020-11-11  5:12 UTC (permalink / raw)
  To: linux-nvme, Sagi Grimberg, Boris Pismenny, Boris Pismenny, kuba,
	davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ariel Elior, Ben Ben-Ishay, Michal Kalderon,
	boris.pismenny, netdev, Or Gerlitz


On 11/10/2020 1:24 AM, Sagi Grimberg wrote: 

> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
> >>> 8f4f29f18b8c..06711ac095f2 100644
> >>> --- a/drivers/nvme/host/tcp.c
> >>> +++ b/drivers/nvme/host/tcp.c
> >>> @@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
> >>>    	NVME_TCP_Q_ALLOCATED	= 0,
> >>>    	NVME_TCP_Q_LIVE		= 1,
> >>>    	NVME_TCP_Q_POLLING	= 2,
> >>> +	NVME_TCP_Q_OFFLOADS     = 3,
> >
> > Sagi - following our discussion and your suggestions regarding the
> > NVMeTCP Offload ULP module that we are working on at Marvell in which
> > a TCP_OFFLOAD transport type would be added,
> 
> We still need to see how this pans out.. it's hard to predict if this is the best
> approach before seeing the code. I'd suggest to share some code so others
> can share their input.
> 

We plan to do this soon.

> > we are concerned that perhaps the generic term "offload" for both the
> transport type (for the Marvell work) and for the DDP and CRC offload queue
> (for the Mellanox work) may be misleading and confusing to developers and
> to users. Perhaps the naming should be "direct data placement", e.g.
> NVME_TCP_Q_DDP or NVME_TCP_Q_DIRECT?
> 
> We can call this NVME_TCP_Q_DDP, no issues with that.
> 

Great. Thanks.


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path
  2020-11-11  5:12           ` FW: " Shai Malin
@ 2020-11-11  5:43             ` Shai Malin
  0 siblings, 0 replies; 35+ messages in thread
From: Shai Malin @ 2020-11-11  5:43 UTC (permalink / raw)
  To: linux-nvme, Sagi Grimberg, Boris Pismenny, Boris Pismenny, kuba,
	davem, saeedm, hch, axboe, kbusch, viro, edumazet
  Cc: Yoray Zack, Ariel Elior, Ben Ben-Ishay, Michal Kalderon,
	boris.pismenny, netdev, Or Gerlitz


On 11/10/2020 1:24 AM, Sagi Grimberg wrote: 

> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c 
> >>> index
> >>> 8f4f29f18b8c..06711ac095f2 100644
> >>> --- a/drivers/nvme/host/tcp.c
> >>> +++ b/drivers/nvme/host/tcp.c
> >>> @@ -62,6 +62,7 @@ enum nvme_tcp_queue_flags {
> >>>    	NVME_TCP_Q_ALLOCATED	= 0,
> >>>    	NVME_TCP_Q_LIVE		= 1,
> >>>    	NVME_TCP_Q_POLLING	= 2,
> >>> +	NVME_TCP_Q_OFFLOADS     = 3,
> >
> > Sagi - following our discussion and your suggestions regarding the 
> > NVMeTCP Offload ULP module that we are working on at Marvell in 
> > which a TCP_OFFLOAD transport type would be added,
> 
> We still need to see how this pans out.. it's hard to predict if this 
> is the best approach before seeing the code. I'd suggest to share some 
> code so others can share their input.
> 

We plan to do this soon.

> > we are concerned that perhaps the generic term "offload" for both 
> > the
> transport type (for the Marvell work) and for the DDP and CRC offload 
> queue (for the Mellanox work) may be misleading and confusing to 
> developers and to users. Perhaps the naming should be "direct data placement", e.g.
> NVME_TCP_Q_DDP or NVME_TCP_Q_DIRECT?
> 
> We can call this NVME_TCP_Q_DDP, no issues with that.
> 

Great. Thanks.


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, back to index

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-30 16:20 [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Boris Pismenny
2020-09-30 16:20 ` [PATCH net-next RFC v1 01/10] iov_iter: Skip copy in memcpy_to_page if src==dst Boris Pismenny
2020-10-08 23:05   ` Sagi Grimberg
2020-09-30 16:20 ` [PATCH net-next RFC v1 02/10] net: Introduce direct data placement tcp offload Boris Pismenny
2020-10-08 21:47   ` Sagi Grimberg
2020-10-11 14:44     ` Boris Pismenny
2020-09-30 16:20 ` [PATCH net-next RFC v1 03/10] net: Introduce crc offload for tcp ddp ulp Boris Pismenny
2020-10-08 21:51   ` Sagi Grimberg
2020-10-11 14:58     ` Boris Pismenny
2020-09-30 16:20 ` [PATCH net-next RFC v1 04/10] net/tls: expose get_netdev_for_sock Boris Pismenny
2020-10-08 21:56   ` Sagi Grimberg
2020-09-30 16:20 ` [PATCH net-next RFC v1 05/10] nvme-tcp: Add DDP offload control path Boris Pismenny
2020-10-08 22:19   ` Sagi Grimberg
2020-10-19 18:28     ` Boris Pismenny
     [not found]     ` <PH0PR18MB3845430DDF572E0DD4832D06CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
2020-11-08  6:51       ` Shai Malin
2020-11-09 23:23         ` Sagi Grimberg
2020-11-11  5:12           ` FW: " Shai Malin
2020-11-11  5:43             ` Shai Malin
2020-09-30 16:20 ` [PATCH net-next RFC v1 06/10] nvme-tcp: Add DDP data-path Boris Pismenny
2020-10-08 22:29   ` Sagi Grimberg
2020-10-08 23:00     ` Sagi Grimberg
2020-11-08 13:59       ` Boris Pismenny
2020-11-08  9:44     ` Boris Pismenny
2020-11-09 23:18       ` Sagi Grimberg
2020-09-30 16:20 ` [PATCH net-next RFC v1 07/10] nvme-tcp : Recalculate crc in the end of the capsule Boris Pismenny
2020-10-08 22:44   ` Sagi Grimberg
     [not found]     ` <PH0PR18MB3845764B48FD24C87FA34304CCED0@PH0PR18MB3845.namprd18.prod.outlook.com>
     [not found]       ` <PH0PR18MB38458FD325BD77983D2623D4CCEB0@PH0PR18MB3845.namprd18.prod.outlook.com>
2020-11-08  6:59         ` Shai Malin
2020-11-08  7:28           ` Boris Pismenny
2020-11-08 14:46     ` Boris Pismenny
2020-09-30 16:20 ` [PATCH net-next RFC v1 08/10] nvme-tcp: Deal with netdevice DOWN events Boris Pismenny
2020-10-08 22:47   ` Sagi Grimberg
2020-10-11  6:54     ` Or Gerlitz
2020-09-30 16:20 ` [PATCH net-next RFC v1 09/10] net/mlx5e: Add NVMEoTCP offload Boris Pismenny
2020-09-30 16:20 ` [PATCH net-next RFC v1 10/10] net/mlx5e: NVMEoTCP, data-path for DDP offload Boris Pismenny
2020-10-09  0:08 ` [PATCH net-next RFC v1 00/10] nvme-tcp receive offloads Sagi Grimberg

Linux-NVME Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nvme/0 linux-nvme/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvme linux-nvme/ https://lore.kernel.org/linux-nvme \
		linux-nvme@lists.infradead.org
	public-inbox-index linux-nvme

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-nvme


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git