linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: Aurelien Aptel <aaptel@nvidia.com>
To: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
	edumazet@google.com, pabeni@redhat.com, saeedm@nvidia.com,
	tariqt@nvidia.com, leon@kernel.org,
	linux-nvme@lists.infradead.org, sagi@grimberg.me, hch@lst.de,
	kbusch@kernel.org, axboe@fb.com, chaitanyak@nvidia.com
Cc: smalin@nvidia.com, aaptel@nvidia.com, ogerlitz@nvidia.com,
	yorayz@nvidia.com, borisp@nvidia.com, aurelien.aptel@gmail.com,
	malin1024@gmail.com
Subject: [PATCH v7 20/23] net/mlx5e: NVMEoTCP, ddp setup and resync
Date: Tue, 25 Oct 2022 16:59:55 +0300	[thread overview]
Message-ID: <20221025135958.6242-21-aaptel@nvidia.com> (raw)
In-Reply-To: <20221025135958.6242-1-aaptel@nvidia.com>

From: Ben Ben-Ishay <benishay@nvidia.com>

NVMEoTCP offload uses buffer registration for every NVME request to perform
direct data placement. This is achieved by creating a NIC HW mapping
between the CCID (command capsule ID) to the set of buffers that compose
the request. The registration is implemented via MKEY for which we do
fast/async mapping using KLM UMR WQE.

The buffer registration takes place when the ULP calls the ddp_setup op
which is done before they send their corresponding request to the other
side (e.g nvmf target). We don't wait for the completion of the
registration before returning back to the ulp. The reason being that
the HW mapping should be in place fast enough vs the RTT it would take
for the request to be responded. If this doesn't happen, some IO may not
be ddp-offloaded, but that doesn't stop the overall offloading session.

When the offloading HW gets out of sync with the protocol session, a
hardware/software handshake takes place to resync. The ddp_resync op is the
part of the handshake where the SW confirms to the HW that a indeed they
identified correctly a PDU header at a certain TCP sequence number. This
allows the HW to resume the offload.

The 1st part of the handshake is when the HW identifies such sequence
number in an arriving packet. A special mark is made on the completion
(cqe) and then the mlx5 driver invokes the ddp resync_request callback
advertised by the ULP in the ddp context - this is in downstream patch.

Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 146 +++++++++++++++++-
 1 file changed, 144 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
index 4df761beebe6..79d0c7e9dc64 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
@@ -685,19 +685,156 @@ mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev,
 	mlx5e_nvmeotcp_put_queue(queue);
 }
 
+static bool
+mlx5e_nvmeotcp_validate_small_sgl_suffix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int i, hole_size, hole_len, chunk_size = 0;
+
+	for (i = 1; i < sg_len; i++)
+		chunk_size += sg_dma_len(&sg[i]);
+
+	if (chunk_size >= mtu)
+		return true;
+
+	hole_size = mtu - chunk_size - 1;
+	hole_len = DIV_ROUND_UP(hole_size, PAGE_SIZE);
+
+	if (sg_len + hole_len > MAX_SKB_FRAGS)
+		return false;
+
+	return true;
+}
+
+static bool
+mlx5e_nvmeotcp_validate_big_sgl_suffix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int i, j, last_elem, window_idx, window_size = MAX_SKB_FRAGS - 1;
+	int chunk_size = 0;
+
+	last_elem = sg_len - window_size;
+	window_idx = window_size;
+
+	for (j = 1; j < window_size; j++)
+		chunk_size += sg_dma_len(&sg[j]);
+
+	for (i = 1; i <= last_elem; i++, window_idx++) {
+		chunk_size += sg_dma_len(&sg[window_idx]);
+		if (chunk_size < mtu - 1)
+			return false;
+
+		chunk_size -= sg_dma_len(&sg[i]);
+	}
+
+	return true;
+}
+
+/* This function makes sure that the middle/suffix of a PDU SGL meets the
+ * restriction of MAX_SKB_FRAGS. There are two cases here:
+ * 1. sg_len < MAX_SKB_FRAGS - the extreme case here is a packet that consists
+ * of one byte from the first SG element + the rest of the SGL and the remaining
+ * space of the packet will be scattered to the WQE and will be pointed by
+ * SKB frags.
+ * 2. sg_len => MAX_SKB_FRAGS - the extreme case here is a packet that consists
+ * of one byte from middle SG element + 15 continuous SG elements + one byte
+ * from a sequential SG element or the rest of the packet.
+ */
+static bool
+mlx5e_nvmeotcp_validate_sgl_suffix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int ret;
+
+	if (sg_len < MAX_SKB_FRAGS)
+		ret = mlx5e_nvmeotcp_validate_small_sgl_suffix(sg, sg_len, mtu);
+	else
+		ret = mlx5e_nvmeotcp_validate_big_sgl_suffix(sg, sg_len, mtu);
+
+	return ret;
+}
+
+static bool
+mlx5e_nvmeotcp_validate_sgl_prefix(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int i, hole_size, hole_len, tmp_len, chunk_size = 0;
+
+	tmp_len = min_t(int, sg_len, MAX_SKB_FRAGS);
+
+	for (i = 0; i < tmp_len; i++)
+		chunk_size += sg_dma_len(&sg[i]);
+
+	if (chunk_size >= mtu)
+		return true;
+
+	hole_size = mtu - chunk_size;
+	hole_len = DIV_ROUND_UP(hole_size, PAGE_SIZE);
+
+	if (tmp_len + hole_len > MAX_SKB_FRAGS)
+		return false;
+
+	return true;
+}
+
+/* This function is responsible to ensure that a PDU could be offloaded.
+ * PDU is offloaded by building a non-linear SKB such that each SGL element is
+ * placed in frag, thus this function should ensure that all packets that
+ * represent part of the PDU won't exaggerate from MAX_SKB_FRAGS SGL.
+ * In addition NVMEoTCP offload has one PDU offload for packet restriction.
+ * Packet could start with a new PDU and then we should check that the prefix
+ * of the PDU meets the requirement or a packet can start in the middle of SG
+ * element and then we should check that the suffix of PDU meets the requirement.
+ */
+static bool
+mlx5e_nvmeotcp_validate_sgl(struct scatterlist *sg, int sg_len, int mtu)
+{
+	int max_hole_frags;
+
+	max_hole_frags = DIV_ROUND_UP(mtu, PAGE_SIZE);
+	if (sg_len + max_hole_frags <= MAX_SKB_FRAGS)
+		return true;
+
+	if (!mlx5e_nvmeotcp_validate_sgl_prefix(sg, sg_len, mtu) ||
+	    !mlx5e_nvmeotcp_validate_sgl_suffix(sg, sg_len, mtu))
+		return false;
+
+	return true;
+}
+
 static int
 mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev,
 			 struct sock *sk,
 			 struct ulp_ddp_io *ddp)
 {
+	struct scatterlist *sg = ddp->sg_table.sgl;
+	struct mlx5e_nvmeotcp_queue_entry *nvqt;
 	struct mlx5e_nvmeotcp_queue *queue;
+	struct mlx5_core_dev *mdev;
+	int i, size = 0, count = 0;
 
 	queue = container_of(ulp_ddp_get_ctx(sk),
 			     struct mlx5e_nvmeotcp_queue, ulp_ddp_ctx);
+	mdev = queue->priv->mdev;
+	count = dma_map_sg(mdev->device, ddp->sg_table.sgl, ddp->nents,
+			   DMA_FROM_DEVICE);
+
+	if (count <= 0)
+		return -EINVAL;
 
-	/* Placeholder - map_sg and initializing the count */
+	if (WARN_ON(count > mlx5e_get_max_sgl(mdev)))
+		return -ENOSPC;
+
+	if (!mlx5e_nvmeotcp_validate_sgl(sg, count, READ_ONCE(netdev->mtu)))
+		return -EOPNOTSUPP;
+
+	for (i = 0; i < count; i++)
+		size += sg_dma_len(&sg[i]);
+
+	nvqt = &queue->ccid_table[ddp->command_id];
+	nvqt->size = size;
+	nvqt->ddp = ddp;
+	nvqt->sgl = sg;
+	nvqt->ccid_gen++;
+	nvqt->sgl_length = count;
+	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, count);
 
-	mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, 0);
 	return 0;
 }
 
@@ -721,6 +858,11 @@ static void
 mlx5e_nvmeotcp_ddp_resync(struct net_device *netdev,
 			  struct sock *sk, u32 seq)
 {
+	struct mlx5e_nvmeotcp_queue *queue =
+		container_of(ulp_ddp_get_ctx(sk), struct mlx5e_nvmeotcp_queue, ulp_ddp_ctx);
+
+	queue->after_resync_cqe = 1;
+	mlx5e_nvmeotcp_rx_post_static_params_wqe(queue, seq);
 }
 
 static const struct ulp_ddp_dev_ops mlx5e_nvmeotcp_ops = {
-- 
2.31.1



  parent reply	other threads:[~2022-10-25 14:09 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-25 13:59 [PATCH v7 00/23] nvme-tcp receive offloads Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 01/23] net: Introduce direct data placement tcp offload Aurelien Aptel
2022-10-25 22:39   ` Jakub Kicinski
2022-10-26 15:01     ` Shai Malin
2022-10-26 16:24       ` Jakub Kicinski
2022-10-28 10:32         ` Shai Malin
2022-10-28 15:40           ` Jakub Kicinski
2022-10-31 18:13             ` Shai Malin
2022-10-31 23:47               ` Jakub Kicinski
2022-11-03 17:23                 ` Shai Malin
2022-11-03 17:29                   ` Aurelien Aptel
2022-11-04  1:57                     ` Jakub Kicinski
2022-11-04 13:44                       ` Aurelien Aptel
2022-11-04 16:15                         ` Jakub Kicinski
2022-10-25 13:59 ` [PATCH v7 02/23] iov_iter: DDP copy to iter/pages Aurelien Aptel
2022-10-25 16:01   ` Christoph Hellwig
2022-10-26 16:05     ` Aurelien Aptel
2022-10-25 22:40   ` Jakub Kicinski
2022-10-25 13:59 ` [PATCH v7 03/23] net/tls: export get_netdev_for_sock Aurelien Aptel
2022-10-25 16:12   ` Christoph Hellwig
2022-10-26 15:55     ` Aurelien Aptel
2022-10-30 16:06       ` Christoph Hellwig
2022-10-25 13:59 ` [PATCH v7 04/23] Revert "nvme-tcp: remove the unused queue_size member in nvme_tcp_queue" Aurelien Aptel
2022-10-25 16:14   ` Christoph Hellwig
2022-10-26 11:02     ` Sagi Grimberg
2022-10-26 11:52       ` Shai Malin
2022-10-25 13:59 ` [PATCH v7 05/23] nvme-tcp: Add DDP offload control path Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 06/23] nvme-tcp: Add DDP data-path Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 07/23] nvme-tcp: RX DDGST offload Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 08/23] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 09/23] nvme-tcp: Add modparam to control the ULP offload enablement Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 10/23] Documentation: add ULP DDP offload documentation Aurelien Aptel
2022-10-26 22:23   ` kernel test robot
2022-10-25 13:59 ` [PATCH v7 11/23] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 12/23] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 13/23] net/mlx5e: Have mdev pointer directly on the icosq structure Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 14/23] net/mlx5e: Refactor doorbell function to allow avoiding a completion Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 15/23] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 16/23] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 17/23] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 18/23] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 19/23] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2022-10-25 13:59 ` Aurelien Aptel [this message]
2022-10-25 13:59 ` [PATCH v7 21/23] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 22/23] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2022-10-25 13:59 ` [PATCH v7 23/23] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel
2022-10-25 16:00 ` [PATCH v7 00/23] nvme-tcp receive offloads Christoph Hellwig
2022-10-26  8:28   ` Or Gerlitz
2022-10-26 11:52   ` Aurelien Aptel
2022-10-27 10:35 ` Sagi Grimberg
2022-10-27 10:45   ` Shai Malin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221025135958.6242-21-aaptel@nvidia.com \
    --to=aaptel@nvidia.com \
    --cc=aurelien.aptel@gmail.com \
    --cc=axboe@fb.com \
    --cc=borisp@nvidia.com \
    --cc=chaitanyak@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=malin1024@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=ogerlitz@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=saeedm@nvidia.com \
    --cc=sagi@grimberg.me \
    --cc=smalin@nvidia.com \
    --cc=tariqt@nvidia.com \
    --cc=yorayz@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).